Massive Predictive Coding
§1 — The Standard Model (Massless)
Consider a 2-layer predictive coding network receiving sensory input $u$. Layer 1 state $x_1$ explains the input; layer 2 state $x_2$ provides a prior on $x_1$. The generative model says: the predicted input is $W_1 x_1$, and the predicted $x_1$ is $W_2 x_2$.
The prediction errors are:

$$\varepsilon_0 = u - W_1 x_1, \qquad \varepsilon_1 = x_1 - W_2 x_2.$$

The states update by gradient descent on the variational free energy $F = \tfrac{1}{2}\|\varepsilon_0\|^2 + \tfrac{1}{2}\|\varepsilon_1\|^2$. This gives first-order (massless) dynamics:

$$\tau\,\dot{x}_1 = -\frac{\partial F}{\partial x_1} = W_1^{\top}\varepsilon_0 - \varepsilon_1, \qquad \tau\,\dot{x}_2 = -\frac{\partial F}{\partial x_2} = W_2^{\top}\varepsilon_1.$$
These are first-order ODEs. Velocity is fully determined by the current state. No memory of past velocity. No overshoot. A step input produces an exponential rise — a lowpass filter.
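The lowpass behavior is easy to verify directly. Below is a minimal forward-Euler sketch of the first-order dynamics, assuming scalar states, unit weights and precisions, and illustrative values for $\tau$ and the step size (none of these choices come from the original):

```python
import numpy as np

def simulate_massless(u=1.0, tau=0.1, dt=0.01, steps=2000):
    """Forward-Euler integration of the massless (first-order) dynamics.

    Scalar states, unit weights and precisions (illustrative choices)."""
    x1, x2 = 0.0, 0.0
    trace = np.zeros(steps)
    for i in range(steps):
        eps0 = u - x1          # sensory prediction error
        eps1 = x1 - x2         # prior prediction error
        x1 += (dt / tau) * (eps0 - eps1)   # tau * dx1/dt = -dF/dx1
        x2 += (dt / tau) * eps1            # tau * dx2/dt = -dF/dx2
        trace[i] = x1
    return trace

trace = simulate_massless()
# x1 rises smoothly toward u and settles, never exceeding it:
# an exponential approach with no transient spike and no ringing.
```

The state climbs toward the input and stays there; velocity at every instant is fixed by the current error, so nothing in the trace looks like a neural onset transient.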
§2 — Adding Mass (The Massive Variant)
Now suppose each state has inertia. Physically: the membrane has capacitance that interacts with slow recovery currents, giving second-order dynamics. We replace the first-order rule with:

$$m\,\ddot{x}_i + \gamma\,\dot{x}_i = -\frac{\partial F}{\partial x_i}.$$

The right-hand side is identical — same free energy, same prediction errors, same objective. The only change is the left-hand side: we’ve added an acceleration term $m\,\ddot{x}_i$, with the first-order coefficient $\tau$ now playing the role of the damping $\gamma$. This is a choice about the optimizer, not the objective.
The damping ratio $\zeta = \gamma / (2\sqrt{mk})$, where $k$ is the local curvature of $F$, determines the character:

- $\zeta > 1$ (overdamped): sluggish, monotone approach, qualitatively like the massless model.
- $\zeta = 1$ (critically damped): fastest approach with no overshoot.
- $\zeta < 1$ (underdamped): overshoot and ringing, the regime of interest here.
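As a quick worked example: with unit weights, the curvature of $F$ with respect to $x_1$ is $\partial^2 F / \partial x_1^2 = 1 + 1 = 2$, so $\zeta$ can be computed directly. The parameter values below are illustrative, not taken from the original:

```python
import math

def damping_ratio(gamma, m, k):
    """zeta = gamma / (2 * sqrt(m * k)), with k the local curvature of F."""
    return gamma / (2.0 * math.sqrt(m * k))

# With unit weights, k = d^2F/dx1^2 = 1 + 1 = 2.
print(damping_ratio(gamma=0.1, m=0.05, k=2.0))  # ~0.16: underdamped, will ring
print(damping_ratio(gamma=2.0, m=0.05, k=2.0))  # ~3.2: overdamped, no overshoot
```

A twenty-fold change in $\gamma$ at fixed mass is enough to move between the two regimes.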
§3 — What Changes?
In the massless model, the prediction error is the thing that looks like a neuron — it has a transient response, it can be positive or negative, it signals surprise. That’s why people hunt for “prediction error neurons.”
In the massive model, the state itself overshoots, rings, and shows onset transients. The velocity — which is just the time derivative of the membrane potential, not a separate variable — spikes at onset and adapts. This looks like a real neuron without needing to be reinterpreted as an error unit.
The prediction error is still there, but it’s just a force. It doesn’t need its own neural population.
§4 — Simulation
Below: a step stimulus that switches on and later off. Compare the massless and massive responses. Adjust the mass $m$ and damping $\gamma$ to see the effect.
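If you don’t have the interactive plot handy, here is a minimal sketch of the massive system under a step, assuming scalar states, unit weights, and illustrative parameters and stimulus times (the specific numbers, including `t_on`/`t_off`, are placeholders, not values from the original):

```python
import numpy as np

def simulate_massive(m=0.05, gamma=0.1, dt=0.001, t_end=4.0,
                     t_on=0.5, t_off=2.5):
    """Semi-implicit Euler for m*x'' + gamma*x' = -dF/dx with a step input.

    Scalar states, unit weights; all parameter values are illustrative."""
    n = int(t_end / dt)
    x1 = x2 = v1 = v2 = 0.0
    xs, vs = np.zeros(n), np.zeros(n)
    for i in range(n):
        u = 1.0 if t_on <= i * dt < t_off else 0.0   # step stimulus
        eps0 = u - x1
        eps1 = x1 - x2
        # Forces are minus the free-energy gradients, identical to the
        # massless case; only the left-hand side of the ODE has changed.
        v1 += (dt / m) * ((eps0 - eps1) - gamma * v1)
        v2 += (dt / m) * (eps1 - gamma * v2)
        x1 += dt * v1
        x2 += dt * v2
        xs[i], vs[i] = x1, v1
    return xs, vs

xs, vs = simulate_massive()
# Underdamped regime: x1 overshoots the input (peak above 1), and the
# velocity v1 spikes at onset, then reverses sign as the state rings.
```

Raising $\gamma$ or lowering $m$ pushes $\zeta$ above 1, and the response collapses back to the massless-looking exponential rise.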
§5 — The Key Insight
Look at the velocity plot ($\dot{x}_1$) for the massive system. It spikes at onset, overshoots, rings, and adapts. It has a sharp transient followed by decay — classic high-pass behavior. This is what real neurons look like in response to a step.
In the massless model, $\dot{x}_1$ is just proportional to the force — it mirrors the prediction error exactly. There’s no distinction between “velocity” and “error.” So we need separate error neurons.
In the massive model, $\dot{x}_1$ has its own dynamics. It’s not slaved to the current error. The state overshoots, so velocity and error decouple: during the ringing the velocity can even run opposite in sign to the error. This decoupling is what gives the response its neural character.
Bottom line: if neurons have inertia (from slow currents, adaptation, etc.), then the “prediction error neuron” might just be a regular neuron whose transient response is the error signal — encoded in the dynamics, not in a separate population.