Massive Predictive Coding
§1 — The Standard Model (Massless)
Consider a 2-layer predictive coding network receiving sensory input $u$. Layer 1 state $x_1$ explains the input; layer 2 state $x_2$ provides a prior on $x_1$. The generative model says: the predicted input is $W_1 x_1$, and the predicted $x_1$ is $W_2 x_2$.
The prediction errors are:

$$\varepsilon_0 = u - W_1 x_1, \qquad \varepsilon_1 = x_1 - W_2 x_2.$$

The states update by gradient descent on the variational free energy $F = \tfrac{1}{2}\|\varepsilon_0\|^2 + \tfrac{1}{2}\|\varepsilon_1\|^2$. This gives first-order (massless) dynamics:

$$\tau\,\dot{x}_1 = -\frac{\partial F}{\partial x_1} = W_1^{\top}\varepsilon_0 - \varepsilon_1, \qquad \tau\,\dot{x}_2 = -\frac{\partial F}{\partial x_2} = W_2^{\top}\varepsilon_1.$$
These are first-order ODEs. Velocity is fully determined by the current state. No memory of past velocity. No overshoot. A step input produces an exponential rise — a lowpass filter.
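The lowpass behavior is easy to verify directly. Below is a minimal forward-Euler sketch of the first-order dynamics, assuming scalar states, unit weights and precisions, and illustrative values for $\tau$ and the step size (none of these choices come from the original):

```python
import numpy as np

def simulate_massless(u=1.0, tau=0.1, dt=0.01, steps=2000):
    """Forward-Euler integration of the massless (first-order) dynamics.

    Scalar states, unit weights and precisions (illustrative choices)."""
    x1, x2 = 0.0, 0.0
    trace = np.zeros(steps)
    for i in range(steps):
        eps0 = u - x1          # sensory prediction error
        eps1 = x1 - x2         # prior prediction error
        x1 += (dt / tau) * (eps0 - eps1)   # tau * dx1/dt = -dF/dx1
        x2 += (dt / tau) * eps1            # tau * dx2/dt = -dF/dx2
        trace[i] = x1
    return trace

trace = simulate_massless()
# x1 rises smoothly toward u and settles, never exceeding it:
# an exponential approach with no transient spike and no ringing.
```

The state climbs toward the input and stays there; velocity at every instant is fixed by the current error, so nothing in the trace looks like a neural onset transient.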
§2 — Adding Mass (The Massive Variant)
Now suppose each state has inertia. Physically: the membrane has capacitance that interacts with slow recovery currents, giving second-order dynamics. We replace the first-order rule with:

$$m\,\ddot{x}_i + \gamma\,\dot{x}_i = -\frac{\partial F}{\partial x_i}.$$

The right-hand side is identical — same free energy, same prediction errors, same objective. The only change is the left-hand side: we’ve added an acceleration term $m\,\ddot{x}_i$, with the first-order coefficient $\tau$ now playing the role of the damping $\gamma$. This is a choice about the optimizer, not the objective.
The damping ratio $\zeta = \gamma / (2\sqrt{mk})$, where $k$ is the local curvature of $F$, determines the character:

- $\zeta > 1$ (overdamped): sluggish, monotone approach, qualitatively like the massless model.
- $\zeta = 1$ (critically damped): fastest approach with no overshoot.
- $\zeta < 1$ (underdamped): overshoot and ringing, the regime of interest here.
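As a quick worked example: with unit weights, the curvature of $F$ with respect to $x_1$ is $\partial^2 F / \partial x_1^2 = 1 + 1 = 2$, so $\zeta$ can be computed directly. The parameter values below are illustrative, not taken from the original:

```python
import math

def damping_ratio(gamma, m, k):
    """zeta = gamma / (2 * sqrt(m * k)), with k the local curvature of F."""
    return gamma / (2.0 * math.sqrt(m * k))

# With unit weights, k = d^2F/dx1^2 = 1 + 1 = 2.
print(damping_ratio(gamma=0.1, m=0.05, k=2.0))  # ~0.16: underdamped, will ring
print(damping_ratio(gamma=2.0, m=0.05, k=2.0))  # ~3.2: overdamped, no overshoot
```

A twenty-fold change in $\gamma$ at fixed mass is enough to move between the two regimes.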
§3 — What Changes?
In the massless model, the prediction error is the thing that looks like a neuron — it has a transient response, it can be positive or negative, it signals surprise. That’s why people hunt for “prediction error neurons.”
In the massive model, the state itself overshoots, rings, and shows onset transients. The velocity — which is just the time derivative of the membrane potential, not a separate variable — spikes at onset and adapts. This looks like a real neuron without needing to be reinterpreted as an error unit.
The prediction error is still there, but it’s just a force. It doesn’t need its own neural population.
§4 — Simulation
Below: a step stimulus that switches on and later off. Compare the massless and massive responses. Adjust the mass $m$ and damping $\gamma$ to see the effect.
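If you don’t have the interactive plot handy, here is a minimal sketch of the massive system under a step, assuming scalar states, unit weights, and illustrative parameters and stimulus times (the specific numbers, including `t_on`/`t_off`, are placeholders, not values from the original):

```python
import numpy as np

def simulate_massive(m=0.05, gamma=0.1, dt=0.001, t_end=4.0,
                     t_on=0.5, t_off=2.5):
    """Semi-implicit Euler for m*x'' + gamma*x' = -dF/dx with a step input.

    Scalar states, unit weights; all parameter values are illustrative."""
    n = int(t_end / dt)
    x1 = x2 = v1 = v2 = 0.0
    xs, vs = np.zeros(n), np.zeros(n)
    for i in range(n):
        u = 1.0 if t_on <= i * dt < t_off else 0.0   # step stimulus
        eps0 = u - x1
        eps1 = x1 - x2
        # Forces are minus the free-energy gradients, identical to the
        # massless case; only the left-hand side of the ODE has changed.
        v1 += (dt / m) * ((eps0 - eps1) - gamma * v1)
        v2 += (dt / m) * (eps1 - gamma * v2)
        x1 += dt * v1
        x2 += dt * v2
        xs[i], vs[i] = x1, v1
    return xs, vs

xs, vs = simulate_massive()
# Underdamped regime: x1 overshoots the input (peak above 1), and the
# velocity v1 spikes at onset, then reverses sign as the state rings.
```

Raising $\gamma$ or lowering $m$ pushes $\zeta$ above 1, and the response collapses back to the massless-looking exponential rise.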
§5 — The Key Insight
Look at the velocity plot ($\dot{x}_1$) for the massive system. It spikes at onset, overshoots, rings, and adapts. It has a sharp transient followed by decay — classic high-pass behavior. This is what real neurons look like in response to a step.
In the massless model, $\dot{x}_1$ is just proportional to the force — it mirrors the prediction error exactly. There’s no distinction between “velocity” and “error.” So we need separate error neurons.
In the massive model, $\dot{x}_1$ has its own dynamics. It’s not slaved to the current error. The state overshoots, so velocity and error decouple: during the ringing the velocity can even run opposite in sign to the error. This decoupling is what gives the response its neural character.
Bottom line: if neurons have inertia (from slow currents, adaptation, etc.), then the “prediction error neuron” might just be a regular neuron whose transient response is the error signal — encoded in the dynamics, not in a separate population.