Are photoreceptors doing predictive coding?
Are photoreceptors doing inference?
What a bizarre proposition! The photoreceptors (rods and cones) are the light-sensing cells in the retina. Despite their relatively simple function, they have the tough task of encoding many orders of magnitude of luminance with a limited dynamic range. To do this, they adapt both their gain and their response kinetics as a function of the recent luminance. Many decades of work have gone into identifying a good functional model of the cones, and a relatively thorough biophysical model was published only recently.
Here, I’m going to explore the possibility that the cones themselves are doing inference over a generative model of luminance. What would that look like? Let’s walk through it.
Before we start, let’s look at how real cones respond to realistic input.

Inference over a Generative Model of Temporal Luminance Signals
Let’s consider generative models of the form $p(x \mid \theta)$, where $\theta$ represents the parameters of the distribution and $x$ is the luminance signal.
So, to restate the problem, if I’m a cone, I want to estimate the parameters $\theta$ from the luminance I observe. There are a number of ways we could approach this, but the posterior $p(\theta \mid x)$ will be intractable because of the normalization constant $p(x)$. Here’s an intuition for that: I’ve seen some light, but to normalize this posterior I need to know the probability of all possible luminance signals, which I can’t know (there might be some light I just can’t imagine).
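To make that concrete, here is the posterior written out with Bayes’ rule (just the standard identity, in the notation above); the normalizer $p(x)$ is the intractable piece:

$$
p(\theta \mid x) \;=\; \frac{p(x \mid \theta)\,p(\theta)}{p(x)}, \qquad p(x) \;=\; \int p(x \mid \theta')\,p(\theta')\,d\theta'.
$$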
There are two obvious ways around this, and as you’ll see below, they both lead to the same answer. The first is simply to use the derivative of the log-likelihood with respect to the parameters (a.k.a. the score function). The second is variational free energy minimization. You’ll see in a minute that under certain assumptions, these two methods are equivalent.
Free Energy Minimization
Let’s start with Free Energy minimization. This is the mathematical quantity at the heart of Karl Friston's Grand Unified Theory of life, and it’s a pretty general starting point. It would make sense to derive it somewhere else, but here I’m just going to jump in and start with the definition. The variational Free Energy is defined as:

$$
F \;=\; \mathbb{E}_{q(\theta)}\!\left[\log q(\theta) - \log p(x, \theta)\right]
$$
What is $q(\theta)$? It’s a distribution over the parameters that we just get to make up. Since we don’t know the true posterior, we’re going to approximate it. What we’d like to do is minimize the KL divergence between our approximate distribution $q(\theta)$ and the true posterior $p(\theta \mid x)$. Free Energy minimization is a way to do that without ever touching the intractable $p(x)$, and the negative Free Energy is a lower bound on the log-likelihood of the data. We can rewrite the Free Energy into two terms as follows:

$$
F \;=\; D_{\mathrm{KL}}\!\left(q(\theta)\,\|\,p(\theta)\right) \;-\; \mathbb{E}_{q(\theta)}\!\left[\log p(x \mid \theta)\right]
$$
There are many good blog posts on why this quantity is a reasonable thing to minimize, but here the key point is that there are three distributions we need to choose: $q(\theta)$, $p(x \mid \theta)$, and $p(\theta)$. Let’s start with $q(\theta)$. One of the simplest things we can choose is a delta posterior $q(\theta) = \delta(\theta - \hat{\theta})$, which is a point mass at the current estimate of the parameters $\hat{\theta}$. If we assume a delta posterior, several important simplifications occur:
- The expectation of any function collapses: $\mathbb{E}_{q(\theta)}[f(\theta)] = f(\hat{\theta})$
- The KL divergence term reduces to $-\log p(\hat{\theta})$ (up to a constant) for $q(\theta) = \delta(\theta - \hat{\theta})$
Substituting the delta posterior into the Free Energy (and dropping the constant entropy term of the delta), we get:

$$
F(\hat{\theta}) \;=\; -\log p(x \mid \hat{\theta}) \;-\; \log p(\hat{\theta}) \;+\; \mathrm{const}
$$
If we further assume a uniform (or flat) prior $p(\theta) \propto 1$, then:

$$
F(\hat{\theta}) \;=\; -\log p(x \mid \hat{\theta}) \;+\; \mathrm{const}
$$
So if we want to minimize Free Energy, we can simply step down the gradient of the negative log-likelihood with respect to the parameters, i.e., step along the score function.
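Written as an explicit update rule (with a step size $\eta$, the same symbol used in the update rules below), minimizing the Free Energy under the delta posterior and flat prior is just gradient ascent on the log-likelihood:

$$
\hat{\theta} \;\leftarrow\; \hat{\theta} - \eta\,\nabla_{\hat{\theta}} F \;=\; \hat{\theta} + \eta\,\nabla_{\hat{\theta}} \log p(x \mid \hat{\theta}).
$$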
So, why did I bother with all that if I’m just going to maximize the likelihood? Because we could have chosen different distributions and ended up somewhere else, and that’s interesting. Free Energy minimization is incredibly flexible, but it’s worth understanding that it can reduce to familiar things like maximum likelihood.
In the next section I’ll walk through a simple example with Gaussian-distributed luminance, and then I’ll move on to a more physically realistic model.
Gaussian Luminance Model
Okay, let’s say the temporal luminance signal $x_t$ is drawn from a Gaussian distribution:

$$
x_t \sim \mathcal{N}(\mu, \sigma^2)
$$

The probability density function is:

$$
p(x_t \mid \mu, \sigma) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_t - \mu)^2}{2\sigma^2}\right)
$$

and the log of that is:

$$
\log p(x_t \mid \mu, \sigma) \;=\; -\tfrac{1}{2}\log(2\pi\sigma^2) \;-\; \frac{(x_t - \mu)^2}{2\sigma^2}
$$
The score function is the gradient of the log-likelihood with respect to the parameters. For $\mu$:

$$
\frac{\partial \log p(x_t \mid \mu, \sigma)}{\partial \mu} \;=\; \frac{x_t - \mu}{\sigma^2}
$$

So, the update rule for $\mu$ is:

$$
\mu \;\leftarrow\; \mu + \eta\,\frac{x_t - \mu}{\sigma^2}
$$
Let’s repeat for $\sigma$. For numerical stability, let’s work with $\log\sigma$:

$$
\frac{\partial \log p(x_t \mid \mu, \sigma)}{\partial \log\sigma} \;=\; \frac{(x_t - \mu)^2}{\sigma^2} - 1
$$

So, the update rule for $\log\sigma$ is:

$$
\log\sigma \;\leftarrow\; \log\sigma + \eta\left(\frac{(x_t - \mu)^2}{\sigma^2} - 1\right)
$$

so to get $\sigma$ we can exponentiate both sides:

$$
\sigma \;\leftarrow\; \sigma \exp\!\left(\eta\left(\frac{(x_t - \mu)^2}{\sigma^2} - 1\right)\right)
$$
So that’s it. Let’s assume our cones are doing inference over this generative model, and that they report something like the prediction error. In this case the cone output would be the score function for $\mu$:

$$
\frac{x_t - \mu}{\sigma^2}
$$

which is a precision-weighted prediction error.
What does this look like?
todo: have a figure
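In lieu of a figure for now, here is a minimal simulation sketch in Python (my own illustration, not code from the post): a "Gaussian cone" that runs the two score-function updates online and reports the precision-weighted prediction error. The learning rate, initial values, and the step stimulus are arbitrary choices.

```python
import numpy as np

def gaussian_cone(x, eta=0.05, mu0=1.0, log_sigma0=0.0):
    """Online score-function ('cone') updates for a Gaussian luminance model.

    x is a 1-D array of luminance samples over time. Returns the per-sample
    output (the score w.r.t. mu, a precision-weighted prediction error)
    along with the running estimates of mu and sigma.
    """
    mu, log_sigma = mu0, log_sigma0
    outputs, mus, sigmas = [], [], []
    for x_t in x:
        sigma2 = np.exp(2.0 * log_sigma)
        grad_mu = (x_t - mu) / sigma2                  # score w.r.t. mu
        grad_log_sigma = (x_t - mu) ** 2 / sigma2 - 1  # score w.r.t. log sigma
        outputs.append(grad_mu)                        # what the cone "reports"
        mu += eta * grad_mu                            # gradient ascent on log-likelihood
        log_sigma += eta * grad_log_sigma
        mus.append(mu)
        sigmas.append(np.exp(log_sigma))
    return np.array(outputs), np.array(mus), np.array(sigmas)

# Toy stimulus: a step increase in mean luminance with Gaussian fluctuations.
rng = np.random.default_rng(0)
t = np.arange(2000)
mean_lum = np.where(t < 1000, 10.0, 30.0)
x = mean_lum + rng.normal(0.0, 2.0, size=t.shape)
out, mu_hat, sigma_hat = gaussian_cone(x)
```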
More realistic distributions
Light is not Gaussian. Let’s try a more physically realistic distribution. Photon counts are Poisson-distributed given a fixed light level. If light levels are fluctuating locally at all (as they always do in the real world), then the counts follow a rate-modulated Poisson process, which is well captured by a Negative Binomial distribution.
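One way to see the connection (a standard identity, stated in the same $\mu$, $r$ parameterization used in the next section): a Poisson count whose rate is itself gamma-distributed is marginally Negative Binomial,

$$
x_t \mid \lambda \sim \mathrm{Poisson}(\lambda), \qquad \lambda \sim \mathrm{Gamma}\!\left(r,\ \tfrac{r}{\mu}\right) \;\Longrightarrow\; x_t \sim \mathrm{NB}(\mu, r),
$$

where the Gamma has shape $r$ and rate $r/\mu$, so that $\mathbb{E}[x_t] = \mu$ and $\mathrm{Var}[x_t] = \mu + \mu^2/r$.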
Interestingly, my graduate student recently measured the noise in high-frame-rate event cameras and found that a Negative Binomial distribution was a good approximation to the noise in static scenes.

Negative Binomial Distribution Cones
The Negative Binomial distribution models count data with overdispersion. For our temporal luminance signal $x_t$:

$$
x_t \sim \mathrm{NB}(\mu, r)
$$

where $\mu$ is the mean and $r$ is the dispersion parameter. The variance is $\mu + \frac{\mu^2}{r}$.
The probability mass function is:

$$
p(x_t \mid \mu, r) \;=\; \frac{\Gamma(x_t + r)}{\Gamma(r)\, x_t!} \left(\frac{r}{r+\mu}\right)^{r} \left(\frac{\mu}{r+\mu}\right)^{x_t}
$$

where $\Gamma(\cdot)$ is the gamma function.
The log-likelihood is:

$$
\log p(x_t \mid \mu, r) \;=\; \log\Gamma(x_t + r) - \log\Gamma(r) - \log(x_t!) + r\log\frac{r}{r+\mu} + x_t\log\frac{\mu}{r+\mu}
$$
The score function with respect to $\mu$ is:

$$
\frac{\partial \log p(x_t \mid \mu, r)}{\partial \mu} \;=\; \frac{x_t}{\mu} \;-\; \frac{x_t + r}{r + \mu}
$$

and the score function with respect to $r$ is:

$$
\frac{\partial \log p(x_t \mid \mu, r)}{\partial r} \;=\; \psi(x_t + r) - \psi(r) + \log\frac{r}{r+\mu} + 1 - \frac{x_t + r}{r+\mu}
$$

where $\psi(\cdot)$ is the digamma function.
The gradient of the negative Free Energy with respect to $\mu$ is what our cones will spit out:

$$
\frac{x_t}{\mu} \;-\; \frac{x_t + r}{r + \mu}
$$
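As before, here is a minimal Python sketch of what this could look like (my own illustration, not from the post): an online negative-binomial "cone" that reports the score with respect to $\mu$ and nudges its estimates of $\mu$ and $r$ after each count. The log-space parameterization, learning rate, and toy stimulus are arbitrary choices; `digamma` comes from SciPy.

```python
import numpy as np
from scipy.special import digamma

def nb_cone(counts, eta=0.01, mu0=5.0, r0=2.0):
    """Online score-function ('cone') updates for a Negative Binomial model.

    counts is a 1-D array of photon counts per time bin. Returns the
    per-sample output (the score w.r.t. mu) plus running estimates of mu
    and r. Updates are done in log space to keep mu and r positive.
    """
    log_mu, log_r = np.log(mu0), np.log(r0)
    outputs, mus, rs = [], [], []
    for x_t in counts:
        mu, r = np.exp(log_mu), np.exp(log_r)
        grad_mu = x_t / mu - (x_t + r) / (r + mu)      # score w.r.t. mu
        grad_r = (digamma(x_t + r) - digamma(r)
                  + np.log(r / (r + mu)) + 1.0 - (x_t + r) / (r + mu))
        outputs.append(grad_mu)                        # what the cone "reports"
        log_mu += eta * grad_mu * mu                   # chain rule: d/dlog(mu) = mu * d/dmu
        log_r += eta * grad_r * r
        mus.append(np.exp(log_mu))
        rs.append(np.exp(log_r))
    return np.array(outputs), np.array(mus), np.array(rs)

# Toy stimulus: gamma-modulated Poisson counts (i.e. negative binomial, r = 4)
# with a step increase in the mean rate.
rng = np.random.default_rng(1)
rate = np.where(np.arange(3000) < 1500, 5.0, 20.0)
lam = rng.gamma(shape=4.0, scale=rate / 4.0)
counts = rng.poisson(lam)
out, mu_hat, r_hat = nb_cone(counts)
```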
So what does this look like?
