How I learned to stop worrying and love the Free Energy Principle
In my lab, we’ve been thinking a lot about how the brain learns a model of the world and how it uses it to act intelligently. This seems to be the crux of intelligence.
How do brains do this? If we understood this completely, we could build a brain. A brain like ours. That would be pretty cool. At that point, I’d have to close up shop and move on to some other research direction. Interestingly, out of all the theories in neuroscience, very few could ever be used to actually build a brain – and most don’t even pretend they could! However, there is one that at least claims to. It’s called the Free Energy Principle. If you haven’t heard of it, you should watch this nice animated youtube short that introduces it quite well.
The Free Energy Principle (FEP) is a so-called “general theory” and a lot of neuroscientists I know don’t like it. There are two types of criticisms I’ve heard: Either people can’t make any sense of it, or it’s “unfalsifiable” and they aren’t sure if it is any more useful than Bayesian inference or some more familiar theory.
This blog post is meant to be a gentle introduction to the math behind the FEP. We’re going to start really simple and find that it stays pretty simple, but it gives us nice mathematical quantities for talking about concepts we are interested in. In the same way that Bayesian inference has been a powerful tool for formalizing certain perceptual phenomena, FEP provides a few additional quantities that are useful for talking about what brains do. Moreover, we’re going to find that one quantity, the KL Divergence, is at the heart of it all. For a related (and genuinely more thoughtful) post, check out Hadi’s blog here.
Note: The title here is a reference to a dark comedy from the height of the cold war called “Dr. Strangelove Or: how I learned to stop worrying and love the bomb”. I haven’t seen the movie since I was a kid, and I don’t remember anything except the line “not only is it possible, it is essential” — and that’s how I now feel about the KL Divergence.

Left is the original movie poster. Right is the new and modified Dr. Free Energy, or: how I learned to stop worrying and love the KL Divergence. Karl Friston on top, and the "cybernetics seance" from the Macy conference (with Norbert Wiener, John von Neumann, Walter Pitts, Margaret Mead and others) on the bottom.
The problem
Let’s start out with some solid ground: there IS a world out there. And the brain just gets samples from it through the senses. I’m going to massively over-simplify the problem in this blog post (don’t worry, it still works even if we don’t do this), but let’s pretend the brain is passive and only receives sensory samples, $x \sim p^*(x)$.
$p^*(x)$ is the true generating process of the world. The brain does not have $p^*(x)$ and so it cannot evaluate the density even though it can get samples from it. For now, we’re going to assume samples are i.i.d. (ignoring actions/dynamics). That is a very incorrect assumption for brains, but it makes the math easier, so we’re going to stick with it for now to gain some intuitions.

The world generates data. The brain can only sample this data and must adjust its own internal model to match.
The brain can’t evaluate $p^*(x)$, because that’s literally the physics of the world, but it can evaluate its own model of it for any $x$. Let’s say the brain’s model has parameters, $\theta$, so we write it as $p_\theta(x)$. How do you fit a good $p_\theta(x)$ to $p^*(x)$ given only samples? That’s what we’re going to solve here.
What follows is a simple walkthrough in notation that I like, but is unusual in the active inference literature. I think you’ll find that the math lends itself quite nicely to talking about what the brain is doing.
Setup
This section lays out all the math facts that you need for all the derivations to come.
1) Log-likelihood
Suppose we have the brain’s model, $p_\theta(x)$, and i.i.d. samples from the world, $x_1, \dots, x_N \sim p^*(x)$. The likelihood is just the probability density evaluated at the observed data combined across all samples:
$$\mathcal{L}(\theta) = \prod_{i=1}^{N} p_\theta(x_i)$$
Given fixed data, the likelihood is a function of the parameters $\theta$. The reason we can multiply them all together is the i.i.d. assumption I made above. Because multiplying many small numbers quickly becomes impractical, we usually work with the log-likelihood, here the average log-likelihood or per-sample log-likelihood:
$$\ell(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$
Intuition
Think of the log-likelihood as a surprise meter. If your model assigns high probability to what actually happened, the log-likelihood is high; if it assigns low probability, the log-likelihood is very negative. Maximizing the log-likelihood is just trying to reduce your surprise across many observations.
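To make the surprise meter concrete, here is a minimal numerical sketch (my own, not from the interactive demo later in the post; it assumes NumPy and SciPy and uses made-up numbers). Two candidate models score the same samples, and the one closer to the world earns the higher average log-likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy "world": the brain only ever sees these samples, never the equation behind them.
x = rng.normal(loc=2.0, scale=1.0, size=1000)

# Two candidate brain models p_theta(x): one close to the world, one far off.
good_model = norm(loc=2.1, scale=1.0)
bad_model = norm(loc=-3.0, scale=1.0)

# Average log-likelihood = average "negative surprise" per observation.
print(good_model.logpdf(x).mean())  # assigns high probability to the data -> higher value
print(bad_model.logpdf(x).mean())   # assigns low probability to the data -> very negative
```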
2) Expectations and the Law of Large Numbers
An Expectation is the average quantity you would get from many draws under the data-generating law, $p^*(x)$. The expectation of a function $f(x)$ is
$$\mathbb{E}_{p^*(x)}[f(x)] = \int p^*(x)\, f(x)\, dx$$
If $x_1, \dots, x_N \sim p^*(x)$ i.i.d., then by the law of large numbers
$$\frac{1}{N} \sum_{i=1}^{N} f(x_i) \;\xrightarrow{N \to \infty}\; \mathbb{E}_{p^*(x)}[f(x)]$$
If we apply this to our log-likelihood above, we get:
$$\ell(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i) \;\xrightarrow{N \to \infty}\; \mathbb{E}_{p^*(x)}[\log p_\theta(x)]$$
Intuition: averaging samples from $p^*(x)$ weights values by how often they occur. Points with larger $p^*(x)$ show up more and pull the average toward them, so the average log-likelihood of the samples is a weighted average (weighted by $p^*(x)$) and converges to the expectation with enough samples.
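Here is a small sketch of that convergence, again assuming NumPy/SciPy and arbitrary example distributions of my choosing: the sample average of the log-likelihood approaches the exact expectation under the world’s distribution as $N$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Stand-ins (made-up numbers): the world p*(x) and the brain's model p_theta(x).
mu_w, s_w = 0.0, 1.0     # world: N(0, 1)
mu_m, s_m = 0.5, 1.5     # model: N(0.5, 1.5)
model = norm(mu_m, s_m)

# Closed-form expectation E_{p*}[log p_theta(x)] when both densities are Gaussian.
exact = -np.log(s_m * np.sqrt(2 * np.pi)) - (s_w**2 + (mu_w - mu_m)**2) / (2 * s_m**2)

# The sample average approaches that expectation as N grows (law of large numbers).
for n in [10, 1_000, 100_000]:
    x = rng.normal(mu_w, s_w, size=n)
    print(n, model.logpdf(x).mean(), "vs exact", exact)
```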
3) KL divergence
The Kullback–Leibler (KL) divergence is our main quantity of interest. It measures the mismatch between two distributions — in this case, the true world distribution $p^*(x)$ and the brain’s model $p_\theta(x)$. Its definition is:
$$D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right) = \mathbb{E}_{p^*(x)}\!\left[\log \frac{p^*(x)}{p_\theta(x)}\right] = \int p^*(x) \log \frac{p^*(x)}{p_\theta(x)}\, dx$$
Intuition: KL as an expected log-likelihood ratio
Let’s go back to basic statistics. For a single sample $x_i$, we know how to compare two models, $p^*(x)$ and $p_\theta(x)$: we use the log-likelihood ratio, $\log \frac{p^*(x_i)}{p_\theta(x_i)}$. If we had many samples, we would average the log-likelihood ratio across all samples. With a large enough number of samples, the average converges to the expectation (that’s just point 2 above).
Now, what happens if we’re sampling from $p^*(x)$ and then evaluating the log-likelihood ratio compared to $p_\theta(x)$? Well, then we get the KL divergence! It’s an expected log-likelihood ratio when you’re sampling from the first density. It’s bounded below at $0$. It has to be non-negative. And what it tells us is how much extra “surprise” we get when we use $p_\theta(x)$ (the wrong model) to explain data generated by $p^*(x)$.
For our problem, if the brain’s internal model matches the world perfectly, the KL is 0 — no extra surprise. But the more the brain’s predictions diverge from the world’s samples, the larger the KL becomes. In other words, KL measures the cost of pretending the brain’s model generated the data when in fact it came from the world. This is exactly what we want to minimize. The brain wants to minimize the KL divergence between its model and the world.
Properties worth knowing
- $D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right) \geq 0$, and it’s 0 iff $p^*(x) = p_\theta(x)$.
- It’s asymmetric: $D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right) \neq D_{\mathrm{KL}}\!\left(p_\theta(x) \,\|\, p^*(x)\right)$ in general.
- Connection to cross-entropy:
$$D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right) = H\!\left(p^*, p_\theta\right) - H\!\left(p^*\right)$$
where $H(p^*, p_\theta) = -\mathbb{E}_{p^*(x)}[\log p_\theta(x)]$ is the cross-entropy.
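These properties are easy to check numerically. Below is a small sketch with two made-up discrete distributions standing in for the world and the brain’s model; the `kl` helper is just for illustration.

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return np.sum(p * np.log(p / q))

# Two made-up discrete distributions over 3 outcomes.
p_world = np.array([0.7, 0.2, 0.1])   # stand-in for p*(x)
p_model = np.array([0.4, 0.4, 0.2])   # stand-in for p_theta(x)

print(kl(p_world, p_model) >= 0)                    # non-negativity
print(kl(p_world, p_model), kl(p_model, p_world))   # asymmetry: the two values differ

cross_entropy = -np.sum(p_world * np.log(p_model))
entropy = -np.sum(p_world * np.log(p_world))
print(np.isclose(kl(p_world, p_model), cross_entropy - entropy))  # KL = H(p*, p_theta) - H(p*)
```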
Maximum likelihood is KL minimization
Log-likelihood is a function of the model parameters $\theta$, and all it tells you is the log probability of each sample. If we maximize the log-likelihood (or equivalently, minimize the negative log-likelihood), what happens?
Below is a simple example with a 2D Gaussian. The samples, $x_i \sim p^*(x)$, are shown in red. Our model is a 2D Gaussian, $p_\theta(x) = \mathcal{N}(x; \mu, \Sigma)$, with parameters $\theta = (\mu, \Sigma)$. The log-likelihood is the sum of the log probabilities of each sample. You can hit “Run Optimization” to see what happens if we simply step along the gradient of the log-likelihood.
Interactive demo: “Data Points & Model Distribution” and “Average Log-Likelihood vs Steps”.
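If you’d rather see the idea in code, here is a rough stand-in for the demo (my own sketch, not its actual implementation): gradient ascent on the average log-likelihood of a 2D Gaussian, simplified to a diagonal covariance so the gradients can be written by hand, with made-up data.

```python
import numpy as np

rng = np.random.default_rng(2)

# "World": a 2D Gaussian the brain never sees directly, only through samples.
x = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[1.0, 0.0], [0.0, 0.5]], size=500)

# Brain's model: diagonal 2D Gaussian with parameters theta = (mu, log_sigma).
mu = np.zeros(2)
log_sigma = np.zeros(2)
lr = 0.1

for step in range(200):
    sigma2 = np.exp(2 * log_sigma)
    # Gradients of the average log-likelihood (derived by hand for a diagonal Gaussian).
    grad_mu = ((x - mu) / sigma2).mean(axis=0)
    grad_log_sigma = (((x - mu) ** 2) / sigma2 - 1.0).mean(axis=0)
    mu += lr * grad_mu                   # step uphill on the log-likelihood
    log_sigma += lr * grad_log_sigma

print("fitted mean:", mu)                # drifts toward the data mean
print("fitted std:", np.exp(log_sigma))  # drifts toward the data standard deviations
```

The fitted parameters drift toward the data’s statistics, which is the same gradient-stepping behavior the demo describes.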
Well, it turns out that minimizing the negative log-likelihood is the same as minimizing the KL divergence between $p^*(x)$ and $p_\theta(x)$. I got this intuition from Alex Alemi’s blog post. It actually took me a really long time to wrap my head around his explanation, in part because the notation he uses is different from typical variational inference notation. So this is my attempt to distill the main point in my own notation.
What is happening when we maximize likelihood? Or, equivalently, when we minimize negative log-likelihood?
Well, with enough samples, we can use point 2 above: the average negative log-likelihood converges to the expected negative log-likelihood (NLL) under the true world distribution:
$$-\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i) \;\xrightarrow{N \to \infty}\; -\mathbb{E}_{p^*(x)}[\log p_\theta(x)]$$
Make it equal to itself and use the trick of multiplying and dividing by the same thing, then group and rearrange terms:
$$-\mathbb{E}_{p^*(x)}[\log p_\theta(x)] = -\mathbb{E}_{p^*(x)}\!\left[\log\!\left(p_\theta(x)\,\frac{p^*(x)}{p^*(x)}\right)\right] = -\mathbb{E}_{p^*(x)}[\log p^*(x)] + \mathbb{E}_{p^*(x)}\!\left[\log \frac{p^*(x)}{p_\theta(x)}\right]$$
So the key identity is:
$$\mathbb{E}_{p^*(x)}\!\left[-\log p_\theta(x)\right] = \underbrace{H(p^*)}_{\text{entropy of the world}} + \underbrace{D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right)}_{\text{model mismatch}}$$
The first term (entropy of the world) does not depend on $\theta$. So minimizing expected NLL is the same as minimizing the KL divergence between $p^*(x)$ and $p_\theta(x)$! That is kind of satisfying: Maximizing likelihood makes the brain’s model assign high probability to what the world actually produces. Equivalently, it tunes $p_\theta(x)$ to get as close as possible to $p^*(x)$.
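As a quick sanity check, here is a worked example I’m adding where everything is available in closed form: let the world be $p^*(x) = \mathcal{N}(x; \mu^*, 1)$ and the brain’s model be $p_\theta(x) = \mathcal{N}(x; \mu, 1)$ with $\theta = \mu$. Then
$$\mathbb{E}_{p^*}\!\left[-\log p_\theta(x)\right] = \tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}\,\mathbb{E}_{p^*}\!\left[(x-\mu)^2\right] = \underbrace{\tfrac{1}{2}\log(2\pi e)}_{H(p^*)} + \underbrace{\tfrac{1}{2}(\mu - \mu^*)^2}_{D_{\mathrm{KL}}\left(p^* \,\|\, p_\theta\right)}$$
Both sides are minimized at $\mu = \mu^*$: the entropy term is fixed by the world, and all the action is in the KL.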

Hidden causes: the brain’s internal variables
So far we treated the brain’s model as a direct mapping from observations to probabilities, $p_\theta(x)$. That’s too simple, because the brain needs to be able to flexibly encode the state of the world in terms of hidden causes. Let’s call these hidden causes $z$. Importantly, these are not the “real” physical causes; they’re useful internal variables the brain uses to understand the world.

The world generates data. The brain wants to learn the hidden causes of the data. Again, it can only sample this data and must both adjust its internal model and infer the latent causes. Importantly, all these causes are in the brain, NOT the world.
The brain’s model says: first sample a hidden cause $z$ from a prior $p_\theta(z)$, then generate an observation $x$ from a likelihood $p_\theta(x \mid z)$:
$$p_\theta(x, z) = p_\theta(z)\, p_\theta(x \mid z), \qquad p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$$
Here we have the static parameters of the brain, $\theta$, which could map onto the weights of a neural network (or the synapses in a brain). And we have the latent variables $z$, the internal variables the brain uses to understand the world (which could map onto the activations or spikes of neurons).
Two things we want to do with this model:
- Learning (adapt to the world’s statistics): make $p_\theta(x)$ match $p^*(x)$ as well as possible using samples $x_i \sim p^*(x)$.
- Inference (adapt to the sampled data): given an observation $x$, infer its hidden causes via the posterior
$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)} = \frac{p_\theta(x \mid z)\, p_\theta(z)}{\int p_\theta(x \mid z')\, p_\theta(z')\, dz'}$$
This is just Bayes’ rule, and it maps nicely onto the words we use to describe perception.
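To make the inference step concrete, here is a tiny sketch (assuming SciPy, with a made-up two-cause generative model of my own): when the latent space is small enough to enumerate, the exact posterior over hidden causes follows directly from Bayes’ rule.

```python
import numpy as np
from scipy.stats import norm

# A made-up toy generative model for the brain (not from the post):
# hidden cause z in {0, 1} with prior p_theta(z), and a Gaussian likelihood p_theta(x | z).
prior = np.array([0.5, 0.5])
likelihoods = [norm(loc=-2.0, scale=1.0), norm(loc=+2.0, scale=1.0)]

def posterior(x):
    """Exact Bayes rule: p_theta(z | x) = p_theta(x | z) p_theta(z) / p_theta(x)."""
    joint = np.array([lik.pdf(x) * pz for lik, pz in zip(likelihoods, prior)])
    evidence = joint.sum()            # p_theta(x), the normalizer
    return joint / evidence

print(posterior(1.5))   # observation near +2 -> most of the belief lands on z = 1
print(posterior(-0.3))  # ambiguous observation -> the belief is more split
```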
I want to spend another moment to discuss this issue of isomorphism between the brain’s hidden causes $z$ and the true causes in the world. The way Bayesian inference is typically introduced in perception is that the brain is inferring the causes in “the world” from “the senses”.

But that’s NOT what is happening here. The $z$ are in the brain. They are NOT in the world. To learn to act intelligently in the world, $z$ likely needs high mutual information with the relevant ground-truth causes in the world – the intuitive physics level, but no more. The key point is to remember that $z$ are not the “real” causes in the world (or your experiment). They are the brain’s internal representation of the world.

Now, even though $z$ are just causes that the brain made up, that denominator $p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$ is still usually intractable, which makes the exact posterior $p_\theta(z \mid x)$ intractable too.
We’re going to get around this by inventing a density we can evaluate, $q_\phi(z \mid x)$, and optimizing its parameters $\phi$ instead. Is this even reasonable? In the next section we’ll see that it is, and why it works is quite satisfying.

The world still generates data. The brain wants to learn the hidden causes of the data. Again, it can only sample this data and must both adjust its internal model and infer the latent causes. Now it uses a variational approximation to the true posterior.
Deriving the Evidence Lower Bound (a.k.a. Free Energy)
To get around the intractable posterior, we just invent a density we can evaluate, $q_\phi(z \mid x)$. This is the backbone of variational inference, and what we’re going to do here is derive a quantity known in machine learning as the Evidence Lower Bound (ELBO). The ELBO is typically derived using something called Jensen’s inequality, so I’ll show that first, but then we’ll do it without it to see what we were missing.
With Jensen’s inequality (lower bound)
Jensen’s inequality says that the average of a logarithm is always less than (or equal to) the logarithm of the average:
$$\mathbb{E}[\log X] \leq \log \mathbb{E}[X]$$
It’s often explained in terms related to concavity, but if you think about the shape of the logarithm, it’s pretty intuitive. The logarithm is compressive for large values and explosive near zero, so a few tiny values pull the average of logs way down. If you average first, those tiny values are cushioned before taking the log.
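Here is a quick numerical illustration of the inequality, using an arbitrary positive random variable of my choosing (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)

# Samples of a positive random variable (made-up example: a log-normal).
samples = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)

avg_of_log = np.log(samples).mean()
log_of_avg = np.log(samples.mean())
print(avg_of_log, "<=", log_of_avg)  # Jensen: E[log X] <= log E[X]
```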
Here’s the derivation you see in most places. Start from the model evidence:
$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz = \log \int q_\phi(z \mid x)\, \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
Apply Jensen’s inequality (log of an expectation ≥ expectation of the log):
$$\log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
And we’re done! Call that thing the ELBO:
$$\mathrm{ELBO}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
Importantly, the (variational) Free Energy is the negative ELBO:
$$F(x; \theta, \phi) = -\mathrm{ELBO}(x; \theta, \phi) = -\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
There you go! It’s not that mystical how to get there. But what is it good for?
Well, we can see from the inequality that it’s a bound on the “model evidence”… what we were calling log-likelihood at the top:
$$\log p_\theta(x) \geq \mathrm{ELBO}(x; \theta, \phi) = -F(x; \theta, \phi)$$
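Here is a hedged sketch of that bound in a model simple enough that the evidence is known in closed form (a linear-Gaussian setup of my own choosing, assuming NumPy/SciPy). A sloppy $q$ gives a loose bound; the exact posterior makes the bound tight:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# A made-up linear-Gaussian model where the evidence is known in closed form:
# prior p(z) = N(0, 1), likelihood p(x | z) = N(z, 1), so evidence p(x) = N(0, 2).
x = 1.3                                     # one observation
log_evidence = norm(loc=0.0, scale=np.sqrt(2.0)).logpdf(x)

def elbo(q_mean, q_std, n_samples=200_000):
    """Monte Carlo ELBO = E_q[ log p(x, z) - log q(z | x) ]."""
    z = rng.normal(q_mean, q_std, size=n_samples)
    log_joint = norm(0, 1).logpdf(z) + norm(z, 1).logpdf(x)
    log_q = norm(q_mean, q_std).logpdf(z)
    return (log_joint - log_q).mean()

# A sloppy q gives a loose bound; the exact posterior N(x/2, sqrt(1/2)) makes it tight.
print(elbo(0.0, 1.0), "<=", log_evidence)
print(elbo(x / 2, np.sqrt(0.5)), "~=", log_evidence)
```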
But what disappeared in that inequality? What are we actually doing when we minimize Free Energy?
Let’s rederive without Jensen and see what we’re missing.
Without Jensen (the exact identity)
Start with any tractable density $q_\phi(z \mid x)$ (whose integral is 1):
$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x)\right] = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log\!\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)} \cdot \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right)\right]$$
Therefore,
$$\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\mathrm{ELBO}(x;\, \theta,\, \phi)} + \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]}_{D_{\mathrm{KL}}\left(q_\phi(z \mid x)\, \|\, p_\theta(z \mid x)\right)}$$
What does this mean?
Well, now we can see clearly what disappeared in the Jensen’s inequality derivation: the KL divergence between the approximate posterior and the true posterior. This is not a bound anymore. It’s an exact identity. Now, because the KL is always $\geq 0$, if we maximize the ELBO with respect to the variational posterior parameters $\phi$, we are guaranteed to minimize the KL divergence between the approximate posterior and the true posterior. If we maximize the ELBO with respect to the model parameters $\theta$, we can push up the model evidence and minimize the KL between the world and the brain’s model. Thus, maximizing the ELBO (or minimizing Free Energy) is particularly useful during inference for minimizing the KL between the variational posterior and the true posterior.
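The identity is easy to verify numerically. Below is a sketch with a small made-up discrete model, where the evidence, the true posterior, the ELBO, and the KL can all be computed exactly:

```python
import numpy as np

# A tiny discrete model (made-up numbers) where everything is computable exactly:
# z takes 3 values, x takes 2 values.
prior = np.array([0.2, 0.5, 0.3])              # p_theta(z)
lik = np.array([[0.9, 0.1],                    # p_theta(x | z): rows are z, columns are x
                [0.4, 0.6],
                [0.2, 0.8]])

x = 1                                          # the observed value of x
evidence = np.sum(prior * lik[:, x])           # p_theta(x)
true_posterior = prior * lik[:, x] / evidence  # p_theta(z | x)

q = np.array([0.1, 0.3, 0.6])                  # some arbitrary variational posterior q_phi(z | x)

elbo = np.sum(q * np.log(prior * lik[:, x] / q))    # E_q[log p_theta(x, z) - log q(z | x)]
kl_q_post = np.sum(q * np.log(q / true_posterior))  # D_KL(q || p_theta(z | x))

print(np.log(evidence), "==", elbo + kl_q_post)     # the exact identity, no inequality
```

No inequality anywhere: the log evidence equals the ELBO plus the KL to the true posterior, for any valid $q$.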
Putting it all together: What about learning? How do we make the brain’s model match the world?
So far we showed that maximum likelihood can be interpreted as minimizing the KL divergence between the world and the brain’s model. We also showed that minimizing Free Energy is equivalent to minimizing the KL divergence between the variational posterior and the true posterior. Now we’re going to combine them both to see what minimizing Free Energy is really doing.
First, remember that Free Energy is the negative ELBO:
$$F(x; \theta, \phi) = -\mathrm{ELBO}(x; \theta, \phi)$$
so, using the exact identity above,
$$F(x; \theta, \phi) = -\log p_\theta(x) + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$$
Second, remember our trick up above that minimizing the negative log-likelihood is the same as minimizing the KL divergence between the world and the brain’s model:
$$\mathbb{E}_{p^*(x)}\!\left[-\log p_\theta(x)\right] = H(p^*) + D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right)$$
Let’s put these together and see what happens when we minimize the Free Energy, averaged over samples from the world:
$$\mathbb{E}_{p^*(x)}\!\left[F(x; \theta, \phi)\right] = \underbrace{H(p^*)}_{\text{world's entropy}} + \underbrace{D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right)}_{\text{learning}} + \underbrace{\mathbb{E}_{p^*(x)}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)\right]}_{\text{inference}}$$
Conclusion (what minimizing Free Energy does):
- Improves the generative model (minimizes $D_{\mathrm{KL}}\!\left(p^*(x) \,\|\, p_\theta(x)\right)$).
- Improves inference (minimizes the expected $D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$).
- The world’s entropy $H(p^*)$ is constant in $\theta$ and $\phi$ — you can’t change physics; you can only make your brain model and inference better.
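To check that these three pieces really do add up, here is a final sketch with a fully discrete toy model (made-up numbers, assuming NumPy): the expected Free Energy under the world equals the world’s entropy, plus the learning KL, plus the expected inference KL.

```python
import numpy as np

# A fully discrete toy setup (made-up numbers) to check the decomposition numerically.
p_world = np.array([0.6, 0.4])                 # p*(x) over 2 possible observations

prior = np.array([0.3, 0.7])                   # brain's prior p_theta(z) over 2 causes
lik = np.array([[0.8, 0.2],                    # p_theta(x | z): rows are z, columns are x
                [0.1, 0.9]])
p_model = prior @ lik                          # marginal p_theta(x)
posterior = (prior[:, None] * lik) / p_model   # p_theta(z | x), columns indexed by x

q = np.array([[0.5, 0.2],                      # q_phi(z | x), columns indexed by x
              [0.5, 0.8]])

def kl(p, r):
    return np.sum(p * np.log(p / r))

# Free Energy for each x: F(x) = -log p_theta(x) + KL(q(.|x) || p_theta(.|x))
F = np.array([-np.log(p_model[x]) + kl(q[:, x], posterior[:, x]) for x in range(2)])
expected_F = np.sum(p_world * F)

inference_kls = np.array([kl(q[:, x], posterior[:, x]) for x in range(2)])
decomposition = (-np.sum(p_world * np.log(p_world))   # H(p*): the world's entropy
                 + kl(p_world, p_model)               # learning term
                 + np.sum(p_world * inference_kls))   # expected inference term

print(np.isclose(expected_F, decomposition))          # True: the three pieces add up
```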
Conclusion
In this blog post, we derived the free energy principle from first principles. We learned that simply trying to assign high probability to probable events is equivalent to making the brain’s model fit the world well (by minimizing KL divergence). We learned that the Evidence Lower Bound (ELBO) is pretty easy to derive, even without Jensen’s inequality, and that doing so leads to an exact identity rather than a bound, where maximizing the ELBO is actually minimizing two intractable KLs that we’re really interested in minimizing. The KL Divergence emerges as a metric of how good our models are in two places. The KL between the world and the brain’s model tells us how well the brain’s model fits the world (Learning). The KL between the variational posterior and the true posterior tells us how well the brain’s inference matches the true posterior (Inference). Minimizing Free Energy is equivalent to minimizing both of these KLs! And the KL is really the mother principle!
