A recipe for representations that support intelligence
Moving beyond the Bitter Lesson
Richard Sutton’s Bitter Lesson has been one of the most influential essays in AI. It distilled decades of experience into a simple truth: the methods that scale best are the ones that learn, not the ones that encode our intuitions. Sutton argued that over time, hand-designed representations always lose to general-purpose learning and search. This has been highly impactful, with many AI researchers – especially those at frontier labs – subscribing to a []“Bitter Lesson pilled”](https://x.com/karpathy/status/1973435013875314729?lang=en) approach to AI.
But the final paragraph of The Bitter Lesson walks an internally contradictory tightrope that is hard to make coherent. I’ve reproduced it in full here so you can see that I’m not misrepresenting it:
“The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.” [emphasis is mine]
Sutton wants agents that learn like we do, yet insists we abandon any attempt to understand or embed the structures that make such learning possible. Now that Sutton is making the rounds arguing that LLMs do not qualify as the Bitter Lesson, because they don’t “learn like we do”, it seems reasonable to take stock of what like we do means.
Agents that learn “like we do”
The most concrete lesson neuroscience and psychology has to offer is that biological brains don’t start tabula rasa. They have strong inductive biases. Different species build in different amounts of these inductive biases, which are encoded in the genome. Tony Zador has a nice review on this idea here, where he argues that a lot of biological “learning” is evolutionary and encoded in the genes. But in this blog post, I’m going to restrict “we” to humans. Unlike other animals, we take a really long time to develop and learn a lot through experience, but we have strong enough inductive biases that we mostly all end up learning how to be humans.
So, if the Bitter Lesson was that hand-crafted content doesn’t outperform general-purpose learning at scale, the bittersweet lesson is that some structure must still be built in. This blog post is about what that structure is. It’s not about the learning, per se, but about what will be true of representations that support intelligence like ours.
In my view, there are three fundemental ingredients. I think all intelligence will have these three core features. These are necessary (and possibly sufficient) for representations that support intelligence. I introduce them in a particular order because they build on each other in sequence:
Geometry (Representation of Transformations)
Hierarchy (Coarse Graining)
Action (Representations framed in the self. Learning through Doing)
Of course hierarchy is familiar to anyone who has worked with deep learning. And action is familiar to anyone in RL. But I’m going to make these definitions more precise below and suggest ways in which they interact. Importantly, LLMs already have aspects of these ingredients, but they need all three.
Geometry reduces the space of solutions, constructs stable invariants, and facilitates generalization
Geometry isn’t really about shapes. It is about how things relate to each other, and the transformations that preserve those relationships.
Rotate a square, and it’s still a square.
Transpose a melody into a higher key, and it’s still the same tune.
Move an object and all its parts move in relation to each other.
Here is the working definition of geometry I will use here: the relational structure of things and the transformations that preserve that structure. My claim here is that our brain’s representations factorize the relational invariants (the “things” in the world) and the transformations.
There’s good reason to believe brains have strong inductive biases for geometry. Geometry has a universal quality to it. Ancient cultures discovered geometry independently: Euclid in Greece, Liu Hui in China, the Sulba Sutras in India. They also discovered music scales built on frequency ratios. Why? Because the world is structured around such invariances, and brains are especially good at discovering them.
Some animal brains have geometry literally built in: flies represent head direction in a literal ring of neurons. Others are able to learn it quickly. Humans and chicks can generalize from one modality to another with very little training, solving Molyneux’s problem. Molyneux’s problem asks whether a person born blind, upon gaining sight, could recognize shapes they had only known by touch. The empirical answer is basically “yes”. With very little experience, people with restored vision can match across senses. Chickens solve versions of Molyneux’s problem instantly. The lesson is that vision, touch, movement, and thought can align because they share relational structure. Geometry provides a common representational format across modalities.
In contrast, this is an obvious place LLMs and other transformer-based models fall short. They do exhibit some relational structure (king–man+woman≈queen), but many features live in superposition, entangled in shared dimensions rather than cleanly factorized. That makes transformations brittle and context-dependent. A concrete example comes from video models like Sora: one of the main advances in Sora 2 was “stronger frame consistency”. But why was consistency a problem in the first place? Because the underlying representations weren’t biased toward stability under transformation. Geometry is precisely about this kind of stability. The right inductive bias makes consistency default instead of something that requires training on the whole internet.
At their worst, models without geometry resemble Borges’s Funes the Memorious. The titular character has a sort of brain damage where he remembers everything in excruciatingly particular detail but is unable to grasp abstractions or maintain stable representations as they transform over time and space: Funes experiences the mane of a horse as a constantly changing flame, in contrast to our intuition of a simple geometric shape. This is exactly what happened in early video models. One solution is more data. Another is to learn like we do and build the right inductive biases into the model.
The limits of geometric formalism
In modern deep learning, some approaches do try to build in geometry: this is the domain of equivariant neural networks. Equivariance means that transformations in the inputs have corresponding transformations in the representations. For example, convolutional neural networks are translation equivariant (at least up to aliasing artifacts). If you translate the input image, the representations are translated the same way. Geometric deep learning extends this to other transformations like rotations, scalings, and more abstract ones like changes in lighting or style.
It’s important not to take the geometry metaphor too literally – and I think equivariant neural networks does just that. The mathematical framework of geometric deep learning — where representations are strictly closed under group actions — is an idealization. Biological representations do not follow that idealized form.
Consider the Thatcher effect. When a face is inverted and the eyes and mouth are flipped right-side up within it, the result looks grotesque when the whole image is turned upright — but appears surprisingly normal when viewed upside down. If face representations were strictly equivariant under rotation, inverting the whole image would simply invert the representation, and the local manipulations would be just as salient either way. They aren’t. The brain’s representation of faces is heavily orientation-dependent, tuned to the statistics of upright viewing, and breaks in informative ways when that assumption is violated.
The point is that biological representations are more flexible. Representations are approximately invariant under common transformations, but the approximation degrades for transformations the organism rarely encounters. This means the system allocates representational precision where it matters most. Strict group-theoretic closure is a useful mathematical framework, but I suspect real intelligence will be more flexible — strong geometric biases shaped by the statistics of experience and the demands of behavior.
Hierarchy goes two ways
Hierarchy is primarily about coarse-graining. Coarse-graining means throwing away detail to focus on structure at a larger scale. In physics, you might ignore individual molecules and describe only temperature and pressure. In intelligence, hierarchy works the same way: pixels become edges, edges become shapes, shapes become objects, objects become categories.
But true hierarchy isn’t just one-way compression. Once you’ve formed a high-level interpretation, the details have to fit it. If you decide you’re looking at a coffee cup, the handle, rim, and shading all snap into place. This is how inference actually works: high-level concepts reshape how low-level evidence is understood. Without it, abstraction floats unanchored from the data.
Deep learning captures part of this. Deeper layers of CNNs and transformers do become more abstract. But the flow is only forward. The model doesn’t settle on an interpretation and then revise the details to be consistent with it. Hierarchy in brains is different. For every feedforward connection, there is a feedback connection. And these feedback connections are not merely modulatory — they reshape early representations. Lee and Mumford (2003) argued that recurrent processing between levels of the visual hierarchy is essential for resolving ambiguity: early visual areas don’t just pass information up; they receive interpretations back down, and their activity changes as a result. An edge that is ambiguous in isolation becomes a contour when the higher level has settled on an object. This is what makes biological hierarchy genuinely bidirectional, not just deep.
I’ve pointed out in the past how much inference will look like feedforward processing in most conditions. But overall, perception toes the line. Low-level details adjust to fill in.
Geometry and hierarchy work together. Think about zooming out on Google Maps: the street names and details disappear and more abstract concepts like pedestrian zones become visible. But now move the map and zoom back in and all the details are still there in the right place. That is geometry (relationships preserved under transformation) joined with hierarchy (abstraction across scales). If those two don’t fit together, details wander off when concepts move.
What does this mean for model building? It means that you probably cannot simply tokenize video input and let all intelligence happen downstream. The low level details need to be linked to high level concepts. It’s not just that the high level concepts are constructed from low level input, they, themselves are stable and the low-level representations are inferred to be consistent with the input and the high-level concepts.
Action
Acting is the only way to discover causal structure. This is widely appreciated by the RL community, but often not given enough emphasis elsewhere.Sergey Levine makes this point well here. We humans learn and think through action. Sutton’s main argument against LLMs is that action is not built in to their objective — “they have no goal”.
But action creates a deep representational problem: it entangles the self with the world. Every time you act, the sensory data reflect both the world’s response and your own movements. To be intelligent, a system must disentangle self from world. Think about a camera mounted to a robot’s head. As the robot moves through space, it picks up motion signatures that have nothing to do with the world itself. The brain must separate these out. This is where disentanglement actually becomes important.
There was a brief moment where the field cared a lot about disentanglement, but this was largely abandoned in favor of untangling — finding linearly decodable concepts after training — rather than inducing disentangled representations where individual dimensions are interpretable. One thing that came of this brief interest is that we know it is very unlikely to get disentangled representations without strong inductive biases. Interestingly, hierarchy naturally leads to more disentangled representations when the data are truly hierarchical (as in the real world). Yet despite their depth, LLMs are totally entangled.
Imagine trying to build a causal model of the world on top of such an entangled representation. Sure, you can learn a model of transformations between frames, and with a million hours of video it will actually work — see Genie. But that’s really not learning a world model like we do. We learn to perceive the world in terms of action. It should be understood that sensory inputs, , should be directly related to action, and vice versa. This is something that Gibson recognized, and that conventiaonal approaches to perception missed. Perception and action can’t be treated as separate systems. In a conventional framing, perception is inference — — and action is a downstream decision based on that inference. Gibson argued that this gets the relationship backwards. What you perceive isn’t just what’s there; it’s what you can do, or how the world “affords” action. A chair affords sitting. A handle affords grasping. Perception, for Gibson, is closer to — a representation of the actions available to the agent, given the current sensory state.
Whether or not you buy Gibson’s full framework, the core insight stands: useful representations tie to . A representation that encodes the world without encoding what you can do in it is incomplete. Similarly, a representation that encodes the world without encoding what will happen if you act is also incomplete. And this is precisely what’s missing from models that treat perception as a self-contained inference problem and bolt action on afterward.
The self as a constructed geometry
This disentanglement of self and world is one of the deepest features of conscious experience. The brain actively constructs a model of the self: a body schema, an egocentric reference frame, a felt sense of agency and ownership. I may not have convinced you that the self is constructed, but I think two simple examples reveal some of this process.
Consider the cyclopean eye — the single, constructed viewpoint that sits roughly between your two eyes and serves as the origin of your visual coordinate system. It doesn’t correspond to any physical sensor. It’s an inference, built from the geometry of binocular vision. It is the ego-center from which all spatial perception is organized – a coordinate frame from which other things are experienced. And yet it’s malleable: hold your hands up, form a diamond between your thumbs and index fingers, and look through it at a distant object. Close one eye, then switch. The object jumps behind the hole created by your hands – revealing that your cyclopean eye was centered on one physical eye, not the other. Importantly, you can change the dominant eye back and forth without much effort. This shows that the coordinate frame in which everything visual appears is not fixed.
The rubber hand illusion pushes this further. When a person watches a rubber hand being stroked in synchrony with their own hidden hand, they begin to feel that the rubber hand is their hand. The self has expanded to incorporate something external. And it follows geometric principles: the illusion works because the visual and tactile signals are spatiotemporally consistent, creating a coherent transformation between what you see and what you feel. Break that consistency — introduce a delay, or stroke out of sync — and the illusion collapses. The self is constructed from geometry, but it’s a soft geometry, one that bends when the evidence is compelling enough.
The logic of the self is simple: If perception is inference over the causes of the senses, and action is causing much of the sensory changes, you need to infer yourself as a cause.
Conclusion
These three ingredients — geometry, hierarchy, and action — are not content to be learned. They are constraints on the form that representations take. Right now we are letting general purpose learning algorithms figure out these constraints, and the required data is both too expensive to continue down this path and diverges too much from representations that are like ours for us to want to cede our futures to an alien type of intelligence. Without those inductive biases, the problem is underconstrained — you can fit the data, but you need far too much of it. Take π₀ for example. π₀ was trained on 10,000 hours of robot demonstration data for embodied pretraining, on top of a Vision-Language Model already pretrained on Internet-scale image-text data. This is, frankly, extraordinary. And not at all an agent that learns like we do. I don’t think it is posisble to lay out all the arguments for the alternative path in one blog post, but I wanted to outline the three components that I think are necessary of any representation that supports intelligence like ours.