Generative Models in Perception

I started this tutorial on “perception-as-inference” in the last post with the idea that – through the mechanisms of ambiguity and noise – the world enters the mind through the senses in an incomplete form, lacking a clear and unambiguous interpretation. I hinted that perception may engage an inference process, using its prior experience in the world to settle on a particular likely interpretation of a scene (or perhaps a distribution of likely interpretations). In this school of thought, perception itself is the result of an inference process deciding on one likely interpretation of sensory data or another. The key function of sensory neurons in the brain would then be computing and evaluating probability distributions of plausible “features” of the world.

But… there is no one way to build a probabilistic model. How does the brain know what “features” to look for? For example, how does it decide to sort out the cacophony of electrical signals coming from the optic nerve in terms of objects, lights, textures, and everything else that makes up our visual experience? When listening to music, how does it decide to interpret vibrations of the eardrum as voices and instruments? One appealing hypothesis is that the brain learns¹ generative models for its stream of sensory data, which can be thought of as a particular type of probabilistic model that captures cause and effect². In our visual example, objects, lights, and textures are the causes, and electrical signals in the visual system are the effects. Inference is the reverse process of reasoning about causes after observing effects. More on this below…

Generative Models, Latent Variables

Given some complicated observed data, a generative model posits that there exists a set of unobserved states of the world as the underlying cause of what is being seen or measured. Let’s take a more relatable example. When radiologists learn to read X-rays, they could learn to directly correlate the patterns of splotches in the image pixel by pixel with possible adverse health symptoms; but this would not be a very good use of their time. Instead, they learn how diseases cause both adverse health symptoms and patterns of scan splotches. The ailment or disease may never be observed directly (it is a latent variable), but it may be inferred since the doctor knows how the disease manifests in observable things – i.e. the doctor has learned a generative model of X-ray images and symptoms conditioned on possible diseases.

To perceive objects in a scene, your brain solves an analogous problem. In the visual example, the impulses in your optic nerve are the “symptoms” and any objects, people, or shapes you perceive are the root cause of them – the “disease” (no offense to objects, people, and shapes). That is, we reason about visual things in terms of objects because our visual system has implicitly learned a (generative) process in which objects cause signals in the eye. This process involves photons bouncing off the object then passing through the eye, the transduction of those photons into electrical signals of retinal rod and cone cells, some further retinal preprocessing of those signals, and eventually relaying them down the optic nerve to the rest of the brain. Suffice it to say, it is complicated. Now imagine trying to invert that whole process, going from nerve signals back to objects, and you might gain a new appreciation for what your visual system does every waking moment of your life!

Analysis by Synthesis

One intuitive, though not very effective, way of doing inference with generative models is the idea of analysis by synthesis[1]. Using the X-ray example from above, imagine the life of the frazzled doctor who memorized a procedure for sketching drawings of what different diseases might look like on a scan, but has not yet figured out how to go in the other direction – i.e. to look at a scan and jump to a diagnosis. “Surely,” the doctor thinks, cursing her backwards education, “they should not have taught us how diseases cause symptoms, when what we really care about is the other way around – making a diagnosis!”

But this doctor can still make progress. Imagine that she churns out sketches of expected scans for every possible disease in proportion to how often each disease occurs, and compares the sketches side by side to a patient’s scan. After a long night with pencil and paper, she finds that her imagined sketch of a hypothetical case of pneumonia looks suspiciously similar to the scan of a patient earlier that day (and the other sketches do not match). Suddenly, pneumonia becomes the most probable diagnosis.

This example shows that data can be analyzed simply by synthesizing exemplars and comparing each one to the observations. In spirit, this is what inference in a generative model is all about – finding the most likely (unobserved) causes for some (observed) effects by searching over all possible causes and considering (1) whether each cause is consistent with the observations, and (2) how likely that cause is a priori. In practice, there are much, much more efficient algorithms for inference, which will be described in more detail in future posts.
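
The doctor's procedure can be sketched in a few lines of Python. Everything concrete here is made up for illustration: the disease names, their prior frequencies, the three-pixel "scans", and the Gaussian similarity score are all assumptions. Rather than drawing sketches in proportion to each disease's frequency, this version enumerates every disease once and multiplies the match score by the prior, which plays the same role.

```python
import math

# Hypothetical prior: how often each disease occurs (assumed numbers).
DISEASE_PRIOR = {"healthy": 0.90, "pneumonia": 0.08, "fracture": 0.02}

# Hypothetical generative model: the "sketch" each disease is expected to
# produce, written as a tiny three-pixel scan (assumed numbers).
EXPECTED_SCAN = {
    "healthy": [0.1, 0.1, 0.1],
    "pneumonia": [0.8, 0.7, 0.1],
    "fracture": [0.1, 0.2, 0.9],
}

def likelihood(scan, sketch, noise_sd=0.2):
    """Score how well a synthesized sketch matches the observed scan,
    assuming independent Gaussian pixel noise (unnormalized)."""
    sq_err = sum((s - k) ** 2 for s, k in zip(scan, sketch))
    return math.exp(-sq_err / (2 * noise_sd ** 2))

def analysis_by_synthesis(scan):
    """Posterior over diseases: synthesize a sketch for every candidate
    cause, score its match to the scan, and weight by the prior."""
    scores = {
        d: DISEASE_PRIOR[d] * likelihood(scan, EXPECTED_SCAN[d])
        for d in DISEASE_PRIOR
    }
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

posterior = analysis_by_synthesis([0.7, 0.6, 0.2])
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

With these assumed numbers, the scan resembles the pneumonia sketch closely enough to outweigh the fact that "healthy" is far more common a priori. A scan close to the healthy sketch would tip the answer the other way.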

For now, I will end by suggesting that you take a few moments to introspect next time you get tricked by an everyday illusion. It happens all the time – we hear a distant sound or see something out of the corner of our eye and think we know what it is, then a moment later we reconsider and realize we’ve made a mistake. Next time this happens, ask yourself whether your first impression made sense in the context of generative models and inference. Did you jump to the first conclusion because it was simpler? Were you expecting one thing but encountered another? Could the “data” coming into your senses have plausibly been generated by both the first and second interpretation?


  1. This could mean an individual’s learning from experience, or coarser shaping of the system by evolution.
  2. Sometimes the term “generative model” is used off-hand to mean the same thing as a “probabilistic model.” If you give me a joint distribution p(X,Y) of two variables X and Y, I can generate values of X consistent with the constraint that Y takes on a particular value y by evaluating p(X|Y=y). When we factorize the joint distribution into p(X|Y)p(Y), we say that we are modeling a process where Y generates X. Conversely, p(Y|X=x) can be used to generate values of Y consistent with X taking on the value x. However, there is a distinction to be made between simply factorizing a joint distribution in one way or another, as I just described, and having a true generative model. The former just describes correlations or statistical coincidences, while the latter describes causation of the form “if Y takes on this value, then X will take on these other values with some probability.” The distinction matters when an intervention can be made to perturb Y and we care about whether this will affect X. In the context of perception-as-inference, we typically have the latter type – a true causal model – in mind. It is unclear to me, however, if I would be any worse off with a purely correlational visual system. (In other words, it might not matter to my survival if I assume that my senses cause the world to exist in a particular state). Perhaps I will revisit this distinction in a future post.


[1] Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.

Uncertainty, Ambiguity, and Noise in Perception

Why probability?

Let’s start with a broad question: what makes probability theory the right tool to model perception? To put it simply, the world is much fuzzier and less certain than it seems. Given the chance to design a system that functions in an uncertain world, the optimal thing to do would be to have it explicitly reason about probable and improbable things. But if the brain is not designed per se, then is there more than human ego to make us think that it would function in a similarly optimal way? In a future post, I will outline the kinds of evidence there are for this, but for now let’s just assume (or let’s hypothesize) that the brain is at least trying to do the right thing, and is getting pretty close. This is known as the Bayesian Brain Hypothesis,[1] and its roots go at least as far back as Hermann von Helmholtz, who in 1867 described perception as a process of “unconscious [probabilistic] inference”.[2]

A similar story can be told for cognition: where are the crisp boundaries between concepts like ‘cup’ and ‘mug’? What makes something ‘art’? When is thought precise? Rather than relying on crisp definitions, the mind works largely by induction, generalization, simulation, and reasoning, all of which are naturally formalized in probabilistic terms.

It should be clarified that a probabilistic brain is not necessarily a random brain. If you are predicting the outcomes of a weighted coin flip that is heads 55% of the time and tails 45% of the time, the ideal probabilistic response would be to guess heads every time. What’s important is that a probabilistic system is not very confident in that guess. Later, we will see examples of ‘sampling’ algorithms where generating random numbers is a tool for reasoning probabilistically, but other algorithms achieve probabilistic answers with no randomness at all!
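
The coin example can be checked with a little arithmetic. The sketch below compares always guessing heads against a randomized guesser that says heads 55% of the time; "probability matching" is my label for that second strategy, not a term used elsewhere in this post.

```python
P_HEADS = 0.55  # bias of the coin from the example above

# Deterministic strategy: always guess the more probable outcome.
accuracy_always_heads = P_HEADS

# Randomized strategy ("probability matching"): guess heads 55% of the time,
# independently of the coin. A guess is correct when guess and coin agree.
accuracy_matching = P_HEADS * P_HEADS + (1 - P_HEADS) * (1 - P_HEADS)

print(f"always heads:         {accuracy_always_heads:.3f}")
print(f"probability matching: {accuracy_matching:.3f}")
```

Always guessing heads is right 55% of the time, while the randomized guesser is right only about 50.5% of the time. The probabilistic part of the ideal response lies in how much confidence it attaches to the guess, not in randomizing the guess itself.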

Whence comes the uncertainty

Even in early sensory processing, the brain faces substantial uncertainty about the ‘true’ state of the world. I like to break this problem down into two parts that I call ambiguity and noise. Using visual terms,

  1. ambiguity arises when many different ‘world states’ can give rise to exactly the same image.
  2. noise refers to the fact that subtly different images might evoke the same activity in the brain.

[Figure: "illusion" in which an object casts a shadow on a chess board.]

The chessboard image shown here is a classic example of how ambiguity arises and how the brain resolves it. Light squares in the cylinder’s shadow are exactly the same color as dark squares outside the shadow, yet they are perceived differently. To put it another way, exactly the same image (a gray square) was created from different world states (light square + shadow or dark square + no shadow). It is easy to think that this is a trivial problem since we so quickly and effortlessly perceive the true nature of the scene, but these same mechanisms make us susceptible to other kinds of illusions.

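To make the chessboard ambiguity concrete, here is a minimal sketch. It assumes a toy image-formation rule (pixel luminance = surface reflectance × illumination) and made-up reflectance and illumination values; the `p_shadow` argument stands in for whatever the surrounding scene suggests about where the shadow falls.

```python
import math

# Assumed toy image-formation model: luminance = reflectance * illumination.
REFLECTANCE = {"light square": 0.8, "dark square": 0.4}
ILLUMINATION = {"in shadow": 0.5, "no shadow": 1.0}

def render(square, lighting):
    """Generate the pixel luminance that a given world state would produce."""
    return REFLECTANCE[square] * ILLUMINATION[lighting]

def consistent_world_states(luminance):
    """List every (square, lighting) pair that would produce this luminance."""
    return [
        (sq, li)
        for sq in REFLECTANCE
        for li in ILLUMINATION
        if math.isclose(render(sq, li), luminance)
    ]

def interpret(luminance, p_shadow):
    """Posterior over the consistent world states, given a contextual belief
    p_shadow that this patch lies in shadow (square types equally likely)."""
    states = consistent_world_states(luminance)
    weights = [p_shadow if li == "in shadow" else 1 - p_shadow for _, li in states]
    total = sum(weights)
    return {state: w / total for state, w in zip(states, weights)}

print(consistent_world_states(0.4))
print(interpret(0.4, p_shadow=0.9))
```

The same gray value is produced by two different world states, so the pixel alone cannot settle the question. Once context makes a shadow likely, the "light square in shadow" interpretation dominates.
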
Noise occurs partly because neurons are imperfect, so the same image on the retina evokes different activity in visual cortex at different times. Noise also occurs when irrelevant parts of a scene are changing, like dust moving on a camera lens, or small movements of an object while trying to discern what it is. These are called “internal noise” and “external noise” respectively. Strictly speaking, noise as described here is not by itself an issue; the same input may map to many different patterns of neural activity in the brain, but as long as we can invert the mapping there is no problem! The only time we cannot do this inversion is when the noise results in overlapping neural patterns for different states in the world.¹ That is, noise is only an issue if it results in ambiguity!

[Figure: cartoon of mappings from world states "A", "B", and "C" to brain states. All three mappings contain "noise" so that a single world state maps to multiple brain states. This is only a problem when the noise causes two world states to be mapped to the same brain state.]
A, B, and C refer to different world states (different images) on the left, and different brain states (patterns of neural activity) on the right. Despite noise (gray ovals), C can be distinguished from both A and B. Where A and B overlap, the mapping from the world to the brain cannot be inverted, so there would be uncertainty in whether A or B was the cause.
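
The cartoon above can be written out as a small table and inverted with Bayes' rule. This is only a sketch: the brain-state labels `r1` to `r5`, the 50/50 noise, and the uniform prior over world states are all assumptions chosen to mirror the picture.

```python
# Hypothetical noisy mapping from world states to brain states r1..r5.
# Each row is p(brain state | world state); A and B can both produce r2.
P_BRAIN_GIVEN_WORLD = {
    "A": {"r1": 0.5, "r2": 0.5},
    "B": {"r2": 0.5, "r3": 0.5},
    "C": {"r4": 0.5, "r5": 0.5},
}
WORLD_PRIOR = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}  # assumed uniform

def invert(brain_state):
    """Posterior p(world state | brain state) by Bayes' rule."""
    joint = {
        w: WORLD_PRIOR[w] * P_BRAIN_GIVEN_WORLD[w].get(brain_state, 0.0)
        for w in WORLD_PRIOR
    }
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items() if p > 0}

for r in ["r1", "r2", "r4"]:
    print(r, invert(r))
```

Brain states `r1` and `r4` each point back to a single world state even though the mapping is noisy. Only `r2`, where the noise from A and B overlaps, leaves the cause uncertain.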

Noise explains why there is a limit to our ability to make extremely fine visual distinctions, like the difference between a vertical line and a line tilted off of vertical by a small fraction of a degree – similar enough inputs will have indistinguishable patterns of neural activity.

Finally, it is important to note that an information bottleneck in the visual system also indirectly implies a kind of noise.² Information bottlenecks arise whenever a ‘channel’ can take on fewer states than the messages sent across it; a classic example is that the optic nerve has too few axons to transmit the richness of all retinal patterns.

[Figure: visualization of how a complicated scene passed through an "information bottleneck" results in a loss of detail.]
Think of an information bottleneck as losing detail about a scene. The logic may seem backwards, but the lack of detail on the right implies that some scenes on the left are forced to become indistinguishable after passing through the bottleneck.

The fact that an information bottleneck implies uncertainty is counter-intuitive at first since it makes no statement about how any particular image or scene is affected.³ Perhaps it is more intuitive to think of an information bottleneck as a kind of continuous many-to-one mapping, where different inputs are forced to map to similar neural states. As seen in the overlap between states “A” and “B” in the second illustration above, a many-to-one mapping cannot be inverted, so there must be uncertainty about the true scene.
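
A discrete toy version of this many-to-one mapping is easy to write down. The sketch assumes eight distinguishable scenes and a channel with only four states; the particular `encode` rule (merging neighbouring scenes) is just one arbitrary choice, since any encoder with fewer outputs than inputs must merge something.

```python
from collections import defaultdict

N_CHANNEL_STATES = 4      # assumed capacity of the toy "optic nerve"
scenes = list(range(8))   # eight distinguishable input scenes (assumed)

def encode(scene):
    """Toy bottleneck: squeeze 8 scenes into 4 channel states by merging
    neighbouring scenes."""
    return scene * N_CHANNEL_STATES // len(scenes)

def preimages():
    """Group scenes by the channel state they are mapped to."""
    groups = defaultdict(list)
    for s in scenes:
        groups[encode(s)].append(s)
    return dict(groups)

for state, candidates in preimages().items():
    print(f"channel state {state}: could have been any of scenes {candidates}")
```

Every channel state has more than one scene behind it, so an observer who sees only the channel state is left uncertain about which scene occurred.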

Wrapping up

Wherever there is uncertainty, the optimal thing to do is to play the odds and think probabilistically. As I hope to have conveyed in this post, uncertainty about the ‘true’ state of the world is a ubiquitous problem for perception, though it may not seem so introspectively. Future posts will elaborate on the process of inference, which resolves such uncertainty and settles on the most likely interpretation(s) of an image or scene.


1. Advanced readers may recognize this as the logic behind information-limiting correlations.[3]

2. For those familiar with information theory, this is because an upper bound on the mutual information between input (an image) and evoked neural activity implies a lower bound on the conditional entropy of the neural activity given the input. For those unfamiliar with information theory, stay tuned for a future post =)

3. Rate distortion theory allows one to make more concrete statements about this mapping, but only having assumed a loss function that quantifies how bad it is to lose some details relative to others.


[1] Knill, D. C., & Pouget, A. (2004). The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), 712–719.

[2] von Helmholtz, H. (1867). Handbuch der physiologischen Optik.

[3] Moreno-Bote, R., Beck, J. M., Kanitscheider, I., Pitkow, X., Latham, P., & Pouget, A. (2014). Information-limiting correlations. Nature Neuroscience, 17(10), 1410–1417.