Generative Models in Perception

I started this tutorial on “perception-as-inference” in the last post with the idea that – through the mechanisms of  ambiguity and noise – the world enters the mind through the senses in an incomplete form, lacking a clear and unambiguous interpretation. I hinted that perception may engage an inference process, using its prior experience in the world to settle on a particular likely interpretation of a scene (or perhaps a distribution of likely interpretations). In this school of thought, perception itself is the result of an inference process deciding on one likely interpretation of sensory data or another. The key function of sensory neurons in the brain would then be computing and evaluating probability distributions of plausible “features” of the world.

But… there is no one way to build a probabilistic model. How does the brain know what “features” to look for? For example, how does it decide to sort out the cacophony of electrical signals coming from the optic nerve in terms of objects, lights, textures, and everything else that makes up our visual experience? When listening to music, how does it decide to interpret vibrations of the eardrum as voices and instruments? One appealing hypothesis is that the brain learns1  generative models for its stream of sensory data, which can be thought of as a particular type of probabilistic model that captures cause and effect2. In our visual example, objects, lights, and textures are the causes, and electrical signals in the visual system are the effects. Inference is the reverse process of reasoning about causes after observing effects. More on this below…

Generative Models, Latent Variables

Given some complicated observed data, a generative model posits that there exists a set of unobserved states of the world as the underlying cause of what is being seen or measured. Let’s take a more relatable example. When radiologists learn to read X-rays, they could learn to directly correlate the patterns of splotches in the image pixel by pixel with possible adverse health symptoms; but this would not be a very good use of their time. Instead, they learn how diseases cause both adverse health symptoms and patterns of scan splotches. The ailment or disease may never be observed directly (it is a latent variable), but it may be inferred since the doctor knows how the disease manifests in observable things – i.e. the doctor has learned a generative model of X-ray images and symptoms conditioned on possible diseases.

To perceive objects in a scene, your brain solves an analogous problem. In the visual example, the impulses in your optic nerve are the “symptoms” and any objects, people,or shapes you perceive are the root cause of them – the “disease” (no offense to objects, people, and shapes). That is, we reason about visual things in terms of objects because our visual system has implicitly learned a (generative) process in which objects cause signals in the eye. This process involves photons bouncing off the object then passing through the eye, the transduction of those photons into electrical signals of retinal rod and cone cells, some further retinal preprocessing of those signals, and eventually relaying them down the optic nerve to the rest of the brain. Suffice it to say, it is complicated. Now imagine trying to invert that whole process, going from nerve signals back to objects, and you might gain a new appreciation for what your visual system does every waking moment of your life!

Analysis by Synthesis

One intuitive, though not very effective way of doing inference with generative models is the idea of analysis by synthesis[1]. Using the X-ray example from above, imagine the life of the frazzled doctor who memorized a procedure for sketching drawings of what different diseases might look like on a scan, but has not yet figured out how to go in the other direction – i.e. to look at a scan and jump to a diagnosis. “Surely,” the doctor thinks, cursing her backwards education, “they should not have taught us how diseases cause symptoms, when what we really care about is the other way around – making a diagnosis!”

But this doctor can still make progress. Imagine that she churns out sketches of expected scans for every possible disease in proportion to how often each disease occurs, and compares the sketches side by side to a patient’s scan. After a long night with pencil and paper, she finds that her imagined sketch of a hypothetical case of pneumonia looks suspiciously similar to the scan of a patient earlier that day (and other sketches have not matched). Suddenly, pneumonia became the most probable diagnosis.

This example shows that data can be analyzed simply by synthesizing exemplars and comparing each one. In spirit, this is what inference in a generative model is all about – finding the most likely (unobserved) causes for some (observed) effects by searching over all possible causes and considering (1) whether it is consistent with the observations, and (2) how likely the cause is a priori. In practice, there are much, much more efficient algorithms for inference, which will be described in more detail in future posts.

For now, I will end by suggesting that you take a few moments to introspect next time you get tricked by an every-day illusion. It happens all the time – we hear a distant sound or see something out of the corner of our eye and think we know what it is, then a moment later we reconsider and realize we’ve made a mistake. Next time this happens, ask yourself whether your first impression made sense in the context of generative models and inference. Did you jump to the first conclusion because it was simpler? Were you expecting one thing but encountered another? Could the “data” coming into your senses have plausibly been generated by both the first and second interpretation?


  1. This could mean an individual’s learning from experience, or coarser shaping of the system by evolution.
  2. Sometimes the term “generative model” is used off-hand to mean the same thing as a “probabilistic model.” If you give me a joint distribution p(X,Y) of two variables X and Y, I can generate values of X consistent with the constraint that Y takes on a particular value y by evaluating p(X|Y=y). When we factorize the joint distribution into p(X|Y)p(Y), we say that we are modeling a process where Y generates X. Conversely, p(Y|X=x) can be used to generate values of Y consistent with X taking on the value x. However, there is a distinction to be made between simply factorizing a joint distribution in one way or another, as I just described, and having a true generative model. The former just describes correlations or statistical coincidences, while the latter describes causation of the form “if Y takes on this value, then X will take on these other values with some probability.” The distinction matters when an intervention can be made to perturb Y and we care about whether this will affect X. In the context of perception-as-inference, we typically have the latter type – a true causal model – in mind. It is unclear to me, however, if I would be any worse off with a purely correlational visual system. (In other words, it might not matter to my survival if I assume that my senses cause the world to exist in a particular state). Perhaps I will revisit this distinction in a future post.


[1] Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.