## Common Misconceptions about (Hierarchical) Generative Models, Part 2

In the previous post I discussed my first 2 of 7ish not-quite-true or misleading intuitions about hierarchical generative models. If you haven’t read it yet, start there. This post picks up where the other left off.

## Background: priors, average-posteriors, and linear Gaussian models

The ideas in this post all rely on a distinction between a model’s prior and its average posterior. I find this distinction so conceptually fundamental that I’m giving it its own section here.

### Linear Gaussian Image Models

As a motivating example, consider the classic sparse linear Gaussian model used by vision researchers everywhere, and let’s start with just a single layer of latent variables. In the generative direction, images are generated as a noisy sum of features, where the features are selected so that each image only has a few of them present (a sparse prior). Inference corresponds to finding which subset of features is present in a given image.

$$\begin{aligned} \mathbf{x} &\sim p(\mathbf{x}) \\ \mathbf{I} &= \eta + \sum_i \mathbf{A}_i \mathbf{x}_i \end{aligned}$$

Here $\eta$ is noise added to the pixels and $\mathbf{A}_i$ is the image patch corresponding to feature $\mathbf{x}_i$. If we have $N$ features and $P$ pixels per image, then $\mathbf{x}\in \mathbb{R}^N$ and each $\mathbf{A}_i\in\mathbb{R}^P$. This is sometimes written as a matrix-vector product where $\mathbf{A}$ is a $P\times N$ matrix, so that $\sum_i \mathbf{A}_i\mathbf{x}_i = \mathbf{Ax}$.

To ensure that only a few features are present in each image, $p(\mathbf{x})$ is set to be a sparse prior. In other words, most of the $\mathbf{x}_i$s are zero (or near zero) most of the time. One simple way to do this is with the prior $$p(\mathbf{x}) \propto e^{-\sum_i |\mathbf{x}_i|^\alpha}$$

This is visualized in the following figure. The signature of a sparse prior is mass concentrated along the axes of the space, since these are the regions where only one feature is “active” and the rest are near zero. When $\alpha = 2$ this is the familiar Gaussian prior. We get sparseness when $0 < \alpha \leq 1$.
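To make the generative direction concrete, here is a minimal sketch of sampling from this model. It assumes NumPy, a randomly drawn (not learned) dictionary `A`, and Gaussian pixel noise; the function names `sample_sparse_prior` and `generate_images` are hypothetical and chosen only for illustration. It relies on the fact that if $x$ has density proportional to $e^{-|x|^\alpha}$, then $|x|^\alpha$ is Gamma-distributed with shape $1/\alpha$.

```python
import numpy as np

def sample_sparse_prior(rng, alpha, size):
    """Draw samples with density proportional to exp(-|x|**alpha).

    If x has this density, |x|**alpha follows a Gamma(1/alpha, 1) distribution,
    so we sample a Gamma variate, invert the power, and attach a random sign.
    """
    magnitude = rng.gamma(shape=1.0 / alpha, scale=1.0, size=size) ** (1.0 / alpha)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * magnitude

def generate_images(rng, A, alpha, noise_std, num_images):
    """Sample images I = A x + eta, with x drawn from the sparse prior."""
    P, N = A.shape
    x = sample_sparse_prior(rng, alpha, size=(num_images, N))
    eta = noise_std * rng.standard_normal((num_images, P))
    return x, x @ A.T + eta

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 16))  # hypothetical dictionary: P=64 pixels, N=16 features
x, images = generate_images(rng, A, alpha=0.5, noise_std=0.1, num_images=1000)
```

With `alpha=2.0` the same sampler produces Gaussian latents with variance 1/2, and with `alpha=1.0` it produces Laplace latents, which is a quick way to sanity-check it.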

Now, let’s say we choose a value of $\alpha$ that gives us a sparse prior and fit the image patches $\mathbf{A}$ to data using Maximum Likelihood. Intuitively, this means we want to find a collection of image patches where any given image is well-described as the sum of a small number of them. A seminal result in vision science is that when such a model is fit to natural images, the learned features begin to look like the kinds of visual features that drive neurons in V1 [1].

Figure: In the sparse linear Gaussian model, each small image patch in a dataset is modeled as the weighted sum of features. The sparse prior encourages only a few $\mathbf{x}_i$s to explain each image. The result is that, once fit to data, the model’s $\mathbf{A}_i$s (the features visualized on each arrow on the right) tend to pick out recurring image parts like edges, corners, gradients, etc., loosely resembling the kinds of things neurons in V1 seem to care about.

### Maximum Likelihood as distribution matching

I mentioned Maximum Likelihood as a way to “fit” the $\mathbf{A}$s to data. Maximum Likelihood chooses the parameters of a model for which the data that we see is most likely to have been generated by the model with those parameters. There’s a nice connection between this idea of finding the parameters for which each image is most likely and the idea of fitting the distribution of images. Recall that the marginal likelihood of a model is the distribution over its outputs that you get by averaging over all possible assignments to other variables:

$$p(\mathbf{I};\mathbf{A}) = \int p(\mathbf{I}|\mathbf{x};\mathbf{A}) p(\mathbf{x}) d\mathbf{x}$$

Another way to view Maximum Likelihood learning is as making the marginal likelihood as close as possible to the actual data distribution. To see why, we will use the Kullback-Leibler (KL) divergence, which is a common mathematical tool for quantifying how dissimilar two probability distributions are. Intuitively (but, it turns out, not formally), KL is like the “distance” between two distributions – it has the desirable properties that it is zero only when the two distributions are exactly equal, and it gives a positive value otherwise. It also just so happens that KL divergence is not symmetric (one of the reasons it’s not formally admissible as a “distance” measure), so we write that the divergence “from p to q” is

$$KL(p||q) = \int p(\mathbf{I}) \log\frac{p(\mathbf{I})}{q(\mathbf{I})}d\mathbf{I}$$
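A small numerical illustration of these properties may help. The sketch below uses the standard closed form for the KL divergence between two univariate Gaussians; the particular means and standard deviations are arbitrary choices for illustration, not anything from the models above.

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, sigma_p^2), q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
            - 0.5)

kl_pq = kl_gaussian(0.0, 1.0, 1.0, 2.0)  # divergence from p to q
kl_qp = kl_gaussian(1.0, 2.0, 0.0, 1.0)  # divergence from q to p
kl_pp = kl_gaussian(0.0, 1.0, 0.0, 1.0)  # identical distributions
```

Here `kl_pp` is zero, both `kl_pq` and `kl_qp` are positive, and the two directions give different values, which is the asymmetry mentioned above.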

A reasonable goal when fitting a generative model to data is that the marginal likelihood of the model should match the empirical data distribution. In other words, we should seek to minimize the distance (e.g. measured using KL) between two distributions: the marginal likelihood and the data distribution. But we don’t have access to the data distribution itself, only a dataset of images sampled from it. Conveniently, KL has the form of an expected value, so it can be estimated using samples from $p$:

$$\begin{aligned} KL(p||q) &= \int p(\mathbf{I}) \log\frac{p(\mathbf{I})}{q(\mathbf{I})}d\mathbf{I} \\ &= \mathbb{E}_p\left[\log\frac{p(\mathbf{I})}{q(\mathbf{I})}\right]\\ &\approx \frac{1}{D} \sum_{i=1}^D \log \frac{p(\mathbf{I}_i)}{q(\mathbf{I}_i)},\;\mathbf{I}_i\sim p \\ &= \frac{1}{D} \sum_{i=1}^D \left[\log p(\mathbf{I}_i) - \log q(\mathbf{I}_i)\right] \\ &= -\frac{1}{D} \sum_{i=1}^D \log q(\mathbf{I}_i) + const,\;\mathbf{I}_i\sim p \end{aligned}$$

If $q$ is the model we’re trying to fit to some given data, we simply ignore the $\log p$ term, since nothing we change in the model will affect it. Returning to our goal of fitting a model by matching marginal likelihood to the data distribution, we see now that minimizing KL is equivalent to minimizing the sum of $-\log q$ over all of our data points, which is just another way of saying maximizing the likelihood under $q$!
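The equivalence can be checked numerically. The following sketch assumes a toy setting that is not part of the original argument: the "data distribution" is a known Gaussian (so that the $\log p$ term can actually be evaluated), and the model family is a unit-variance Gaussian with an unknown mean scanned over a grid.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=5000)  # samples from p = N(1.5, 1), an assumption

candidate_mus = np.linspace(-1.0, 4.0, 501)       # model family q = N(mu, 1)
log_p = gaussian_logpdf(data, 1.5, 1.0)           # the "const" term: independent of mu

avg_loglik = np.array([gaussian_logpdf(data, mu, 1.0).mean() for mu in candidate_mus])
kl_estimate = log_p.mean() - avg_loglik           # (1/D) sum of [log p(I_i) - log q(I_i)]

best_by_kl = candidate_mus[np.argmin(kl_estimate)]
best_by_ml = candidate_mus[np.argmax(avg_loglik)]
```

Because `kl_estimate` differs from `-avg_loglik` only by a constant, `best_by_kl` and `best_by_ml` pick out the same grid point, which in this toy case is the grid value nearest the sample mean.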

If you take nothing else away from this post, remember this: when we fit a generative model to data, we’re at best getting the marginal likelihood close to, but not equal to, the data distribution (at worst, there are bugs in the code and/or we get stuck in local optima). A sparse linear Gaussian model is in fact a terrible model of what images actually look like, and samples from its marginal likelihood will not look like real image patches. Still, we can try fitting the model to a dataset of natural images using Maximum Likelihood to get as close as possible under the restrictive assumptions that the world is sparse, linear, and Gaussian.

### Priors and Average–Posteriors

Figure: The rules of probability define a “loop” from the prior p(x) to images (or data more generally) p(I) and back again. Inference and generation are mirror images. I call this “Bayes’ Loop”. The next figure describes how this symmetry falls apart in practice.

So what does this digression on divergence and model-fitting have to do with priors and posteriors? Well, when we fit a model to some data that does a poor job of capturing the actual data distribution (i.e. the KL between the data distribution and marginal likelihood remains high), some elementary rules in probability seem to break down. Take the definition of marginalization, which tells us that $p(\mathbf{x}) = \int p(\mathbf{x,I}) d\mathbf{I}$, and the product rule which tells us that $p(\mathbf{x,I}) = p(\mathbf{x|I})p(\mathbf{I})$. Putting these together, we get $$p(\mathbf{x}) = \int p(\mathbf{x|I})p(\mathbf{I}) d\mathbf{I}$$ In plain English, this says that the prior $p(\mathbf{x})$ is equal to the average posterior (each $p(\mathbf{x|I})$ is one posterior, and the average is taken over many images $p(\mathbf{I})$). The trouble is, this is only true if we use a self-consistent definition of all of the probabilities involved. It’s tempting to replace the integral over $p(\mathbf{I})$ here with an expected value using the data as we did for KL above, but this is not allowed if $p_{data}(\mathbf{I}) \neq p(\mathbf{I};\mathbf{A})$, which is almost certainly the case for any model of real data!

There are essentially two different probabilistic models at play, one defined by the generative direction, and one defined by the inferences made on a particular dataset. This is visualized in the next figure. The key point is that in any realistic setting, the chosen “prior” will not equal the “average posterior,” even after fitting with Maximum Likelihood. In fact, one could reasonably argue that the average posterior is a better definition of the prior than the one we started out with!

Figure: On the left in blue is our generative model, and on the right in green is what I’ll call the inference model. Unlike in the previous figure, the two sides are distinct and don’t form a closed loop. “Fitting a model” means adjusting the p(I|x) term until the marginal likelihood (lower left) is as close as possible to the data distribution (lower right). In any nontrivial setting, there will be some error 𝛜, some part of the data that our model is unable to capture. As long as there is some 𝛜 at the bottom, there will be some η>0 error at the top, between the regularizer r(x) and the average posterior q(x). Confusion abounds when in some contexts the “prior” means r(x), and in other contexts it means q(x)! I am deliberately not writing “p(x)” from now on to keep these ideas separate.

With the figure above as a reference, let’s define the following terms:

• r(x) is what I’ve so far called the “prior” – the distribution on x we choose before fitting the model. It really should be called the regularizer (hence my choice of “r”). During fitting, it guides the model towards using some parts of x and away from others, but does not by itself have any real guarantees.
• p(I|x) is the part that does the generating. When x is given, we can use it to sample an image. When I is given, it defines the likelihood of x. Importantly, this term is the “glue” which connects the generative model on the left to the inference model on the right.
• r(I) is the distribution of images we get when we sample x from r(x) and I from p(I|x): $$r(\mathbf{I}) = \int p(\mathbf{I|x})r(\mathbf{x})d\mathbf{x}$$
• p(I) is the “true” distribution of images. We never have access to the distribution itself, but typically have a dataset of $D$ samples from it. Having a dataset sampled from p is like having a $\rho$ that approximates p as a mixture of delta distributions: $p(\mathbf{I}) \approx \rho(\mathbf{I}) = \frac{1}{D}\sum_{i=1}^D \delta(\mathbf{I}-\mathbf{I}_i)$
• q(x|I) is the “pseudo-posterior” we get when we use p(I|x) as the likelihood and r(x) as the prior (see footnote 1). It’s “pseudo” since, in some sense, r(x) isn’t really a prior! (More on this in a moment). The pseudo-posterior is defined using Bayes’ rule with the “r” model (in other words, it’s the inference we would make if we assumed that r(x) was the correct prior): $$q(\mathbf{x|I}) = \frac{r(\mathbf{x})p(\mathbf{I|x})}{r(\mathbf{I})}$$
• q(x) is the “average pseudo-posterior” on the dataset: $q(\mathbf{x}) = \int p(\mathbf{I})q(\mathbf{x}|\mathbf{I})\,d\mathbf{I} \approx \frac{1}{D}\sum_{i=1}^D q(\mathbf{x}|\mathbf{I}_i)$.
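These definitions can be worked through exactly in a one-dimensional toy case. The sketch below assumes a setup that is not from the text above: r(x) is a standard normal, p(I|x) is Gaussian with mean x, and the "true" data distribution p(I) is itself Gaussian, so that q(x) has a closed form. The function name `average_pseudo_posterior` is hypothetical.

```python
def average_pseudo_posterior(data_mean, data_var, noise_var):
    """Mean and variance of q(x) when r(x) = N(0, 1), p(I|x) = N(x, noise_var),
    and the data distribution p(I) is N(data_mean, data_var).

    Each pseudo-posterior is q(x|I) = N(k * I, k * noise_var) with
    k = 1 / (1 + noise_var); averaging over I gives a Gaussian whose
    moments follow from the law of total variance.
    """
    k = 1.0 / (1.0 + noise_var)
    mean = k * data_mean
    var = k**2 * data_var + k * noise_var
    return mean, var

noise_var = 0.5

# Matched case: the data really come from r(I) = N(0, 1 + noise_var).
matched = average_pseudo_posterior(0.0, 1.0 + noise_var, noise_var)

# Mismatched case: the data come from some other distribution, here N(2, 0.25).
mismatched = average_pseudo_posterior(2.0, 0.25, noise_var)
```

In the matched case q(x) has mean 0 and variance 1, recovering r(x). In the mismatched case q(x) has mean 4/3 and variance 4/9, so the regularizer and the average pseudo-posterior disagree even though every individual inference was computed correctly.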

With this foundation in mind, let’s move ahead to the three main misconceptions of this post. But first, here are two bonus thoughts based on the above:

• Imagine generating new data by first selecting x then choosing I conditioned on x. Sampling x from the regularizer will result in images that do not look like the data, while if we first sample x from the average posterior, they will. Zhao et al (2016) used an analogous argument when they showed that alternately sampling from images and from the first layer of a hierarchical model is sufficient to sample all images [2], though they concluded from this that hierarchical models are in some sense fundamentally broken (a nice derivation but hasty conclusion IMHO).
• Estimates of mutual information are inflated when comparing each data point’s posterior to the regularizer, rather than comparing each posterior to the average posterior. This is only half of the reason why I have βeef with β-VAEs [3, 4], which will hopefully be the subject of a future post.
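The second point can be made precise with a short identity. This is a sketch under the assumption that the mutual information of interest is the one defined by the joint $p(\mathbf{I})q(\mathbf{x}|\mathbf{I})$; the symbol $\mathcal{I}_q$ is introduced only for this derivation.

$$\begin{aligned} \mathbb{E}_{p(\mathbf{I})}\left[KL(q(\mathbf{x}|\mathbf{I})\,||\,r(\mathbf{x}))\right] &= \mathbb{E}_{p(\mathbf{I})}\left[\int q(\mathbf{x}|\mathbf{I}) \log\frac{q(\mathbf{x}|\mathbf{I})}{q(\mathbf{x})}d\mathbf{x}\right] + \mathbb{E}_{p(\mathbf{I})}\left[\int q(\mathbf{x}|\mathbf{I}) \log\frac{q(\mathbf{x})}{r(\mathbf{x})}d\mathbf{x}\right] \\ &= \mathcal{I}_q(\mathbf{x};\mathbf{I}) + KL(q(\mathbf{x})\,||\,r(\mathbf{x})) \end{aligned}$$

The second line uses $\mathbb{E}_{p(\mathbf{I})}[q(\mathbf{x}|\mathbf{I})] = q(\mathbf{x})$. Since KL is never negative, comparing posteriors to the regularizer overestimates the mutual information by exactly $KL(q(\mathbf{x})\,||\,r(\mathbf{x}))$, which vanishes only when the regularizer equals the average pseudo-posterior.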

## Intuition 3: the prior is a free parameter

Now it’s really time to get to today’s common misconceptions.

Selecting the regularizer r(x) is one of many design choices for a model. For instance, I described above how some sparse coding models begin by selecting $\alpha$ to define a level of sparseness.

During fitting, the regularizer r(x) only acts as a guide, encouraging but not enforcing a distribution on the latents. For nearly all intents and purposes, the average pseudo-posterior q(x) is a better use of the term “prior” than r(x). While not perfect, q(x) is in some sense “closer” to obeying the rules of probability and Bayes’ Loop than r(x). Importantly, q(x) is determined as much by the likelihood p(I|x) and data distribution p(I) as it is by r(x). So while the choice of regularizer r(x) is free, the resulting “prior” we get back after fitting a model is not. Unfortunately, computing q(x) and expressing it concisely is not possible in general, so it’s not uncommon to see the assumption that $q(\mathbf{x})\approx r(\mathbf{x})$ in papers, but this is an assumption that needs to be verified!

In principle, we could try to fit the prior to data as well. Imagine that we begin with a regularizer r(x), then over the course of learning we adjust it to better match the average posterior q(x). Each new r(x) defines a new q(x), which we then use to update r(x). When the dust settles and the model converges, we hope to arrive at a “self-consistent” model where the regularizer matches the average posterior – the rare case where both r(x) and q(x) are promoted to the status of “prior.” Three things can go wrong with this approach: first, there is a degenerate solution where r(x) converges to a point. Congratulations! The prior equals every posterior, but the model is useless. This can be addressed by cleverly restricting the degrees of freedom of r. Second, we are always forced to select some parameterization of r(x), and there is no guarantee that q(x) can be expressed in a chosen parametric family of distributions. There will almost always be some lingering mismatch. Third, we may thus opt for an extremely flexible family of r(x) only to find that results are less interpretable. Sparse priors are popular in part because they are interpretable. Super powerful semi-parametric models generally aren’t (see any paper using “normalizing flows” to define the prior [5, 6]). Depending on your goals in a given context, of course, this may be an acceptable trade-off.
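The iterative scheme can be sketched in a one-dimensional Gaussian toy case where every step is available in closed form. The assumptions here are mine, not part of the argument above: zero-mean Gaussian data, a fixed likelihood p(I|x) = N(x, noise_var), and a regularizer r(x) = N(0, s) whose variance s is the only thing being refit. The function name is hypothetical.

```python
def refit_regularizer_variance(data_var, noise_var, init_var=1.0, num_iters=200):
    """Iterate r(x) = N(0, s) -> q(x) -> r(x), matching variances at every step.

    Assumes zero-mean Gaussian data with variance data_var and a fixed
    likelihood p(I|x) = N(x, noise_var); only the regularizer variance s changes.
    """
    s = init_var
    for _ in range(num_iters):
        k = s / (s + noise_var)              # each pseudo-posterior has mean k * I
        s = k**2 * data_var + k * noise_var  # variance of the average pseudo-posterior
    return s

self_consistent = refit_regularizer_variance(data_var=3.0, noise_var=1.0)
collapsed = refit_regularizer_variance(data_var=0.5, noise_var=1.0)
```

In the first call the iteration settles at s = 2, where the model's marginal variance (s plus the noise variance) equals the data variance and r(x) matches q(x). In the second call the data are less variable than the noise model alone, and s shrinks toward zero, which is one simple way the collapse to a point can show up in this toy setting.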

## Intuition 4: complicated priors imply complicated posteriors

Figure: In a well-calibrated model, the prior is equal to the average posterior. Each individual posterior per input (per image, e.g.) may be simple and easy to express, while taken collectively they map out a prior that is complex.

The sparse linear Gaussian model, in addition to being mathematically “nice” to analyze and implement, has the property that when it is fit to natural images, the individual latent features tend to pick out small oriented edges, much like the sorts of low-level visual features picked up by canonical V1 neurons.

Even if we ignore for a moment that a model with sparse sums of oriented edges doesn’t capture the distribution of natural images well, we can appreciate the difficulty in describing the prior for all the ways in which edges naturally co-occur – in extended lines, in curves, in textures, all of these possibly contiguous for long distances even behind occluders. Any reasonable prior for low-level visual features like edges is going to be complicated. Pseudo-posterior inference with a sparse prior is already hard enough, so the existence of a complicated prior surely makes the problem truly intractable, right?

Not necessarily! In the figure above to the right, I’ve sketched a cartoon to visualize how complicated priors may arise from simple posteriors. Think of it this way – edges may co-occur in images in complex ways, but when was the last time this kept you from seeing a particular edge in a particular image?

(Another way to say this is that inference with a complicated prior is only hard when the likelihood is uninformative, since the posterior is then more similar to the prior. When the likelihood is informative, the posterior is more similar to the likelihood, as in typical well-lit viewing conditions of natural scenes.)
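The parenthetical above can be illustrated on a grid. In this sketch a two-component Gaussian mixture stands in for a "complicated" prior, and the likelihood is Gaussian with an adjustable noise level; all numerical values and the helper names `count_modes` and `posterior_on_grid` are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def count_modes(density, rel_threshold=1e-6):
    """Count strict local maxima whose height exceeds rel_threshold * max(density)."""
    interior = density[1:-1]
    is_peak = (interior > density[:-2]) & (interior > density[2:])
    return int(np.sum(is_peak & (interior > rel_threshold * density.max())))

x = np.linspace(-6.0, 6.0, 2401)
dx = x[1] - x[0]
# Stand-in for a complicated prior: two well-separated bumps.
prior = 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)

def posterior_on_grid(observation, noise_std):
    """Posterior over x given I = x + noise, normalized on the grid."""
    unnormalized = prior * normal_pdf(observation, x, noise_std)
    return unnormalized / (unnormalized.sum() * dx)

sharp = posterior_on_grid(observation=2.0, noise_std=0.2)  # informative likelihood
vague = posterior_on_grid(observation=2.0, noise_std=5.0)  # uninformative likelihood

modes = (count_modes(prior), count_modes(sharp), count_modes(vague))
```

With the informative likelihood the posterior is a single bump even though the prior has two; with the uninformative likelihood the posterior inherits both modes of the prior.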

## Intuition 5: complicated posteriors imply complicated priors

The reverse of the previous point can happen as well. Not only do complicated priors not necessarily imply complicated posteriors, but many complicated posteriors may conspire to fit together, summing to a simple prior!

Figure: “Complicated” posteriors may, in principle, sum up to result in a simple prior like the pieces of a puzzle, each with an irregular shape, fitting together to form a simpler whole.

If this visualization seems contrived, just think of what happens in the sparse linear Gaussian model with $0 < \alpha < 1$ that has “explaining away” (e.g. the classic “overcomplete” models). The prior is “simple” by design – it has one parameter and is unimodal. Each individual posterior is in general multi-modal, as the same image may be described as the sum of different subsets of features. Following the logic of Bayes’ Loop, all of these multi-modal posteriors must necessarily sum to give us back the simple prior (see footnote 2)!
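A much smaller example than an overcomplete sparse coding model shows the same effect. The sketch below assumes a toy model of my own choosing: a standard normal prior on a scalar x and an observation I = |x| + noise, so the sign of x is lost and posteriors are typically bimodal. It closes Bayes’ Loop by sampling from the model itself and averaging the resulting posteriors on a grid.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
noise_std = 0.3
x_grid = np.linspace(-5.0, 5.0, 401)
dx = x_grid[1] - x_grid[0]
prior = normal_pdf(x_grid, 0.0, 1.0)  # simple, unimodal r(x)

# Bayes' Loop: x_i ~ r(x), then I_i ~ p(I|x_i) with I = |x| + noise.
x_samples = rng.standard_normal(2000)
observations = np.abs(x_samples) + noise_std * rng.standard_normal(2000)

# One posterior per observation, evaluated on the grid (rows index observations).
likelihood = normal_pdf(observations[:, None], np.abs(x_grid)[None, :], noise_std)
unnormalized = likelihood * prior[None, :]
posteriors = unnormalized / (unnormalized.sum(axis=1, keepdims=True) * dx)

average_posterior = posteriors.mean(axis=0)
l1_gap = np.sum(np.abs(average_posterior - prior)) * dx

# A single posterior for a clearly nonzero observation has two modes, near +/- I.
single = prior * normal_pdf(1.5, np.abs(x_grid), noise_std)
interior = single[1:-1]
num_modes = int(np.sum((interior > single[:-2]) & (interior > single[2:])))
```

Here `num_modes` counts two peaks in an individual posterior, while `l1_gap`, the integrated absolute difference between the averaged posteriors and the prior, is small and shrinks as more samples are drawn. Because the data here are sampled from the model itself, a large gap would point to a bug rather than to model mismatch.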

## Footnotes

1. Writing q(x|I) may also call to mind approximate inference methods, since even computing the pseudo-posterior as described above may be intractable. In this case, it’s natural to define q(x|I) as the approximation, and q(x) as the average-approximate-pseudo-posterior.
2. What I’m calling Bayes’ Loop describes a useful diagnostic tool for checking that a generative model and inference algorithm are implemented correctly. If it’s all implemented correctly, then you should be able to draw samples $\mathbf{x}_i \sim r(\mathbf{x})$ and use them to create a pseudo dataset $\mathbf{I}_i \sim p(\mathbf{I|x}_i)$. If the average of the posteriors $q(\mathbf{x}|\mathbf{I}_i)$ (or the average sample from the different posteriors) doesn’t match r(x), there must be a bug!

## References

1. Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research.
2. Zhao, S., Song, J., & Ermon, S. (2016). Learning Hierarchical Features from Generative Models.
3. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M. M., … Lerchner, A. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR.
4. Alemi, A. A., Fischer, I., Dillon, J. V, & Murphy, K. (2017). Deep Variational Information Bottleneck. ICLR, 1–19.
5. Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. ICML, 37, 1530–1538.
6. Kingma, D. P., Salimans, T., & Welling, M. (2016). Improving Variational Inference with Inverse Autoregressive Flow. Advances in Neural Information Processing Systems.

## Common Misconceptions about (Hierarchical) Generative Models, Part 1

It’s generally understood that the brain processes the visual world hierarchically, starting with low-level features like patches of color, boundaries, and textures, then proceeding to a representation of whole scenes consisting of objects, people, and their relations. And, as I’ve written in previous posts, we also have reason to believe that the brain has learned a probabilistic generative model of the world where data come in through the senses analogously to raw pixels from a camera, and percepts correspond to structures in the world that the brain infers as the causes of the data.

Consider how images of random trees might be generated. First, choose a species, say a willow.  Next, generate its parts: trunk, branches, leaves, etc. Keep breaking parts down by generating their sub-parts, like how a twig contains leaves and a leaf contains a stem and veins. Finally, “render” each sub-part into a small patch of the image. This process is a hierarchical generative model of the tree’s image – we start with an abstract high-level idea of an object, then hierarchically generate more and more specific parts. Inverting this generative model (i.e. doing inference) means working in the reverse direction: starting with pixels, detecting sub-parts, aggregating to parts, and ultimately recognizing objects and scenes.

Hierarchical generative models are important both for neuroscience and for machine learning (ML) and AI. Yet, we lack general, effective methods to fit them to data, and our understanding of the brain’s internal model is correspondingly limited. This may be due, in part, to a few intuitions that have been muddled in the field. At the very least, they’ve been muddled in my own mind until recently, so I wanted to write a series of posts to share some recent insights. What follows is a series of intuitions I’ve held myself or heard from others, which are, in some big or small way, not quite right.

In total, I have 7 intuitions to be broken down into 3 posts. Stay tuned for the rest!

### Intuition 1: the brain approximates the true model of the world

What does it mean for the brain to learn an “internal model” of the world? The first (naïve) definition might be this: entities, actions, and relations that exist in the world are somehow represented by states of the brain. When looking at an image of a willow tree and its constituent parts, the brain encodes a description of them – sub-patterns of neural activity corresponding to “trunk,” “leaves,” “rustling,” and the entire tree itself.

According to the naïve definition of an internal model, the state of the world is reflected by states of the brain. I like to visualize this as a landscape reflected in water – a near-copy of the world flipped upside down and juxtaposed with it, with only a thin barrier between them. The world (W) literally reflected in the model (M). This captures the intuition that the hierarchical structure in the world is directly inverted by perceptual processes in the brain.

Of course, I call this the naïve view for a reason. The brain does not mirror the world as it really is – how could it [1]? Instead, the brain must have learned its own set of variables. For instance, we often imagine V1 as having learned a model for oriented edges in a scene. And while research has shown that detecting edges and other such low-level image features is an important first step towards scene understanding, I’m skeptical of any argument that those oriented-edge features exist “in the world.” At best, they exist “in the image,” but the image is a property of the observer, not the environment being perceived! The reflection metaphor deserves an update.

By this I mean more than the old adage that “all models are wrong.” Allow me to explain further…

It’s common to see hierarchical descriptions of the world (as my willow tree example above) side by side with hierarchical descriptions of sensory processing in the brain. But even if the world is truly hierarchical and the brain processes it hierarchically, that does not imply that the brain “inverts” the world model at every stage! The world is hierarchical in the sense that objects contain parts which contain sub-parts, but the object-recognition processes in the brain are often thought to proceed from edges to textures to shapes back to objects. Even if the brain arrives at a fairly accurate representation of objects in the end, it takes its own route to get there. Image textures are not object parts.

It’s also worth noting that in statistics there is typically an implicit assumption that the data were generated from an instance of the model class being fit. It’s rarely stated that data are generated from the process W but fit using the model family M (see footnote 1). This has important consequences that break some of the fundamental rules of probability like Bayes’ rule and the chain rule. I’ll explain more in the next post.

### Intuition 2: hierarchical models are basically just more complicated priors

Setting questions about the “true” model of the world aside for a moment, what else are hierarchical models good for? From a purely statistical standpoint, hierarchical models allow us to fit more complex data distributions. Understanding this requires defining the marginal likelihood of a model.

The marginal likelihood or model evidence is a probability distribution over the input space. Using the tree example above, it would be a distribution over all possible images of trees that might be generated by our tree-generating procedure. Recall that our procedure for generating trees involved first selecting a species, then generating the trunk, then limbs, then twigs, then leaves, etc. The marginal likelihood is the probability of getting a given image, summed over all possible runs of this procedure weighted by how likely each one is (see footnote 2):

$P(Im) = \hspace{-0.5em} \sum\limits_{s\in\text{species}} \hspace{-0.5em}P(s) \hspace{-0.5em} \sum\limits_{t\in\text{trunks}} \hspace{-0.5em}P(t | s) \hspace{-0.5em} \sum\limits_{b\in\text{branches}} \hspace{-0.5em}P(b|t) \hspace{-0.5em} \sum\limits_{l\in\text{leaves}} \hspace{-0.5em}P(l|b) \times P(Im|l)$

…which is, of course, the procedure for marginalizing over other variables in the model, hence the name.

Figure: This image of a willow tree can be thought of as the result of a generative process from the trunk to branches to leaves and finally to the image. These variables depend on each other in ways that produce the visible structure in the image.

(Notice that I dropped the ‘species’ variable since 4 layers of hierarchy is plenty for now.) We can imagine what would happen if we removed the trunk and branch variables, generating an image by placing leaves anywhere and everywhere at random (see footnote 3).

Figure: This synthetic image matches the low-level texture statistics of the previous image. Think of it as a model of leaves without the structure given by branches. Using texture alone results in an image that lacks any overall structure.

But what if we knew enough about how leaf positions are distributed without writing the full model of tree trunks and branches? If our goal is to generate realistic tree-leaf images (because why wouldn’t it be), we could in principle get away with directly modeling the dependencies between leaves.

Figure: Explicitly modeling the dependencies between leaves (a fancy prior) can result in complex images too.

Another way to say this is that, from the perspective of the marginal likelihood over images, higher-level variables simply serve to induce a fancy prior on the lower-level variables. Mathematically, we can write

${\color{blue} P_\text{fancy}(l)} = \hspace{-0.5em} \sum\limits_{t\in\text{trunks}} \hspace{-0.5em}P(t) \hspace{-0.5em} \sum\limits_{b\in\text{branches}} \hspace{-0.5em}P(b|t) \, P(l|b)$

So that the marginal likelihood is simply

$P(Im) = \hspace{-0.5em} \sum\limits_{l\in\text{leaves}} {\color{blue} P_\text{fancy}(l)} \times P(Im|l)$
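As a sanity check on this rewriting, here is a small discrete sketch in which every conditional distribution is a random probability table. The table sizes and variable names are hypothetical; the point is only that marginalizing trunks and branches ahead of time yields a prior over leaves that gives exactly the same distribution over images as the full hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_conditional(num_parents, num_children):
    """Random table with rows indexed by the parent value; each row sums to 1."""
    table = rng.random((num_parents, num_children))
    return table / table.sum(axis=1, keepdims=True)

# Hypothetical sizes: 3 trunks, 4 branch configurations, 5 leaf configurations, 6 images.
p_trunk = random_conditional(1, 3)[0]            # P(t)
p_branch_given_trunk = random_conditional(3, 4)  # P(b|t)
p_leaf_given_branch = random_conditional(4, 5)   # P(l|b)
p_image_given_leaf = random_conditional(5, 6)    # P(Im|l)

# Full hierarchy: sum over t, b, and l explicitly.
p_image_full = np.einsum("t,tb,bl,li->i", p_trunk, p_branch_given_trunk,
                         p_leaf_given_branch, p_image_given_leaf)

# "Fancy prior": marginalize t and b once, leaving a distribution over leaves only.
p_fancy = p_trunk @ p_branch_given_trunk @ p_leaf_given_branch
p_image_flat = p_fancy @ p_image_given_leaf
```

The two routes agree (`p_image_full` and `p_image_flat` match to numerical precision), and `p_fancy` is a proper distribution over leaf configurations.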

This means that purely from a marginal likelihood standpoint, a hierarchical model is nothing but a fancy prior (see footnote 4). So why should we prefer truly hierarchical models to, say, super flexible (“fancy”) families of priors? (I’m looking at you, normalizing flows). Here are a few reasons:

1. Representation matters. Explicitly representing higher-level variables is almost certainly going to be useful.
2. Despite what I wrote in intuition 1, using a hierarchical model to represent a hierarchical world is a good idea. (Ok, so this is just another way of saying that representation matters)
3. Inductive biases. There may be many ways to construct arbitrarily flexible priors, but not all of them will be equally effective. Hierarchical models for perception may generalize better from limited data.
4. Computational tractability. Most sampling and message-passing algorithms for inference become simple when dependencies between local variables are simple. For example, it would be relatively easy to write an algorithm that actually generates branches and attaches leaves to them. This is an example of a simple dependency. Imagine the mess of code it would take to generate each leaf individually conditioned on where the other leaves are!

In the next post(s), I’ll describe how complex priors do not necessarily imply complex posteriors (and vice versa), how model mismatch breaks basic assumptions of probability, and more!

### Footnotes

1. This is not a point about approximate inference. For instance, we might want to fit model M to data generated by W, but even M could be intractable. We would therefore resort to some approximation Q that gets close to M but is tractable. Now we have to juggle three distinct ideas: the true data-generating process (W), our best approximation to it in a model (M), and the inferences we actually draw (Q).
2. I’m simplifying here by assuming that the image is generated directly from the set of leaves of the tree. Perhaps $P(Im|s,t,b,l)$ would be more reasonable, since it would allow the image to depend on the species and branches of the tree directly. But! Most work on deep generative models makes the same simplification where the data depend only on the lowest level variable [2]. This might be reasonable if you think the “lowest level” is image features rather than object parts, as I discussed in intuition 1.
3. I’m again committing the fallacy here that I warned about in intuition 1: I’m being imprecise about the difference between features of the image and features of objects. Here it’s a matter of convenience – I don’t actually have a 3D generative model of trees to render examples from.
4. To add a comment about brains, this means that from the perspective of V1, a complicated statistical prior over natural images could in principle be feed-forward. If cortico-cortical feedback is to be understood as priors, it must be because separating the representation of high- and low-level things is useful, and/or because the brain’s model and inference algorithm are “locally simple” as described above.
### References

1. Feldman, J. (2016). What Are the “True” Statistics of the Environment? Cognitive Science, 1–33. http://doi.org/10.1111/cogs.12444
2. Zhao, S., Song, J., & Ermon, S. (2016). Learning Hierarchical Features from Generative Models. https://arxiv.org/abs/1702.08396
3. Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. https://arxiv.org/pdf/1505.05770.pdf

## The new behaviorism of deep neural networks

About a month ago, I had the chance to attend the CCN conference in Philadelphia. This post is not about all the great talks and posters I saw, the new friends I made, nor the fascinating and thought-provoking discussions I had. It’s a great conference, but this is a post about a troubling and ironic theme that I heard more than a few times from multiple speakers. The troubling part is that behaviorism is making a comeback. The ironic part is that it is driven by a methodology that is intended to replicate and elucidate the details of mental representations: deep neural networks.

In 2014, a landmark paper by Dan Yamins and others in Jim DiCarlo’s lab set the stage: they essentially showed that each layer in a 4-layer “deep” neural network trained to do object recognition could be mapped to representations found along the primate ventral stream, which is known to be involved in visual object recognition in the brain. Importantly, they went a step further and showed that the better a neural network was at classifying objects, the better it was at explaining representations in the ventral stream. This was (and is) a big deal. It was proof of a theory that had been floating in many researchers’ minds for a long time: the ventral stream analyzes the visual world by hierarchical processing that culminates in disentangled representations of objects.

So where do we go from there? Since 2014, the deep learning community has progressed in leaps and bounds to bigger, faster, better performing models. Should we expect Yamins et al’s trend to continue – that better object recognition gives us better models of the brain for free? The evidence says no: sometime around 2015, “better deep learning” ceased to correlate with “more brain-like representations.”

This is why I was surprised to hear so many speakers at CCN suggest that, to paraphrase, “to make neural networks better models of the brain, we simply need bigger data and more complex behaviors.” It all reduces to inputs and outputs, and as long as we call the stuff in between “neural,” we’ll get brain-like representations for free!

I’m not alone in questioning the logic behind this approach. A similar point to mine was articulated well by Jessica Thompson on Twitter during the conference:

Of course, a neural network that solves both problem A and problem B will be more constrained than one that solves either A or B alone. Bigger data makes for more constrained models, as long as the model’s outputs – its behaviors – are limited. Is it obvious, though, that adding complexity to the behaviors we ask of our models will likewise push them towards more human-like representations? Is it clear that this is the most direct path towards AI with human-like cognitive properties? My concern is for a research program built around neural networks that nonetheless fixates on the inputs and outputs, stimulus and behavior. This is simply behaviorism meets deep learning.

Now, nearly every cognitive scientist I’ve ever met is happy to denounce the old, misguided doctrines of “behaviorism.” Calling someone a behaviorist could be taken as a deep insult, so allow me to clarify a few things.

## An extremely brief history of behaviorism

The behaviorists were not as crazy as folk-history sometimes remembers them to be. To the more extreme behaviorists, led by B. F. Skinner, there was no explanatory power in internal representations in the mind, since they were assumed to be either unobservable (at best based on introspection) or reducible to inputs and outputs (Skinner, 1953, p.34). It should be noted, however, that even Skinner did not reject the existence of mental representations themselves, nor that they were interesting objects of scientific study. He simply rejected introspection, and hoped everything else would have a satisfying explanation in terms of a subject’s lifetime of stimuli and behaviors. This is not unlike the suggestion that the representations used by neural networks should be understood in terms of the dataset and the learning objective. So, why did behaviorism fall out of favor?

Behaviorism’s decline began with the realization that there are many aspects of the mind that are best understood as mental representations and are not easily “reducible” to stimuli or behavior – perhaps not a surprising claim to a modern reader. The classic example is Tolman’s discovery of cognitive maps in rats. Tolman demonstrated that mental representations are not only useful and parsimonious explanations, but are also measurable in the lab. Historically, his results spurred a shift in the emphasis of psychologists from measurement and control of behavior to understanding of the mental representations and processes that support it.

As in Yamins et al (2014), this has always been the goal of using deep neural networks as models of the brain: starting with the right architecture, optimizing for a certain behavior gives us brain-like representations “for free.” Wouldn’t it be ironic then if deep neural nets led cognitive scientists back to behaviorism?

## What are the alternatives?

The alternative to the behaviorist approach is that our models in cognition and neuroscience should be guided by more than just matching the inputs and outputs of the brain.1 The difficult but incredibly important problem here is characterizing the right constraints on the stuff in between. Training their model to do object recognition was interesting, but I think the success of Yamins et al (2014) came from their 4-layer model architecture, which was designed to match known architectural properties of the ventral stream.2 It’s perhaps no surprise that neural networks pushing hundreds of layers have ceased to be good models of the ventral stream.

So, what kinds of constraints should we put on our model architectures? This problem needs to be approached from many directions at once: anatomical constraints of what connects to what, functional constraints on the class of computations done by each part of the model, and normative constraints like Bayesian learning and inference of latent variables. We need to look to ideas from the unsupervised learning literature on what makes good “task-independent” representations. In other words, our models need the right inductive biases. They should mimic human learning not just in the “big data” regime with millions of input/output examples, but in the limited-data regime as well.

This is not an exhaustive set of criteria and I don’t claim to have the right answer. However, I do believe that anyone interested in understanding how the brain works needs to invest more in understanding anatomical, functional, and normative constraints on representations than simply pushing in the direction of task-optimized black-boxes.

## References

• Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex
• Skinner, B. F. (1953). Science and human behavior
• Tolman, E. C. (1948). Cognitive maps in rats and men. Psychological review

## Footnotes

1. For a philosophical perspective, consider John Searle’s Chinese Room thought experiment. Searle imagines a room with a mail slot, where queries written in Chinese characters are the “input” and responses are passed back out through a different slot, also written in Chinese (let’s say Mandarin). From an input/output perspective, the system appears to be an intelligent agent who is fluent in Mandarin and providing thoughtful answers to the input queries. The catch in Searle’s thought experiment is that inside the room is someone who only speaks English, but who has a very large book of “rules” for handling different types of inputs. This person, with the help of the book, is implementing an AI program that can intelligently respond to questions in Mandarin. This thought experiment can be taken in a few different directions. Here, I would argue that it’s an example of how simply matching the inputs and outputs of some desired behavior doesn’t necessarily give what you would expect or want in between.
2. Further work led by Yamins & DiCarlo has continued in the direction of testing various architectural constraints like using recurrence.

## Why Probability? (Part 2)

It seems that the vast majority of mathematical formalisms in cognitive science, psychophysics, and neuroscience deal with probability theory in one way or another. This includes formalisms for how the brain models the world using probabilistic models, how constant stimuli elicit variable neural and behavioral responses, how words and concepts map onto the world, and more.

The thing about formalisms, though, is that they are often open to more than one interpretation. On its home turf of formal mathematics, probability theory makes statements about the behavior of sets of things (like the set of all integers), functions on those sets (like $P(k)=0.34$), and other things you can derive from them. What it doesn’t say is whether a statement like $P(number\_of\_puppies > 1)=0.3$ tells you anything about the real-world possibility of encountering two or more puppies. On the one hand, there is the formal mathematics stating that there is a random variable $number\_of\_puppies$ with a defined probability mass function $P$ which, when summed over all integers larger than $1$, evaluates to $0.3$. On the other hand, there either are adorable small dogs or there are not, and it seems that this should be a closed case. The mathematics gives us a valid statement; but interpretation of such a statement as it pertains to real-world events is in the realm of philosophy. The same goes for neuroscientists and cognitive scientists – we need to be sure, independently of the mathematics, that statements about probability are grounded in reality. This means that first we need to have a solid understanding of what probability means.

Anybody using probability theory (or information theory, decision theory, signal detection theory, rate distortion theory, etc.) ought to grapple with these philosophical questions. In fact, everybody should do this because it’s interesting and fun. In the remainder of this post, I will attempt to outline my own perspective.

## True Randomness vs Ignorance

tl;dr we use probabilistic models not because something is random, but because we’re missing some information about it.

Although it’s intuitive to think that the ideas of probability only bear on random events, here I’ll try to convince you that approaching a problem probabilistically is more an admission of ignorance than a claim of randomness. Or, if you like, to claim that a system is random is a claim of ignorance. Note that by “random” I don’t mean “completely unpredictable.” For example, if you roll and sum two standard dice, the result is both “random” as well as more likely to be a 7 than a 2. That is, $$P(die_1+die_2=7) > P(die_1+die_2=2)$$
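To make the inequality concrete, here is a minimal sketch that enumerates all 36 equally likely outcomes of two fair six-sided dice and tallies the sums. The function name `sum_distribution` is just an illustrative choice.

```python
from fractions import Fraction
from itertools import product


def sum_distribution(sides=6):
    """Exact distribution of the sum of two fair dice, by enumeration."""
    counts = {}
    for die_1, die_2 in product(range(1, sides + 1), repeat=2):
        total = die_1 + die_2
        counts[total] = counts.get(total, 0) + 1
    n_outcomes = sides ** 2
    return {total: Fraction(c, n_outcomes) for total, c in counts.items()}


dist = sum_distribution()
print(dist[7], dist[2])  # 1/6 1/36
```

Six of the 36 outcomes sum to 7, but only one sums to 2, so the result is "random" and still far from uniformly unpredictable.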

Speaking of dice, probability theory has its historical roots in gambling. Back in the 17th century, there was an immediate practical application to the idea that, when you roll a 6-sided die, there is a 1/6 chance that each number comes up, rather than attributing success to luck or to fate.

Gambling examples help reveal the dual nature of probability as both a frequentist count of the fraction of times an event happens after many, many repeats, and as a belief about single events that haven’t happened yet.

In the frequentist perspective, the statement $P(die=4)=1/6$ is a claim that, repeatedly rolling the die under the same conditions $N$ times, the fraction of times $4$ appeared will approach $1/6$ as $N$ gets large. In this view, it makes little sense to talk about the probability of a single isolated event, just as the statement $P(number\_of\_puppies > 1)=0.3$ could not be interpreted as a “yes” or “no” answer to the Big Questions, “will there be puppies? here? soon?”
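The frequentist reading can be sketched as a simulation, assuming a pseudo-random number generator is an acceptable stand-in for a fair die. The helper `fraction_of_fours` is a hypothetical name; the point is only that the observed fraction settles near $1/6$ as $N$ grows.

```python
import random


def fraction_of_fours(n_rolls, seed=0):
    """Fraction of simulated fair-die rolls that come up 4."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 4)
    return hits / n_rolls


# The estimate is noisy for small N and tightens around 1/6 for large N.
for n in (10, 1_000, 100_000):
    print(n, fraction_of_fours(n))
```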

This brings us to the notion of probability as a belief. Under this interpretation, we don’t need $N$ repeats of the same circumstances as $N$ gets large. Instead, probability is like a graded prediction for each future event. Based on the event’s actual outcome, it affords us a graded level of satisfaction (if the actual event was predicted with high probability) or surprise (if the event was predicted with low probability).1

Frequencies and belief perhaps should be related, but there is no mathematical law requiring them to be. Again, the interpretation of formal mathematics is outside the realm of formal mathematics. There are, however, plenty of practical and epistemological arguments why probability as a belief should be calibrated to probability as a frequency. You may strongly believe that my dice are loaded and that $3$ is the most likely number. But if it is truly a fair die, then I will always win money in the long run by having beliefs that are closer to the true frequencies (assuming bets are based on beliefs).2 Calibration of probabilities as beliefs is a practical concern in machine learning; it is dangerous for a virtual doctor to make poorly calibrated diagnoses, even if it typically gets the “most likely” answer correct. For example, if a patient is 55% likely to have disease A and 45% likely to have disease B, but a model spits out 90% for A and 10% for B, then doctors may proceed over-confidently with the wrong treatment. Even though the math police do not make arrests for poorly calibrated beliefs, it’s still a good idea to get them right.
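One way to put a number on the cost of miscalibration in the disease example is the expected surprise (negative log probability) of each set of beliefs under the true frequencies. Scoring beliefs this way is my assumption here, not the only option; the sketch uses the 55/45 and 90/10 numbers from the paragraph above.

```python
import math


def expected_surprise(true_probs, believed_probs):
    """Average negative log probability (in nats) that the beliefs assign
    to outcomes that actually occur with frequencies `true_probs`."""
    return -sum(p * math.log(q) for p, q in zip(true_probs, believed_probs))


true_probs = [0.55, 0.45]  # disease A, disease B
calibrated = expected_surprise(true_probs, [0.55, 0.45])
overconfident = expected_surprise(true_probs, [0.90, 0.10])

print(round(calibrated, 3), round(overconfident, 3))
```

Both sets of beliefs name disease A as the most likely diagnosis, yet the over-confident one is surprised much more on average, because it is badly wrong whenever B turns out to be the answer.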

Perhaps the best way to calibrate your beliefs would be to take the frequentist approach and repeat an experiment $N$ times and simply count how many times each outcome occurs. Immediately you run head-first into another philosophical conundrum: what counts as a repeat? Let’s look at the dice example more closely. Can you roll a jar full of different colored dice, or do you need to roll the same die each time? Does it count if you roll it on different surfaces? If each of your $N$ rolls is under slightly different circumstances, then are you justified in aggregating all the results together when estimating frequencies? Clearly there are many factors influencing the outcome of the die (see the next box-and-arrow diagram). If we let all of the “causes” in this diagram vary freely, we’ll see that the results are, well, fairly unpredictable.

Next, you might (rightly) suggest that introducing mis-shapen dice into our experiment should not count as “repeats” of the experiment. Remember: the goal of the experiment is to estimate the actual frequencies of a 6-sided die. Let’s see what happens when we control for the shape.

But why stop there? There is quite a bit of variability between the surfaces still. Might that affect the results? (In the above animations, reddish = bouncy and blueish = slippery). Let’s control for surface variety.

Great. Surface variations are now accounted for. Still, our die experiment has a lurking cause. Let’s deal with the final variable.

Hmm – we get the same roll every time. Dice are the gold standard of random. They show up in every probability theory textbook. Probability was invented to make sense of them. Yet, given more and more control of the context, their randomness evaporates. Much of the world is, of course, deterministic in this sense. The frequentist idea of probability requires that we precisely repeat an experiment $N$ times, just not too precisely. This is not at all a criticism of the frequentist perspective; having well-calibrated beliefs about a well-controlled die would, of course, mean predicting that the result is almost always the same (conditioning beliefs on some additional knowledge is akin to controlling more conditions of a frequentist’s experiment).
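The same progression can be mimicked with a toy model. The sketch below is not a physics simulation: `roll` is a made-up deterministic rule standing in for the real dynamics, and the three causes (`shape`, `surface`, `throw`) are each reduced to an integer from 0 to 5. It only illustrates that a fully deterministic system looks random for as long as some cause is left uncontrolled.

```python
import math
import random
from collections import Counter


def roll(shape, surface, throw):
    """Toy stand-in for physics: the face is a fixed function of its causes."""
    return (5 * shape + 3 * surface + throw) % 6 + 1


def entropy_bits(outcomes):
    """Empirical entropy of a list of outcomes, in bits."""
    n = len(outcomes)
    return sum(c / n * math.log2(n / c) for c in Counter(outcomes).values())


def run(n_rolls, controlled, seed=0):
    """Roll n_rolls times; causes named in `controlled` are held at 0,
    all other causes vary freely from roll to roll."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_rolls):
        causes = {
            name: 0 if name in controlled else rng.randrange(6)
            for name in ("shape", "surface", "throw")
        }
        outcomes.append(roll(**causes))
    return entropy_bits(outcomes)


for controlled in ([], ["shape"], ["shape", "surface"],
                   ["shape", "surface", "throw"]):
    print(controlled, round(run(6000, controlled), 2))
```

In this toy version the outcome stays close to maximally unpredictable (about $\log_2 6 \approx 2.58$ bits) until the last free cause is pinned down, at which point the entropy drops to zero.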

So, dice are not really “random” after all, as long as we know enough about the context and have all the computing power in the world to aid us in making our predictions. This is where probability comes in. The reason we say a die has a 1/6 probability of landing on each number is because we never roll it the same way twice, and we never know the exact physical parameters of each roll. Probability theory allows us to make sense of the world despite our ignorance of (or inability to compute) the minutiae that make every event unique.

(In case it is not clear by now, I am using the term “ignorance” to simply mean missing some information about a system. Complex systems have many interacting parts; if only some of them are observed, then the unobserved parts may impart forces that appear “random” in many cases.)

##### Quick side-note on randomness in quantum physics for those who are interested…

Dice may be the gold standard of randomness to statisticians, but the quantum world is the only “true” source of randomness to a physicist. Modern physics has largely accepted this philosophy that “uncertainty” in quantum physics is random in the truest sense of the word. Take the Heisenberg uncertainty principle, which states that the better you know a particle’s location the less you can know about its momentum (or vice versa); there is no “less ignorant” observer who could, even in principle, make a more accurate prediction. This is true randomness. Einstein, for what it’s worth, always hoped that quantum physics might some day be understood in deterministic terms, where what we currently think of as “random” might be explained as the workings of yet smaller or stranger particles “behind the scenes” that we simply haven’t observed yet. That is, we see the quantum world as “random” because we’re ignorant of some other hidden aspect of the universe.3 When Einstein wrote

God does not play dice with the universe.

clearly he had in mind a more random kind of die than I’ve described here. Perhaps he meant

God does play dice with the universe, but we can’t see how the dice are rolled (yet)

## Are neurons really random? Or, what did one brain area say to the other?

tl;dr maybe, but given everything above, it’s a moot point regardless.

A commonplace in neuroscience textbooks is the statement that neurons are stochastic. The classic example is that the same exact image can be presented on a screen many times, but neurons in visual cortex never seem to respond the same way twice, even when reducing our measurements to something as simple as the total spike count ($\mathbf{r}$). Models typically assume that the stimulus – and some other factors – set the mean firing rate of the neuron ($\lambda$), but that spikes are Poisson-distributed given that mean rate:
$$P(\mathbf{r}) = \frac{\lambda^\mathbf{r}e^{-\lambda}}{\mathbf{r}!}$$
As with the dice above, we can ask what unobserved factors contribute to the apparent randomness of sensory neurons. A back-of-the-envelope sketch might be the following: Sure enough, there is evidence that the “randomness” of sensory neurons begins to evaporate the better we control for eye movements, fluctuating background attention levels, etc. The more we understand and can control for these latent factors influencing sensory neurons, the less the Poisson distribution will be the appropriate choice for computational neuroscientists (and a few alternatives have been proposed under the banner of “sub-Poisson variability” models).
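As a rough illustration of how an unobserved factor inflates apparent randomness, the sketch below draws Poisson spike counts whose rate $\lambda$ is scaled by a hidden "gain" (think of it as a fluctuating attention level). The base rate of 10 and the gain values are made-up numbers, and the Fano factor (variance divided by mean, equal to 1 for a Poisson distribution) is my choice of summary statistic.

```python
import math
import random
from statistics import mean, pvariance


def poisson_pmf(r, lam):
    """P(r) = lam^r * exp(-lam) / r!, the formula given above."""
    return lam ** r * math.exp(-lam) / math.factorial(r)


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-lam)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def fano_factor(counts):
    return pvariance(counts) / mean(counts)


rng = random.Random(0)
base_rate = 10.0  # hypothetical mean spike count per trial

# Uncontrolled: a hidden gain changes the rate from trial to trial.
gains = [rng.choice([0.5, 1.0, 1.5]) for _ in range(20_000)]
uncontrolled = [sample_poisson(base_rate * g, rng) for g in gains]

# Controlled: the gain is held fixed at 1.0 on every trial.
controlled = [sample_poisson(base_rate, rng) for _ in range(20_000)]

print(round(fano_factor(uncontrolled), 2), round(fano_factor(controlled), 2))
```

With the gain free to vary, the counts are noticeably more variable than Poisson; holding it fixed brings the Fano factor back to roughly 1. Controlling further latent factors in a real neuron could push variability lower still, which this toy model does not attempt to show.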

The 1995 study by Mainen and Sejnowski4 tells this story beautifully in a single figure: The left panel shows the behavior of an isolated neuron when driven by a constant input. The top panel overlays the traces from many repeated trials. The bottom panel is a summary showing each spike (tick mark) on many trials (rows). Each trial, the neuron begins consistently, but stochasticity seems to emerge after some time. In panel B, the same pseudo-random input drives the cell on 25 different trials. Amazingly, the cell does nearly the same thing every trial.

Just like the dice above, given enough control of its inputs, the randomness of single neurons seems to evaporate. Does this mean that probability is the wrong tool for understanding neural coding? Absolutely not! In the dice example, perhaps we should have stopped after controlling for the shapes of the dice. Let the rest be random.

Similarly, there is a “sweet spot” to the amount of control we should exert in our models of neural coding. Too little, and we underestimate neurons’ information-coding capabilities. Neurons with super-Poisson variability may be an indication that we are in this regime. Conversely (and perhaps surprisingly), too much precision may over-estimate the information content of a neural population. Fitting a model of individual neurons’ spike times and how they depend on the stimulus may be an indication that we are in this regime. Between these extremes is a balance between controlling for things that should be controlled and summarizing things where details are irrelevant.

If you are not convinced that having “too good” a model is a problem, consider this: how much of the information in one brain area ever makes it out of that brain area? Just as we needed to know extremely precise details of how the die was thrown (or of the inputs to a single neuron) to make precise predictions, for one brain area to “use” all of the information in another brain area would mean that an incredible amount of detail would need to be communicated between the two. Experimental evidence suggests, however, that cortical brain areas typically communicate via the average firing rate of cells, glossing over details like individual cells’ timing or the synaptic states of local circuits. These are all irrelevant details from the perspective of a downstream area. If we, as neuroscientists, want to understand how the brain encodes and transmits information, then using probabilistic models is not simply a matter of laziness or imprecision. Probabilistic models are how the experimenter plays the role of a homunculus “looking at” the output of another brain area, because what one brain area tells the other is only ever a summary of the complex inner-workings of neural circuits.

## Recapitulation

1.  we use probabilistic models not because something is random, but because we’re missing some information about it. Or, if you like, this is really what we mean when we say something is random.
2. some things are just too complex to model even with all necessary information, so we fall back on probabilistic models.
3. maybe individual neurons are somewhat stochastic, but it’s a moot point regardless, since…
4. what brain area A tells brain area B is limited. What neuroscientists measure about brain area A is analogously limited. How neuroscientists “decode” this limited information is a better model of how brain areas communicate than more detailed encoding models (in certain situations).

## Footnotes

1. Information theory adopts this philosophy and formalizes surprise as the negative log probability of an event.
2. This betting argument is based on the Dutch Book Argument.
3. Disclaimer: I am certainly not a physicist, so I apologize for any blatant misrepresentations here.
4. Mainen, Z. and Sejnowski, T. (1995) “Reliability of spike timing in neocortical neurons.” Science 268(5216):1503-6.

## Generative Models in Perception

I started this tutorial on “perception-as-inference” in the last post with the idea that – through the mechanisms of  ambiguity and noise – the world enters the mind through the senses in an incomplete form, lacking a clear and unambiguous interpretation. I hinted that perception may engage an inference process, using its prior experience in the world to settle on a particular likely interpretation of a scene (or perhaps a distribution of likely interpretations). In this school of thought, perception itself is the result of an inference process deciding on one likely interpretation of sensory data or another. The key function of sensory neurons in the brain would then be computing and evaluating probability distributions of plausible “features” of the world.

But… there is no one way to build a probabilistic model. How does the brain know what “features” to look for? For example, how does it decide to sort out the cacophony of electrical signals coming from the optic nerve in terms of objects, lights, textures, and everything else that makes up our visual experience? When listening to music, how does it decide to interpret vibrations of the eardrum as voices and instruments? One appealing hypothesis is that the brain learns1  generative models for its stream of sensory data, which can be thought of as a particular type of probabilistic model that captures cause and effect2. In our visual example, objects, lights, and textures are the causes, and electrical signals in the visual system are the effects. Inference is the reverse process of reasoning about causes after observing effects. More on this below…

## Generative Models, Latent Variables

Given some complicated observed data, a generative model posits that there exists a set of unobserved states of the world as the underlying cause of what is being seen or measured. Let’s take a more relatable example. When radiologists learn to read X-rays, they could learn to directly correlate the patterns of splotches in the image pixel by pixel with possible adverse health symptoms; but this would not be a very good use of their time. Instead, they learn how diseases cause both adverse health symptoms and patterns of scan splotches. The ailment or disease may never be observed directly (it is a latent variable), but it may be inferred since the doctor knows how the disease manifests in observable things – i.e. the doctor has learned a generative model of X-ray images and symptoms conditioned on possible diseases.

To perceive objects in a scene, your brain solves an analogous problem. In the visual example, the impulses in your optic nerve are the “symptoms” and any objects, people, or shapes you perceive are the root cause of them – the “disease” (no offense to objects, people, and shapes). That is, we reason about visual things in terms of objects because our visual system has implicitly learned a (generative) process in which objects cause signals in the eye. This process involves photons bouncing off the object then passing through the eye, the transduction of those photons into electrical signals of retinal rod and cone cells, some further retinal preprocessing of those signals, and eventually relaying them down the optic nerve to the rest of the brain. Suffice it to say, it is complicated. Now imagine trying to invert that whole process, going from nerve signals back to objects, and you might gain a new appreciation for what your visual system does every waking moment of your life!

### Analysis by Synthesis

One intuitive, though not very effective way of doing inference with generative models is the idea of analysis by synthesis. Using the X-ray example from above, imagine the life of the frazzled doctor who memorized a procedure for sketching drawings of what different diseases might look like on a scan, but has not yet figured out how to go in the other direction – i.e. to look at a scan and jump to a diagnosis. “Surely,” the doctor thinks, cursing her backwards education, “they should not have taught us how diseases cause symptoms, when what we really care about is the other way around – making a diagnosis!”

But this doctor can still make progress. Imagine that she churns out sketches of expected scans for every possible disease in proportion to how often each disease occurs, and compares the sketches side by side to a patient’s scan. After a long night with pencil and paper, she finds that her imagined sketch of a hypothetical case of pneumonia looks suspiciously similar to the scan of a patient earlier that day (and other sketches have not matched). Suddenly, pneumonia becomes the most probable diagnosis.

This example shows that data can be analyzed simply by synthesizing exemplars and comparing each one. In spirit, this is what inference in a generative model is all about – finding the most likely (unobserved) causes for some (observed) effects by searching over all possible causes and considering (1) whether it is consistent with the observations, and (2) how likely the cause is a priori. In practice, there are much, much more efficient algorithms for inference, which will be described in more detail in future posts.
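Here is a minimal sketch of the doctor's procedure, under loud assumptions: three made-up diagnoses with invented prior frequencies, each "sketch" reduced to a three-number template, and similarity scored with an independent Gaussian noise model. It enumerates every cause rather than sampling them, which is a simplification of what the doctor does above.

```python
import math

# Hypothetical priors: how often each cause occurs a priori.
priors = {"healthy": 0.90, "pneumonia": 0.07, "fracture": 0.03}

# Hypothetical "sketches": the scan each cause is expected to produce.
sketches = {
    "healthy": [0.1, 0.1, 0.1],
    "pneumonia": [0.8, 0.7, 0.1],
    "fracture": [0.1, 0.2, 0.9],
}


def log_likelihood(scan, sketch, noise_sd=0.2):
    """(1) Consistency with the observation: Gaussian log-likelihood of the
    scan given a sketch, up to an additive constant shared by all causes."""
    return sum(-0.5 * ((s - t) / noise_sd) ** 2 for s, t in zip(scan, sketch))


def posterior(scan):
    """Combine (1) consistency with (2) prior plausibility for every cause."""
    log_joint = {
        cause: math.log(priors[cause]) + log_likelihood(scan, sketches[cause])
        for cause in priors
    }
    largest = max(log_joint.values())  # subtract before exp for stability
    unnormalized = {c: math.exp(v - largest) for c, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


patient_scan = [0.7, 0.6, 0.2]
beliefs = posterior(patient_scan)
print(max(beliefs, key=beliefs.get), beliefs)
```

Even though pneumonia is rare a priori in this toy setup, the scan matches its sketch so much better than the others that it ends up as the most probable cause.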

For now, I will end by suggesting that you take a few moments to introspect next time you get tricked by an every-day illusion. It happens all the time – we hear a distant sound or see something out of the corner of our eye and think we know what it is, then a moment later we reconsider and realize we’ve made a mistake. Next time this happens, ask yourself whether your first impression made sense in the context of generative models and inference. Did you jump to the first conclusion because it was simpler? Were you expecting one thing but encountered another? Could the “data” coming into your senses have plausibly been generated by both the first and second interpretation?

### Footnotes

1. This could mean an individual’s learning from experience, or coarser shaping of the system by evolution.
2. Sometimes the term “generative model” is used off-hand to mean the same thing as a “probabilistic model.” If you give me a joint distribution p(X,Y) of two variables X and Y, I can generate values of X consistent with the constraint that Y takes on a particular value y by evaluating p(X|Y=y). When we factorize the joint distribution into p(X|Y)p(Y), we say that we are modeling a process where Y generates X. Conversely, p(Y|X=x) can be used to generate values of Y consistent with X taking on the value x. However, there is a distinction to be made between simply factorizing a joint distribution in one way or another, as I just described, and having a true generative model. The former just describes correlations or statistical coincidences, while the latter describes causation of the form “if Y takes on this value, then X will take on these other values with some probability.” The distinction matters when an intervention can be made to perturb Y and we care about whether this will affect X. In the context of perception-as-inference, we typically have the latter type – a true causal model – in mind. It is unclear to me, however, if I would be any worse off with a purely correlational visual system. (In other words, it might not matter to my survival if I assume that my senses cause the world to exist in a particular state). Perhaps I will revisit this distinction in a future post.

### References

 Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.

## Why probability?

Let’s start with a broad question: what makes probability theory the right tool to model perception? To put it simply, the world is much fuzzier and less certain than it seems. Given the chance to design a system that functions in an uncertain world, the optimal thing to do would be to have it explicitly reason about probable and improbable things. But if the brain is not designed per se, then is there more than human ego to make us think that it would function in a similarly optimal way? In a future post, I will outline the kinds of evidence there are for this, but for now let’s just assume (or let’s hypothesize) that the brain is at least trying to do the right thing, and is getting pretty close. This is known as the Bayesian Brain Hypothesis, and its roots go at least as far back as Hermann von Helmholtz, who in 1867 described perception as a process of “unconscious [probabilistic] inference”.

A similar story can be told for cognition: where are the crisp boundaries between concepts like ‘cup’ and ‘mug’? What makes something ‘art’? When is thought precise? Rather than relying on crisp definitions, the mind works largely by induction, generalization, simulation, and reasoning, all of which are naturally formalized in probabilistic terms.

It should be clarified that a probabilistic brain is not necessarily a random brain. If you are predicting the outcomes of a weighted coin flip that is heads 55% of the time and tails 45% of the time, the ideal probabilistic response would be to guess heads every time. What’s important is that a probabilistic system is not very confident in that guess. Later, we will see examples of ‘sampling’ algorithms where generating random numbers is a tool for reasoning probabilistically, but other algorithms achieve probabilistic answers with no randomness at all!
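A quick calculation shows why guessing heads every time beats a "random" strategy. The comparison strategy here, guessing heads on a random 55% of flips (often called probability matching), is my addition for contrast and is not discussed in the text.

```python
def accuracy_always_majority(p_heads):
    """Always guess the more likely side."""
    return max(p_heads, 1.0 - p_heads)


def accuracy_probability_matching(p_heads):
    """Guess heads with probability p_heads, independently of the coin.
    Correct when guess and coin agree: both heads or both tails."""
    return p_heads ** 2 + (1.0 - p_heads) ** 2


p = 0.55
print(accuracy_always_majority(p), accuracy_probability_matching(p))
```

The deterministic guesser is right 55% of the time, while the random one is right only about 50.5% of the time. The probabilistic part of the ideal response lies in its confidence, not in its choice.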

## Whence comes the uncertainty

Even in early sensory processing, the brain faces substantial uncertainty about the ‘true’ state of the world. I like to break this problem down into two parts that I call ambiguity and noise. Using visual terms,

1. ambiguity arises when many different ‘world states’ can give rise to exactly the same image.
2. noise refers to the fact that subtly different images might evoke the same activity in the brain.

The chessboard image shown here is a classic example of how ambiguity arises and how the brain resolves it. White squares in the cylinder’s shadow are exactly the same color as dark squares outside the shadow, yet they are perceived differently. To put it another way, exactly the same image (a gray square) was created from different world states (light square + shadow or dark square + no shadow). It is easy to think that this is a trivial problem since we so quickly and effortlessly perceive the true nature of the scene, but these same mechanisms make us susceptible to other kinds of illusions.

Noise occurs partly because neurons are imperfect, so the same image on the retina evokes different activity in visual cortex at different times. Noise also occurs when irrelevant parts of a scene are changing, like dust moving on a camera lens, or small movements of an object while trying to discern what it is. These are called “internal noise” and “external noise” respectively. Strictly speaking, noise as described here is not by itself an issue; the same input may map to many different patterns of neural activity in the brain, but as long as we can invert the mapping there is no problem! The only time we cannot do this inversion is when the noise results in overlapping neural patterns for different states in the world.<sup>1</sup> That is, noise is only an issue if it results in ambiguity!

A, B, and C refer to different world states (different images) on the left, and different brain states (patterns of neural activity) on the right. Despite noise (gray ovals), C can be distinguished from both A and B. Where A and B overlap, the mapping from the world to the brain cannot be inverted, so there would be uncertainty in whether A or B was the cause.

Noise explains why there is a limit to our ability to make extremely fine visual distinctions, like the difference between a vertical line and a line tilted off of vertical by a small fraction of a degree – similar enough inputs will have indistinguishable patterns of neural activity.
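The idea that noise only hurts when it makes neural patterns overlap can be sketched with a one-dimensional toy model. Assume (purely for illustration) that each stimulus evokes a single noisy number drawn from a Gaussian with the same spread, and that two stimuli are equally likely. The best possible observer then picks the correct stimulus with probability Φ(d′/2), where d′ is the separation of the means in units of the noise standard deviation.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def best_accuracy(mean_1, mean_2, noise_sd):
    """Accuracy of an ideal observer telling apart two equally likely stimuli
    whose evoked responses are Gaussian with equal standard deviation."""
    d_prime = abs(mean_1 - mean_2) / noise_sd
    return normal_cdf(d_prime / 2.0)

# Hypothetical mean responses to three stimuli; the noise level is the same for all.
mean_a, mean_b, mean_c = 0.0, 1.0, 6.0
noise_sd = 1.0

a_vs_b = best_accuracy(mean_a, mean_b, noise_sd)  # heavy overlap
a_vs_c = best_accuracy(mean_a, mean_c, noise_sd)  # almost no overlap
nearly_identical = best_accuracy(0.0, 0.01, noise_sd)  # e.g. a very slightly tilted line

print(f"A vs B: {a_vs_b:.3f}")
print(f"A vs C: {a_vs_c:.3f}")
print(f"nearly identical inputs: {nearly_identical:.3f}")
```

With the same amount of noise, A and C remain almost perfectly distinguishable, A and B are confused about 30% of the time, and nearly identical inputs are distinguished barely better than chance.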

Finally, it is important to note that an information bottleneck in the visual system also indirectly implies a kind of noise.<sup>2</sup> Information bottlenecks arise whenever a ‘channel’ can take on fewer states than the messages sent across it; a classic example is that the optic nerve has too few axons to transmit the richness of all retinal patterns.

Think of an information bottleneck as losing detail about a scene. The logic may seem backwards, but the lack of detail on the right implies that some scenes on the left are forced to become indistinguishable after passing through the bottleneck.

The fact that an information bottleneck implies uncertainty is counter-intuitive at first since it makes no statement about how any particular image or scene is affected.<sup>3</sup> Perhaps it is more intuitive to think of an information bottleneck as a kind of continuous many-to-one mapping, where different inputs are forced to map to similar neural states. As seen in the overlap between states “A” and “B” in the second illustration above, a many-to-one mapping cannot be inverted, so there must be uncertainty about the true scene.
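A discrete toy version of the many-to-one picture may help. Assume (hypothetically) that there are 8 equally likely scenes but the channel can take on only 4 states. The mapping below is an arbitrary choice; any mapping from 8 scenes onto 4 states must merge some of them.

```python
import math
from collections import defaultdict

N_SCENES = 8          # assumed number of distinct scenes
N_CHANNEL_STATES = 4  # assumed number of states the channel can take on

def bottleneck(scene):
    """Arbitrary many-to-one mapping: neighboring scenes share a channel state."""
    return scene * N_CHANNEL_STATES // N_SCENES

# Group scenes by the channel state they produce.
preimages = defaultdict(list)
for scene in range(N_SCENES):
    preimages[bottleneck(scene)].append(scene)

def posterior_over_scenes(channel_state):
    """With equally likely scenes, every scene mapped to this state is equally probable."""
    candidates = preimages[channel_state]
    return {scene: 1.0 / len(candidates) for scene in candidates}

def entropy_bits(distribution):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

post = posterior_over_scenes(0)
remaining_uncertainty = entropy_bits(post)

print(dict(preimages))
print(post)
print(f"uncertainty left after observing the channel: {remaining_uncertainty} bit(s)")
```

The scenes carry 3 bits, the channel carries at most 2, and the missing bit shows up as leftover uncertainty about which of two scenes was the cause. This happens even though nothing in the mapping is random.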

## Wrapping up

Wherever there is uncertainty, the optimal thing to do is to play the odds and think probabilistically. As I hope to have conveyed in this post, uncertainty about the ‘true’ state of the world is a ubiquitous problem for perception, though it may not seem so introspectively. Future posts will elaborate on the process of inference, which resolves such uncertainty and settles on the most likely interpretation(s) of an image or scene.

### Footnotes

1. Advanced readers may recognize this as the logic behind information-limiting correlations.

2. For those familiar with information theory, this is because an upper bound on the mutual information between input (an image) and evoked neural activity implies, for a given entropy of the neural activity, a lower bound on the conditional entropy of the neural activity given the input. For those unfamiliar with information theory, stay tuned for a future post =)

3. Rate distortion theory allows one to make more concrete statements about this mapping, but only having assumed a loss function that quantifies how bad it is to lose some details relative to others.

### References

Knill, D. C., & Pouget, A. (2004). The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), 712–719. http://doi.org/10.1016/j.tins.2004.10.007

von Helmholtz, H. (1867). Handbuch der physiologischen Optik.

Moreno-Bote, R., Beck, J. M., Kanitscheider, I., Pitkow, X., Latham, P., & Pouget, A. (2014). Information-limiting correlations. Nature Neuroscience, 17(10), 1410–1417. http://doi.org/10.1038/nn.3807

## Hello World

Hello World! This is a bimonthly blog about “the brain”, through the lens of computational models of cognition and perception.

Why Box & Arrow Brain?
The earliest “models” of cognition consisted of boxes and arrows. In case you’ve missed them, here are a few examples:

Sternberg, S. (1969). Memory-scanning: Mental processes revealed by reaction-time experiments. American Scientist, 57(4), 421–457.

Meyer, D. E., Schvaneveldt, R. W., & Ruddy, M. G. (1975). Loci of contextual effects on visual word recognition. Attention and Performance, 98–118.

Meck, W. H., & Church, R. M. (1983). A mode control model of counting and timing processes. Journal of Experimental Psychology: Animal Behavior Processes, 9(3), 320.

Box and arrow models are intuitive, easy to draw, and are generally a good starting point for understanding a complex system; however, by modern standards they are horribly imprecise, often relying on implicit (or unclear) theoretical and philosophical commitments. Modern/surviving theories fill in these gaps by formalizing mathematical architectures or explicitly stating “linking functions” (including data analysis assumptions). This is a step in the right direction, but we have by no means reached the finish line.

In this blog, we have two goals:
1) To introduce modern cognitive modeling frameworks via tutorials; and
2) To provide some background and commentary on some of the theoretical and philosophical commitments that often go unmentioned.

We have two contributing authors. As of the inception of this blog:

Richard Lange did some computer science things for a while, then got interested in artificial intelligence and philosophy of mind, but realized nobody really knows how brains work (which would be a good first step), and now studies visual perception in humans and monkeys.

Frank Mollica is a person. He spends far too much time (and yet still not enough) thinking about language, concepts and contexts. Follow him on twitter @FrancisMollica because shameless self-promotion (and new friends/enemies?).