## Why Probability? (Part 2)

The vast majority of mathematical formalisms in cognitive science, psychophysics, and neuroscience deal with probability theory in one way or another. This includes formalisms for how the brain models the world using probabilistic models, how constant stimuli elicit variable neural and behavioral responses, how words and concepts map onto the world, and more.

The thing about formalisms, though, is that they are often open to more than one interpretation. On its home turf of formal mathematics, probability theory makes statements about the behavior of sets of things (like the set of all integers), functions on those sets (like $P(k)=0.34$), and other things you can derive from them. What it doesn’t say is whether a statement like $P(number\_of\_puppies > 1)=0.3$ tells you anything about the real-world possibility of encountering two or more puppies. On the one hand, there is the formal mathematics stating that there is a random variable $number\_of\_puppies$ with a defined probability mass function $P$ which, when summed over all integers larger than $1$, evaluates to $0.3$. On the other hand, there either are adorable small dogs or there are not, and it seems that this should be a closed case. The mathematics gives us a valid statement, but the interpretation of such a statement as it pertains to real-world events is in the realm of philosophy. The same goes for neuroscientists and cognitive scientists – we need to be sure, independently of the mathematics, that statements about probability are grounded in reality. This means that first we need to have a solid understanding of what probability means.

Anybody using probability theory (or information theory, decision theory, signal detection theory, rate distortion theory, etc.) ought to grapple with these philosophical questions. In fact, everybody should do this because it’s interesting and fun. In the remainder of this post, I will attempt to outline my own perspective.

## True Randomness vs Ignorance

tl;dr we use probabilistic models not because something is random, but because we’re missing some information about it.

Although it’s intuitive to think that the ideas of probability only bear on random events, here I’ll try to convince you that approaching a problem probabilistically is more an admission of ignorance than a claim of randomness. Or, if you like, to claim that a system is random is a claim of ignorance. Note that by “random” I don’t mean “completely unpredictable.” For example, if you roll and sum two standard dice, the result is both “random” as well as more likely to be a 7 than a 2. That is, $$P(die_1+die_2=7) > P(die_1+die_2=2)$$
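To make the inequality concrete, here is a minimal Python sketch that enumerates all 36 outcomes of two dice. It assumes both dice are fair and six-sided; the function name `two_dice_sum_distribution` is invented for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_sum_distribution(sides: int = 6) -> dict[int, Fraction]:
    """Exact distribution of the sum of two fair dice, by brute-force enumeration."""
    counts = Counter(a + b for a, b in product(range(1, sides + 1), repeat=2))
    total = sides * sides
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


dist = two_dice_sum_distribution()
print(dist[7], dist[2])  # 1/6 1/36
```

Six of the 36 equally likely pairs sum to 7, while only one pair sums to 2, so the result is "random" and still far from "completely unpredictable."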

Speaking of dice, probability theory has its historical roots in gambling. Back in the 17th century, gamblers had an immediate practical use for the idea that each face of a 6-sided die comes up with a 1/6 chance, instead of attributing success to luck or to fate.

Gambling examples help reveal the dual nature of probability as both a frequentist count of the fraction of times an event happens after many, many repeats, and as a belief about single events that haven’t happened yet.

In the frequentist perspective, the statement $P(die=4)=1/6$ is a claim that, repeatedly rolling the die under the same conditions $N$ times, the fraction of times $4$ appeared will approach $1/6$ as $N$ gets large. In this view, it makes little sense to talk about the probability of a single isolated event, just as the statement $P(number\_of\_puppies > 1)=0.3$ could not be interpreted as a “yes” or “no” answer to the Big Questions, “will there be puppies? here? soon?”
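The frequentist claim can be illustrated with a short simulation. This is only a sketch: it uses Python's pseudo-random number generator as a stand-in for a fair die, and `fraction_of_fours` is a made-up helper name.

```python
import random


def fraction_of_fours(n_rolls: int, seed: int = 0) -> float:
    """Roll a simulated fair die n_rolls times; return the fraction of 4s."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 4)
    return hits / n_rolls


# The fraction should settle near 1/6 (about 0.167) as N grows.
for n in (10, 1_000, 100_000):
    print(n, fraction_of_fours(n))
```

For small $N$ the fraction can be far from $1/6$; the frequentist statement is only about what happens as $N$ gets large.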

This brings us to the notion of probability as a belief. Under this interpretation, we don’t need $N$ repeats of the same circumstances as $N$ gets large. Instead, probability is like a graded prediction for each future event. Based on the event’s actual outcome, it affords us a graded level of satisfaction (if the actual event was predicted with high probability) or surprise (if the event was predicted with low probability).1
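The "graded surprise" idea has a standard formalization in information theory as the negative log probability of the outcome that occurred (see the first footnote). A minimal sketch, where the choice of base-2 logarithms (bits) and the name `surprisal_bits` are my assumptions:

```python
import math


def surprisal_bits(p: float) -> float:
    """Surprise, in bits, of observing an event that was assigned probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


print(surprisal_bits(0.5))    # a fair coin flip: 1.0 bit
print(surprisal_bits(1 / 6))  # one face of a fair die: about 2.58 bits
print(surprisal_bits(0.99))   # a near-certain event: close to 0 bits
```

Events predicted with high probability carry little surprise, and events predicted with low probability carry a lot.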

Frequencies and belief perhaps should be related, but there is no mathematical law requiring them to be. Again, the interpretation of formal mathematics is outside the realm of formal mathematics. There are, however, plenty of practical and epistemological arguments why probability as a belief should be calibrated to probability as a frequency. You may strongly believe that my dice are loaded and that $3$ is the most likely number. But if it is truly a fair die, then I will always win money in the long run by having beliefs that are closer to the true frequencies (assuming bets are based on beliefs).2 Calibration of probabilities as beliefs is a practical concern in machine learning; it is dangerous for a virtual doctor to make poorly calibrated diagnoses, even if it typically gets the “most likely” answer correct. For example, if a patient is 55% likely to have disease A and 45% likely to have disease B, but a model spits out 90% for A and 10% for B, then doctors may proceed over-confidently with the wrong treatment. Even though the math police do not make arrests for poorly calibrated beliefs, it’s still a good idea to get them right.
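One way to put a number on the cost of poor calibration is the expected log loss (the average surprise under the true frequencies). The post does not commit to any particular scoring rule, so treat this choice as an assumption; the probabilities are the ones from the disease example above.

```python
import math


def expected_log_loss(true_probs, predicted_probs):
    """Average surprise (in nats) of a forecaster whose beliefs are predicted_probs
    when outcomes actually occur with frequencies true_probs."""
    return -sum(p * math.log(q) for p, q in zip(true_probs, predicted_probs) if p > 0)


true_frequencies = (0.55, 0.45)  # disease A, disease B
calibrated = (0.55, 0.45)        # beliefs that match the frequencies
overconfident = (0.90, 0.10)     # same "most likely" answer, badly calibrated

loss_calibrated = expected_log_loss(true_frequencies, calibrated)
loss_overconfident = expected_log_loss(true_frequencies, overconfident)
print(loss_calibrated, loss_overconfident)
```

Both forecasters pick disease A as the most likely diagnosis, yet the overconfident one pays a noticeably higher expected penalty, driven by the 45% of cases in which B turns out to be true.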

Perhaps the best way to calibrate your beliefs would be to take the frequentist approach: repeat an experiment $N$ times and simply count how many times each outcome occurs. Immediately you run head-first into another philosophical conundrum: what counts as a repeat? Let’s look at the dice example more closely. Can you roll a jar full of different colored dice, or do you need to roll the same die each time? Does it count if you roll it on different surfaces? If each of your $N$ rolls is under slightly different circumstances, then are you justified in aggregating all the results together when estimating frequencies? Clearly there are many factors influencing the outcome of the die (see the next box-and-arrow diagram).

If we let all of the “causes” in this diagram vary freely, we’ll see that the results are, well, fairly unpredictable.

Next, you might (rightly) suggest that introducing misshapen dice into our experiment should not count as “repeats” of the experiment. Remember: the goal of the experiment is to estimate the actual frequencies of a 6-sided die. Let’s see what happens when we control for the shape.

But why stop there? There is quite a bit of variability between the surfaces still. Might that affect the results? (In the above animations, reddish = bouncy and blueish = slippery). Let’s control for surface variety.

Great. Surface variations are now accounted for. Still, our die experiment has a lurking cause. Let’s deal with the final variable.

Hmm – we get the same roll every time. Dice are the gold standard of random. They show up in every probability theory textbook. Probability was invented to make sense of them. Yet, given more and more control of the context, their randomness evaporates. Much of the world is, of course, deterministic in this sense. The frequentist idea of probability requires that we precisely repeat an experiment $N$ times, just not too precisely. This is not at all a criticism of the frequentist perspective; having well-calibrated beliefs about a well-controlled die would, of course, mean predicting that the result is almost always the same (conditioning beliefs on some additional knowledge is akin to controlling more conditions of a frequentist’s experiment).

So, dice are not really “random” after all, as long as we know enough about the context and have all the computing power in the world to aid us in making our predictions. This is where probability comes in. The reason we say a die has a 1/6 probability of landing on each number is that we never roll it the same way twice, and we never know the exact physical parameters of each roll. Probability theory allows us to make sense of the world despite our ignorance of (or inability to compute) the minutiae that make every event unique.
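The argument can be mimicked with a toy model in which the die is a purely deterministic function of its hidden causes. Everything here is hypothetical: `toy_roll` is not a physics simulation, just a hash that scrambles three made-up throw parameters so that small changes in the inputs give unrelated faces.

```python
import hashlib
import random
from collections import Counter


def toy_roll(speed: float, angle: float, friction: float) -> int:
    """Deterministic stand-in for dice physics (hypothetical, NOT a real simulation)."""
    key = f"{speed!r},{angle!r},{friction!r}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 6 + 1


def face_frequencies(rolls: list[int]) -> dict[int, float]:
    """Fraction of rolls that landed on each face 1..6."""
    counts = Counter(rolls)
    return {face: counts[face] / len(rolls) for face in range(1, 7)}


rng = random.Random(0)
n_rolls = 60_000

# Uncontrolled: the thrower never repeats the same speed, angle, or surface.
uncontrolled = [
    toy_roll(rng.uniform(1.0, 5.0), rng.uniform(0.0, 360.0), rng.uniform(0.0, 1.0))
    for _ in range(n_rolls)
]
# Fully controlled: every hidden cause is pinned to the same value.
controlled = [toy_roll(3.0, 45.0, 0.5) for _ in range(n_rolls)]

print(face_frequencies(uncontrolled))  # each face close to 1/6
print(face_frequencies(controlled))    # a single face gets frequency 1.0
```

Nothing inside `toy_roll` is random. The 1/6 frequencies appear only because the inputs vary from throw to throw in ways the observer does not track.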

(In case it is not clear by now, I am using the term “ignorance” to simply mean missing some information about a system. Complex systems have many interacting parts; if only some of them are observed, then the unobserved parts may impart forces that appear “random” in many cases.)

##### Quick side-note on randomness in quantum physics for those who are interested…

Dice may be the gold standard of randomness to statisticians, but the quantum world is the only “true” source of randomness to a physicist. Modern physics has largely accepted this philosophy that “uncertainty” in quantum physics is random in the truest sense of the word. Take the Heisenberg uncertainty principle, which states that the better you know a particle’s location the less you can know about its momentum (or vice versa); there is no “less ignorant” observer who could, even in principle, make a more accurate prediction. This is true randomness. Einstein, for what it’s worth, always hoped that quantum physics might some day be understood in deterministic terms, where what we currently think of as “random” might be explained as the workings of yet smaller or stranger particles “behind the scenes” that we simply haven’t observed yet. That is, we see the quantum world as “random” because we’re ignorant of some other hidden aspect of the universe.3 When Einstein wrote

God does not play dice with the universe.

clearly he had in mind a more random kind of die than I’ve described here. Perhaps he meant

God does play dice with the universe, but we can’t see how the dice are rolled (yet)

## Are neurons really random? Or, what did one brain area say to the other?

tl;dr maybe, but given everything above, it’s a moot point regardless.

A commonplace in neuroscience textbooks is the statement that neurons are stochastic. The classic example is that the same exact image can be presented on a screen many times, but neurons in visual cortex never seem to respond the same way twice, even when reducing our measurements to something as simple as the total spike count ($\mathbf{r}$). Models typically assume that the stimulus – and some other factors – set the mean firing rate of the neuron ($\lambda$), but that spikes are Poisson-distributed given that mean rate:
$$P(\mathbf{r}) = \frac{\lambda^\mathbf{r}e^{-\lambda}}{\mathbf{r}!}$$
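For readers who prefer code to formulas, here is the same Poisson formula as a small Python function, along with a numerical check that the mean and the variance of the spike count both equal $\lambda$. The rate of 5 spikes per trial is an arbitrary value chosen for illustration.

```python
import math


def poisson_pmf(r: int, lam: float) -> float:
    """Probability of observing exactly r spikes when the mean count is lam."""
    return lam**r * math.exp(-lam) / math.factorial(r)


lam = 5.0          # hypothetical mean spike count
counts = range(60)  # the tail beyond 60 spikes is negligible for lam = 5

total = sum(poisson_pmf(r, lam) for r in counts)
mean = sum(r * poisson_pmf(r, lam) for r in counts)
variance = sum((r - mean) ** 2 * poisson_pmf(r, lam) for r in counts)
print(total, mean, variance)  # approximately 1, 5, 5
```

The ratio of variance to mean (the Fano factor) is 1 for a Poisson distribution, which is why measured ratios above or below 1 are described as super-Poisson or sub-Poisson variability.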
As with the dice above, we can ask what unobserved factors contribute to the apparent randomness of sensory neurons. A back-of-the-envelope sketch might be the following:

Sure enough, there is evidence that the “randomness” of sensory neurons begins to evaporate the better we control for eye movements, fluctuating background attention levels, etc. The more we understand and can control for these latent factors influencing sensory neurons, the less the Poisson distribution will be the appropriate choice for computational neuroscientists (and a few alternatives have been proposed under the banner of “sub-Poisson variability” models).
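A toy simulation shows how an uncontrolled latent factor inflates apparent variability. The setup is entirely assumed: a hypothetical neuron with a base rate of 10 spikes per trial, multiplied by a latent gain (think of a fluctuating attention level) that takes one of three values. The Poisson sampler is Knuth's classic multiplication method, used here only to avoid external dependencies.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson sample with mean lam (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def fano_factor(counts: list[int]) -> float:
    """Variance of the spike counts divided by their mean."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean


rng = random.Random(1)
BASE_RATE = 10.0         # hypothetical mean spike count at gain 1.0
GAINS = (0.5, 1.0, 1.5)  # hypothetical latent states, equally likely

trials = []
for _ in range(30_000):
    gain = rng.choice(GAINS)
    trials.append((gain, sample_poisson(BASE_RATE * gain, rng)))

pooled = [count for _, count in trials]
controlled = [count for gain, count in trials if gain == 1.0]

print(fano_factor(pooled))      # well above 1: looks "extra random"
print(fano_factor(controlled))  # close to 1 once the latent gain is held fixed
```

Pooling across gain states makes the counts look super-Poisson; conditioning on the gain removes that excess. In this toy model the residual Poisson noise is built in by construction, so it says nothing about how far real neurons can be pushed below a Fano factor of 1.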

The 1995 study by Mainen and Sejnowski4 tells this story beautifully in a single figure:

Just like the dice above, given enough control of its inputs, the randomness of single-neurons seems to evaporate. Does this mean that probability is the wrong tool for understanding neural coding? Absolutely not! In the dice example, perhaps we should have stopped after controlling for the shapes of the dice. Let the rest be random.

Similarly, there is a “sweet spot” to the amount of control we should exact in our models of neural coding. Too little, and we underestimate their information-coding capabilities. Neurons with super-Poisson variability may be an indication that we are in this regime. Similarly (and perhaps surprisingly), too much precision may over-estimate the information content of a neural population. Fitting a model of individual neurons’ spike times and how they depend on the stimulus may be an indication that we are in this regime. Between these extremes is a balance between controlling for things that should be controlled and summarizing things where details are irrelevant.

If you are not convinced that having “too good” a model is a problem, consider this: how much of the information in one brain area ever makes it out of that brain area? Just as we needed to know extremely precise details of how the die was thrown (or of the inputs to a single neuron) to make precise predictions, for one brain area to “use” all of the information in another brain area would mean that an incredible amount of detail would need to be communicated between the two. Experimental evidence suggests, however, that cortical brain areas typically communicate via the average firing rate of cells, glossing over details like individual cells’ timing or the synaptic states of local circuits. These are all irrelevant details from the perspective of a downstream area. If we, as neuroscientists, want to understand how the brain encodes and transmits information, then using probabilistic models is not simply a matter of laziness or imprecision. Probabilistic models are how the experimenter plays the role of a homunculus “looking at” the output of another brain area, because what one brain area tells the other is only ever a summary of the complex inner workings of neural circuits.

## Recapitulation

1.  we use probabilistic models not because something is random, but because we’re missing some information about it. Or, if you like, this is really what we mean when we say something is random.
2. some things are just too complex to model even with all necessary information, so we fall back on probabilistic models.
3. maybe individual neurons are somewhat stochastic, but it’s a moot point regardless, since…
4. what brain area A tells brain area B is limited. What neuroscientists measure about brain area A is analogously limited. How neuroscientists “decode” this limited information is a better model of how brain areas communicate than more detailed encoding models (in certain situations).

## Footnotes

1. Information Theory adopts this philosophy and formalizes surprise as the negative log probability of an event.
2. This betting argument is based on the Dutch Book Argument.
3. Disclaimer: I am certainly not a physicist, so I apologize for any blatant misrepresentations here.
4. Mainen, Z. and Sejnowski, T. (1995) “Reliability of spike timing in neocortical neurons.” Science 268(5216):1503-6.

## Generative Models in Perception

I started this tutorial on “perception-as-inference” in the last post with the idea that – through the mechanisms of  ambiguity and noise – the world enters the mind through the senses in an incomplete form, lacking a clear and unambiguous interpretation. I hinted that perception may engage an inference process, using its prior experience in the world to settle on a particular likely interpretation of a scene (or perhaps a distribution of likely interpretations). In this school of thought, perception itself is the result of an inference process deciding on one likely interpretation of sensory data or another. The key function of sensory neurons in the brain would then be computing and evaluating probability distributions of plausible “features” of the world.

But… there is no one way to build a probabilistic model. How does the brain know what “features” to look for? For example, how does it decide to sort out the cacophony of electrical signals coming from the optic nerve in terms of objects, lights, textures, and everything else that makes up our visual experience? When listening to music, how does it decide to interpret vibrations of the eardrum as voices and instruments? One appealing hypothesis is that the brain learns1  generative models for its stream of sensory data, which can be thought of as a particular type of probabilistic model that captures cause and effect2. In our visual example, objects, lights, and textures are the causes, and electrical signals in the visual system are the effects. Inference is the reverse process of reasoning about causes after observing effects. More on this below…

## Generative Models, Latent Variables

Given some complicated observed data, a generative model posits that there exists a set of unobserved states of the world as the underlying cause of what is being seen or measured. Let’s take a more relatable example. When radiologists learn to read X-rays, they could learn to directly correlate the patterns of splotches in the image pixel by pixel with possible adverse health symptoms; but this would not be a very good use of their time. Instead, they learn how diseases cause both adverse health symptoms and patterns of scan splotches. The ailment or disease may never be observed directly (it is a latent variable), but it may be inferred since the doctor knows how the disease manifests in observable things – i.e. the doctor has learned a generative model of X-ray images and symptoms conditioned on possible diseases.

To perceive objects in a scene, your brain solves an analogous problem. In the visual example, the impulses in your optic nerve are the “symptoms” and any objects, people, or shapes you perceive are the root cause of them – the “disease” (no offense to objects, people, and shapes). That is, we reason about visual things in terms of objects because our visual system has implicitly learned a (generative) process in which objects cause signals in the eye. This process involves photons bouncing off the object then passing through the eye, the transduction of those photons into electrical signals of retinal rod and cone cells, some further retinal preprocessing of those signals, and eventually relaying them down the optic nerve to the rest of the brain. Suffice it to say, it is complicated. Now imagine trying to invert that whole process, going from nerve signals back to objects, and you might gain a new appreciation for what your visual system does every waking moment of your life!

### Analysis by Synthesis

One intuitive, though not very effective, way of doing inference with generative models is the idea of analysis by synthesis[1]. Using the X-ray example from above, imagine the life of the frazzled doctor who memorized a procedure for sketching drawings of what different diseases might look like on a scan, but has not yet figured out how to go in the other direction – i.e. to look at a scan and jump to a diagnosis. “Surely,” the doctor thinks, cursing her backwards education, “they should not have taught us how diseases cause symptoms, when what we really care about is the other way around – making a diagnosis!”

But this doctor can still make progress. Imagine that she churns out sketches of expected scans for every possible disease in proportion to how often each disease occurs, and compares the sketches side by side to a patient’s scan. After a long night with pencil and paper, she finds that her imagined sketch of a hypothetical case of pneumonia looks suspiciously similar to the scan of a patient earlier that day (and other sketches have not matched). Suddenly, pneumonia becomes the most probable diagnosis.

This example shows that data can be analyzed simply by synthesizing exemplars and comparing each one. In spirit, this is what inference in a generative model is all about – finding the most likely (unobserved) causes for some (observed) effects by searching over all possible causes and considering (1) whether it is consistent with the observations, and (2) how likely the cause is a priori. In practice, there are much, much more efficient algorithms for inference, which will be described in more detail in future posts.
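The doctor's procedure can be sketched in a few lines of Python. All numbers are invented for illustration: three hypothetical conditions with made-up prior frequencies, and a "scan" reduced to three binary splotches, each present with a disease-specific probability. The code draws diseases in proportion to how often they occur, synthesizes a scan for each, and keeps only the sketches that match the patient's scan.

```python
import random
from collections import Counter

# Hypothetical prior frequencies and splotch probabilities (not real medicine).
PRIOR = {"healthy": 0.90, "bronchitis": 0.07, "pneumonia": 0.03}
SPLOTCH_PROBS = {
    "healthy": (0.05, 0.05, 0.05),
    "bronchitis": (0.60, 0.20, 0.10),
    "pneumonia": (0.90, 0.80, 0.70),
}


def synthesize_scan(disease: str, rng: random.Random) -> tuple:
    """Generative direction: a disease causes a (noisy) pattern of splotches."""
    return tuple(int(rng.random() < p) for p in SPLOTCH_PROBS[disease])


def analysis_by_synthesis(observed: tuple, n_sketches: int, rng: random.Random) -> dict:
    """Estimate the posterior by sketching scans and keeping those that match."""
    diseases = list(PRIOR)
    weights = [PRIOR[d] for d in diseases]
    matches = Counter()
    for _ in range(n_sketches):
        disease = rng.choices(diseases, weights=weights)[0]
        if synthesize_scan(disease, rng) == observed:
            matches[disease] += 1
    total = sum(matches.values())
    return {d: matches[d] / total for d in diseases} if total else {}


def likelihood(scan: tuple, disease: str) -> float:
    """Probability that this disease produces exactly this scan."""
    prob = 1.0
    for splotch, p in zip(scan, SPLOTCH_PROBS[disease]):
        prob *= p if splotch else 1.0 - p
    return prob


def exact_posterior(observed: tuple) -> dict:
    """Bayes' rule by direct enumeration, for comparison."""
    joint = {d: PRIOR[d] * likelihood(observed, d) for d in PRIOR}
    total = sum(joint.values())
    return {d: j / total for d, j in joint.items()}


rng = random.Random(0)
observed_scan = (1, 1, 1)  # the patient's scan shows all three splotches
estimate = analysis_by_synthesis(observed_scan, n_sketches=100_000, rng=rng)
exact = exact_posterior(observed_scan)
print(estimate)
print(exact)
```

Even though pneumonia is rare a priori in this toy setup, it dominates the posterior because it is by far the most likely cause to produce the observed scan. Note how wasteful the procedure is: the vast majority of the 100,000 sketches are thrown away, which is why more efficient inference algorithms matter in practice.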

For now, I will end by suggesting that you take a few moments to introspect next time you get tricked by an every-day illusion. It happens all the time – we hear a distant sound or see something out of the corner of our eye and think we know what it is, then a moment later we reconsider and realize we’ve made a mistake. Next time this happens, ask yourself whether your first impression made sense in the context of generative models and inference. Did you jump to the first conclusion because it was simpler? Were you expecting one thing but encountered another? Could the “data” coming into your senses have plausibly been generated by both the first and second interpretation?

### Footnotes

1. This could mean an individual’s learning from experience, or coarser shaping of the system by evolution.
2. Sometimes the term “generative model” is used off-hand to mean the same thing as a “probabilistic model.” If you give me a joint distribution p(X,Y) of two variables X and Y, I can generate values of X consistent with the constraint that Y takes on a particular value y by evaluating p(X|Y=y). When we factorize the joint distribution into p(X|Y)p(Y), we say that we are modeling a process where Y generates X. Conversely, p(Y|X=x) can be used to generate values of Y consistent with X taking on the value x. However, there is a distinction to be made between simply factorizing a joint distribution in one way or another, as I just described, and having a true generative model. The former just describes correlations or statistical coincidences, while the latter describes causation of the form “if Y takes on this value, then X will take on these other values with some probability.” The distinction matters when an intervention can be made to perturb Y and we care about whether this will affect X. In the context of perception-as-inference, we typically have the latter type – a true causal model – in mind. It is unclear to me, however, if I would be any worse off with a purely correlational visual system. (In other words, it might not matter to my survival if I assume that my senses cause the world to exist in a particular state). Perhaps I will revisit this distinction in a future post.

### References

[1] Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.

## Why probability?

Let’s start with a broad question: what makes probability theory the right tool to model perception? To put it simply, the world is much fuzzier and less certain than it seems. Given the chance to design a system that functions in an uncertain world, the optimal thing to do would be to have it explicitly reason about probable and improbable things. But if the brain is not designed per se, then is there more than human ego to make us think that it would function in a similarly optimal way? In a future post, I will outline the kinds of evidence there are for this, but for now let’s just assume (or let’s hypothesize) that the brain is at least trying to do the right thing, and is getting pretty close. This is known as the Bayesian Brain Hypothesis,[1] and its roots go at least as far back as Hermann von Helmholtz, who in 1867 described perception as a process of “unconscious [probabilistic] inference”.[2]

A similar story can be told for cognition: where are the crisp boundaries between concepts like ‘cup’ and ‘mug’? What makes something ‘art’? When is thought precise? Rather than relying on crisp rules and boundaries, the mind works largely by induction, generalization, simulation, and reasoning, all of which are naturally formalized in probabilistic terms.

It should be clarified that a probabilistic brain is not necessarily a random brain. If you are predicting the outcomes of a weighted coin flip that is heads 55% of the time and tails 45% of the time, the ideal probabilistic response would be to guess heads every time. What’s important is that a probabilistic system is not very confident in that guess. Later, we will see examples of ‘sampling’ algorithms where generating random numbers is a tool for reasoning probabilistically, but other algorithms achieve probabilistic answers with no randomness at all!
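The weighted-coin example is easy to verify with a little arithmetic. For comparison, this sketch also includes a "probability matching" strategy (guessing heads on a random 55% of flips); that second strategy is my addition and is not discussed in the text above.

```python
def accuracy_always_guess_mode(p_heads: float) -> float:
    """Expected accuracy when always guessing the more likely side."""
    return max(p_heads, 1.0 - p_heads)


def accuracy_probability_matching(p_heads: float) -> float:
    """Expected accuracy when guessing heads at random with probability p_heads."""
    return p_heads * p_heads + (1.0 - p_heads) * (1.0 - p_heads)


p = 0.55
print(accuracy_always_guess_mode(p))     # about 0.55
print(accuracy_probability_matching(p))  # about 0.505
```

The deterministic strategy is right 55% of the time, while the randomized one is right only about 50.5% of the time. Knowing that the odds are 55/45 matters for how confident to be in each guess, not for whether the guesses themselves should be random.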

## Whence comes the uncertainty

Even in early sensory processing, the brain faces substantial uncertainty about the ‘true’ state of the world. I like to break this problem down into two parts that I call ambiguity and noise. Using visual terms,

1. ambiguity arises when many different ‘world states’ can give rise to exactly the same image.
2. noise refers to the fact that subtly different images might evoke the same activity in the brain.

The chessboard image shown here is a classic example of how ambiguity arises and how the brain resolves it. White squares in the cylinder’s shadow are exactly the same color as dark squares outside the shadow, yet they are perceived differently. To put it another way, exactly the same image (a gray square) was created from different world states (light square + shadow or dark square + no shadow). It is easy to think that this is a trivial problem since we so quickly and effortlessly perceive the true nature of the scene, but these same mechanisms make us susceptible to other kinds of illusions.

Noise occurs partly because neurons are imperfect, so the same image on the retina evokes different activity in visual cortex at different times. Noise also occurs when irrelevant parts of a scene are changing, like dust moving on a camera lens, or small movements of an object while trying to discern what it is. These are called “internal noise” and “external noise” respectively. Strictly speaking, noise as described here is not by itself an issue; the same input may map to many different patterns of neural activity in the brain, but as long as we can invert the mapping there is no problem! The only time we cannot do this inversion is when the noise results in overlapping neural patterns for different states in the world.1 That is, noise is only an issue if it results in ambiguity!

Noise explains why there is a limit to our ability to make extremely fine visual distinctions, like the difference between a vertical line and a line tilted off of vertical by a small fraction of a degree – similar enough inputs will have indistinguishable patterns of neural activity.
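A standard textbook idealization makes this limit quantitative. Suppose (as an assumption, not a claim about real neurons) that each of two line orientations evokes a single scalar response corrupted by Gaussian noise with the same standard deviation. An ideal observer who places a criterion halfway between the two mean responses then makes errors at the rate computed below.

```python
import math


def ideal_observer_error(delta: float, sigma: float) -> float:
    """Error rate for telling apart two stimuli whose mean responses differ by
    delta, given equal-variance Gaussian noise with standard deviation sigma."""
    d_prime = delta / sigma
    # Error = Phi(-d'/2), written with the complementary error function.
    return 0.5 * math.erfc(d_prime / (2.0 * math.sqrt(2.0)))


sigma = 1.0  # hypothetical noise level, in the same units as delta
for delta in (0.0, 0.1, 1.0, 5.0):
    print(delta, ideal_observer_error(delta, sigma))
```

When the difference is tiny relative to the noise, the response distributions overlap almost completely and the error rate approaches chance (0.5). As the difference grows, the overlap and the error rate shrink toward zero.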

Finally, it is important to note that an information bottleneck in the visual system also indirectly implies a kind of noise.2 Information bottlenecks arise whenever a ‘channel’ can take on fewer states than the messages sent across it; a classic example is that the optic nerve has too few axons to transmit the richness of all retinal patterns.

The fact that an information bottleneck implies uncertainty is counter-intuitive at first since it makes no statement about how any particular image or scene is affected.3 Perhaps it is more intuitive to think of an information bottleneck as a kind of continuous many-to-one mapping, where different inputs are forced to map to similar neural states. As seen in the overlap between states “A” and “B” in the second illustration above, a many-to-one mapping cannot be inverted, so there must be uncertainty about the true scene.
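The many-to-one intuition can be shown with a deliberately tiny example: 8 possible "images" squeezed through a channel that can take only 4 states. The encoding rule here is an arbitrary choice made for illustration; any rule would have to merge some inputs, by the pigeonhole principle.

```python
from collections import defaultdict


def encode(image_id: int, n_channel_states: int, n_images: int) -> int:
    """Map an image index onto one of the (fewer) available channel states."""
    return image_id * n_channel_states // n_images


def preimages(n_images: int, n_channel_states: int) -> dict:
    """For each channel state, list every image that could have produced it."""
    inverse = defaultdict(list)
    for image_id in range(n_images):
        inverse[encode(image_id, n_channel_states, n_images)].append(image_id)
    return dict(inverse)


print(preimages(n_images=8, n_channel_states=4))
# {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```

Each channel state is consistent with two different images, so an observer who sees only the channel state is left with residual uncertainty about which image was shown.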

## Wrapping up

Wherever there is uncertainty, the optimal thing to do is to play the odds and think probabilistically. As I hope to have conveyed in this post, uncertainty about the ‘true’ state of the world is a ubiquitous problem for perception, though it may not seem so introspectively. Future posts will elaborate on the process of inference, which resolves such uncertainty and settles on the most likely interpretation(s) of an image or scene.

### Footnotes

1. advanced readers may recognize this as the logic behind information-limiting correlations.[3]

2. for those familiar with information theory, this is because an upper bound on the mutual information between input (an image) and evoked neural activity implies a lower bound on the conditional entropy of the neural activity given the input. For those unfamiliar with information theory, stay tuned for a future post =)

3. rate distortion theory allows one to make more concrete statements about this mapping, but only having assumed a loss function that quantifies how bad it is to lose some details relative to others

### References

[1] Knill, D. C., & Pouget, A. (2004). The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), 712–9. http://doi.org/10.1016/j.tins.2004.10.007

[2] von Helmholtz, Hermann. (1867). Handbuch der physiologischen Optik.

[3] Moreno-Bote, R., Beck, J. M., Kanitscheider, I., Pitkow, X., Latham, P., & Pouget, A. (2014). Information-limiting correlations. Nature Neuroscience, 17(10), 1410–1417. http://doi.org/10.1038/nn.3807

## Hello World

Hello World! This is a bimonthly blog about “the brain”, through the lens of computational models of cognition and perception.

Why Box & Arrow Brain?
The earliest “models” of cognition consisted of boxes and arrows. In case you’ve missed them, here are a few examples:

Box and arrow models are intuitive, easy to draw, and are generally a good starting point for understanding a complex system; however, by modern standards they are horribly imprecise, often relying on implicit (or unclear) theoretical and philosophical commitments. Modern/surviving theories fill in these gaps by formalizing mathematical architectures or explicitly stating “linking functions” (including data analysis assumptions). This is a step in the right direction, but we have by no means reached the finish line.

In this blog, we have two goals:
1) To introduce modern cognitive modeling frameworks via tutorials; and
2) To provide some background and commentary on some of the theoretical and philosophical commitments that often go unmentioned.

We have two contributing authors. As of the inception of this blog:

Richard Lange did some computer science things for a while, then got interested in artificial intelligence and philosophy of mind, but realized nobody really knows how brains work (which would be a good first step), and now studies visual perception in humans and monkeys.

Frank Mollica is a person. He spends far too much time (and yet still not enough) thinking about language, concepts and contexts. Follow him on twitter @FrancisMollica because shameless self-promotion (and new friends/enemies?).