How many times did Pavlov have to ring the bell before his dogs’ meals until the dogs began to salivate? Surely the number of experiences must make a difference, as anyone who’s trained a dog would attest. As described in a brilliant article by C.R. Gallistel (in Psych. Review; preprint here), this has been thought so self-evident “as to not require experimental demonstration” – yet information-theoretic analysis suggests the idea is incorrect, at least when the time from the bell to the food is constant. More problematic is the fact that the whole issue is ill-formed for experimental verification: technically speaking, one can never actually accept the (null) hypothesis that some experimental manipulation has no effect. But as Gallistel says, while “conventional statistical analysis cannot support [the null hypothesis]; Bayesian analysis can.”
First, some basics. Everyone knows that conventional statistics are judged (largely) on the basis of p-values – and that a p-value indicates the probability of obtaining data at least as extreme as those you observed if the null hypothesis were actually true. Thus, one might think that a p-value of .5 (50%) means the null is just as likely to be correct as incorrect, but this is completely wrong: as elegantly discussed here, p-values constitute evidence against the null, and a lack of evidence against something is not evidence for it! (This logical fallacy is called the argument from ignorance.) Of course, with a high enough sample size or statistical power, non-significant p-values may suggest that there is no large non-null effect … but again, that does not constitute evidence for the null! (See the quick exposition here about why this method can’t be used to prove the null.)
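One quick way to see why a “large” p-value is not evidence for the null: when the null is actually true, p-values are uniformly distributed, so a p of .5 is no more probable under the null than any other value. Here is a minimal simulation sketch in Python (the sample sizes and seed are arbitrary; this is not from Gallistel’s article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many two-sample t-tests in which the null is TRUE:
# both groups are drawn from the very same normal distribution.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])

# Under the null, p-values are uniform on [0, 1]: roughly half land
# above .5, a tenth above .9, and so on. A p of .5 therefore says
# nothing about how likely the null itself is to be true.
print(np.mean(p_values > 0.5))  # ~0.5
print(np.mean(p_values > 0.9))  # ~0.1
```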
How do Bayesian statistics allow one to prove the null? I think this is best illustrated with the example provided by Gallistel for a Bayesian version of a two-sample t-test, paraphrased here. I’ve kept much of the terminology used by “the Bayesians,” as opposed to using less formal language, since the mathematical steps below define these otherwise arcane terms. In addition, this process has been implemented in MATLAB, in JavaScript, and in Excel VBA (just updated last week), for those who prefer to understand by reading code and/or tinkering. (EDIT: but see Bob’s important caveats about this particular approach).
First, we need to assume a statistical model (Gaussian, Bernoulli, etc. – we’ll say Gaussian) characterizing our data. Second, we need to specify the “null prior” over the parameters defining that statistical model – in our Gaussian case, the mean and standard deviation of one of the two samples (only one, since the samples are supposed to be the same under the null hypothesis). In a sense, this null prior is the uncertainty we have about what we’d expect if there is no effect. Third, we need to specify the largest plausible effect size we could observe (this can be a little subjective, although the space of possible effect sizes can be explored systematically, as described below), and convolve this with the null prior to arrive at the distribution we’d expect to see under the alternative hypothesis – the “alternative prior.” Fourth, we calculate the likelihood function of the data, which is the probability of observing these data for each possible value of the parameters defining our statistical model (in our Gaussian case, mean and standard deviation), and multiply this with our two prior distributions (the null prior from step 2 and the alternative prior from step 3), yielding two “posteriors.” Fifth, we take their integrals (which yield two marginal likelihoods – the extent to which each prior fits the observed likelihood function) and calculate their ratio. This ratio (often called the Bayes factor) quantifies the evidence in favor of, or against, the null.
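For those who’d rather read code, here is a deliberately simplified sketch of this logic in Python. It is not Gallistel’s implementation (nor any of the linked ones): the null is treated as a point hypothesis (a difference of exactly zero) rather than as a full prior over the model’s parameters, the alternative prior is simply a flat distribution over differences up to some largest plausible effect, and the function name and arguments are made up for illustration.

```python
import numpy as np
from scipy import stats

def bayes_factor_for_null(sample_a, sample_b, max_effect):
    """Odds in favor of the null (no difference in means) versus a 'vague'
    alternative allowing a true difference anywhere from 0 to max_effect.
    Assumes Gaussian data with equal variance in the two samples."""
    n_a, n_b = len(sample_a), len(sample_b)

    # Pooled standard deviation (equal-variance assumption, as in step 2)
    pooled_var = (((n_a - 1) * np.var(sample_a, ddof=1) +
                   (n_b - 1) * np.var(sample_b, ddof=1)) / (n_a + n_b - 2))
    se_diff = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    observed_diff = np.mean(sample_b) - np.mean(sample_a)

    # Likelihood of the observed difference in means, as a function of the
    # hypothetical true difference d (Gaussian, spread = standard error)
    def likelihood(d):
        return stats.norm.pdf(observed_diff, loc=d, scale=se_diff)

    # Null hypothesis: the true difference is exactly zero
    # (a point null, simpler than a full null prior over mean and sd)
    marginal_null = likelihood(0.0)

    # Alternative prior: true difference uniform on [0, max_effect].
    # Averaging the likelihood over this grid approximates the integral of
    # likelihood(d) times the prior density (1 / max_effect), i.e. the
    # marginal likelihood under the alternative.
    d_grid = np.linspace(0.0, max_effect, 2000)
    marginal_alt = np.mean(likelihood(d_grid))

    # Ratio of marginal likelihoods: values > 1 favor the null
    return marginal_null / marginal_alt
```

Even in this stripped-down form, the key property survives: the more room the alternative prior leaves for effects the data do not show, the more the ratio tilts toward the null.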
Perhaps a more concrete example would be helpful – here I’ll summarize Gallistel’s analysis of data from classical conditioning to show that, in contradiction to numerous models of the phenomenon, acquisition in classical conditioning occurs regardless of the number of trials (in colloquial terms, the number of times Pavlov rang the bell before feeding his dog had no effect on when the dog began to salivate, at least for any number greater than one!). Gallistel reanalyzed data from a previous experiment showing that the block during which an animal first showed a conditioned response was the same regardless of whether those blocks contained 4 or 32 training trials.
Step 1: assume a statistical model. The number of blocks to acquisition looked very normally distributed, so Gallistel assumes a Gaussian model.
Step 2: specify the null prior. Gallistel estimates the standard deviation from the pooled data (the data can be pooled because Gallistel assumes equal variance between the two samples – the same assumption made in a run-of-the-mill “frequentist” t-test).
Step 3: specify the largest plausible effect size. Gallistel specifies two alternative models – one in which trials matter linearly (such that it should take the animals in the 4-trials-per-block group 8 times as many blocks as those in the 32-trials-per-block group, since 32/4 = 8), and one in which they matter sublinearly (it should take the animals in the 4-trials-per-block group somewhere between 1 and 8 times as long).
Steps 4 & 5: Calculate the likelihood of the data and multiply to get the posteriors; integrate to get the marginal likelihoods, and calculate their ratio to evaluate evidence for or against the null. Even before integrating, the answer was obvious: Gallistel finds that the posterior probability density function of the null hypothesis mimics the likelihood function of the data almost exactly. Nonetheless, after integrating and calculating the ratio of marginal likelihoods, the evidence is at least 32:1 in favor of the null over the alternative hypothesis that the number of trials has a sublinear effect on conditioning. When using the more restricted linear alternative, the odds in favor of the null increase to 216:1.
This highlights one of Gallistel’s themes: the vagueness of the non-null hypothesis has a large influence on the odds in favor of the null. That is, very “vague” alternatives allow a wide range of effect sizes and therefore spread their predictions over many different outcomes; correspondingly, the probability they assign to any one of those outcomes is lower, and the relative odds on the null (given the observed outcome) will increase, all else being equal. Given this “problem of the prior,” Gallistel recommends calculating the odds on the null as a function of many different plausible effect sizes; in this way, one can incrementally reduce the largest plausible effect size to see if any alternative ever produces a better fit than the null. In the case of Pavlov’s dogs, Gallistel shows that “the null is literally unbeatable.” (EDIT: again, if you haven’t looked yet, read Bob’s comment on this method).
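Here is what that sweep might look like, reusing the hypothetical bayes_factor_for_null sketch from above. The samples below are synthetic stand-ins generated with no true difference between groups – they are not Gallistel’s data – so this only illustrates the procedure, not his result:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in samples (synthetic, illustration only): blocks-to-acquisition
# for two groups between which there is, by construction, no real difference.
group_a = rng.normal(loc=10.0, scale=3.0, size=20)
group_b = rng.normal(loc=10.0, scale=3.0, size=20)

# Shrink the largest plausible effect size step by step and check whether
# any alternative ever fits the data better than the null (odds < 1).
for max_effect in [16.0, 8.0, 4.0, 2.0, 1.0, 0.5]:
    odds = bayes_factor_for_null(group_a, group_b, max_effect)
    print(f"largest plausible effect = {max_effect:5.1f} -> odds on the null = {odds:.1f}:1")
```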
For those who would protest this conclusion, Gallistel offers two additional examples of how this method can show odds against the null, or odds only weakly in favor of the null. To my eyes, the method seems very sound, and the implications profound. The implications for computational models of conditioning will be discussed in a future post. For now, however, it’s enough to say that it’s not necessary to “give up” on testing null hypotheses: Gallistel has provided a very clear Bayesian recipe for proving the null. Null hypotheses should no longer be considered a no-man’s land in theory development.
Some say scientists fall into two categories: the “lumpers” and the “splitters.” The lumpers prefer to gloss over what they see as unimportant distinctions and look at the emerging big picture in a particular field, whereas splitters tend to assume every measurable difference matters. In general, it seems to me that science is biased towards the splitters, not only because of science’s traditionally reductionist method (as opposed to the more computationally-oriented “reconstructionist” method, which I’ve written about before) but also because our statistical tools could only be used to support hypotheses developed by the “splitters.” That is, traditional statistics could only definitively tell us when two things are different, but not when they are the same. Maybe this kind of Bayesian method for “proving the null” could be used to achieve a better balance.