Pavlov's Dogs: Proving the Null With Bayesianism

How many times did Pavlov ring the bell before his dogs' meals until the dogs began to salivate? Surely the number of experiences must make a difference, as anyone who's trained a dog would attest. As described in a brilliant article by C.R. Gallistel (in Psych. Review; preprint here), this has been thought so self-evident "as to not require experimental demonstration" - yet information-theoretic analysis suggests the idea is incorrect, at least when the time from the bell to the food is constant. More problematic is the fact that the whole issue is ill-formed for experimental verification: technically speaking, one can never actually accept the (null) hypothesis that some experimental manipulation has no effect. But as Gallistel says, "conventional statistical analysis cannot support [the null hypothesis]; Bayesian analysis can."

First, some basics. Everyone knows that conventional statistics are judged (largely) on the basis of p-values - and that a p-value indicates the probability of seeing data at least as extreme as what you observed if the null hypothesis is actually true. Thus, one might think that a p-value of .5 (50%) means that the null is just as likely to be correct as incorrect, but this is completely wrong: as elegantly discussed here, p-values constitute evidence against the null, and a lack of evidence against something is not evidence for it! (This logical fallacy is called the argument from ignorance.) Of course, with a high enough sample size or statistical power, non-significant p-values may suggest that there is no large non-null effect ... but again that does not constitute evidence for the null! (See the quick exposition here about why this method can't be used to prove the null.)
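
To make the point concrete, here's a quick simulation (my own sketch in Python, not from the article): when the null really is true, p-values from a two-sample t-test land roughly uniformly between 0 and 1, so observing p = .5 is entirely unremarkable and tells you essentially nothing in favor of the null.

```python
# A minimal sketch: simulate two-sample t-tests when the null is exactly true.
# The resulting p-values are approximately uniform, so a "large" p-value like
# .5 is expected about half the time even though there is no effect at all.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=20)   # both groups drawn from
    b = rng.normal(0.0, 1.0, size=20)   # the very same distribution
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)

p_values = np.array(p_values)
print("fraction of p-values above .5:", (p_values > 0.5).mean())          # roughly 0.5
print("fraction below .05 (false positives):", (p_values < 0.05).mean())  # roughly 0.05
```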

How do Bayesian statistics allow one to prove the null? I think this is best illustrated with the example provided by Gallistel for a Bayesian version of a two-sample t-test, paraphrased here. I've kept much of the terminology used by "the Bayesians," as opposed to using less formal language, since the mathematical steps below define these otherwise arcane terms. In addition, this process has been implemented in matlab, in javascript, and in Excel VBA (just updated last week), for those who prefer to understand by reading code and/or tinkering. (EDIT: but see Bob's important caveats about this particular approach).

First, we need to assume a statistical model (Gaussian, Bernoulli, etc. - we'll say Gaussian) characterizing our data. Second, we need to specify the "null prior" using the parameters defining that statistical model - in our Gaussian case, the mean and standard deviation of one of the two samples (only one, since they are supposed to be the same under the null hypothesis). In a sense, this null prior is the uncertainty we have about what we'd expect if there is no effect. Third, we need to specify the largest plausible effect size we could observe (this can be a little subjective, although the space of possible effect sizes can be systematically explored, as described below), and convolve this with the null prior distribution to arrive at the distribution we'd expect to see under the alternative hypothesis - the "alternative prior." Fourth, we calculate the likelihood function of the data - the probability of observing these data for each possible value of the parameters defining our statistical model (in our Gaussian case, mean and standard deviation) - and multiply it with each of our two prior distributions (the null prior from step 2 and the alternative prior from step 3), yielding two "posteriors." Fifth, take their integrals (which yield two marginal likelihoods - the extent to which each prior fits the observed likelihood function) and calculate their ratio. This ratio - the Bayes factor - quantifies the evidence in favor of the null.
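
For the programmatically inclined, here's a stripped-down Python sketch of the flavor of this calculation. To be clear, this is not Gallistel's exact recipe (it uses a point null on the difference in means, a normal approximation to its likelihood, and a simple uniform prior on the difference rather than the convolution he describes - and see Bob's caveats in the comments), but it shows how a ratio of marginal likelihoods favoring the null falls out of the arithmetic.

```python
# A simplified sketch, not Gallistel's exact procedure: compare a point null
# (difference = 0) against an alternative that spreads the true difference
# uniformly up to a largest plausible effect size, using a normal
# approximation to the likelihood of the observed difference in means.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayes_factor_for_null(group_a, group_b, max_effect):
    """Odds favoring 'no difference' over 'difference uniform on [-max_effect, max_effect]'."""
    d_obs = np.mean(group_a) - np.mean(group_b)
    se = np.sqrt(np.var(group_a, ddof=1) / len(group_a) +
                 np.var(group_b, ddof=1) / len(group_b))

    # Marginal likelihood under the null: the difference is fixed at zero.
    m_null = stats.norm.pdf(d_obs, loc=0.0, scale=se)

    # Marginal likelihood under the alternative: average the likelihood of the
    # observed difference over a uniform prior on the true difference.
    m_alt, _ = quad(lambda delta: stats.norm.pdf(d_obs, loc=delta, scale=se) / (2 * max_effect),
                    -max_effect, max_effect)
    return m_null / m_alt

# Purely hypothetical data, for illustration only:
rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=30)
b = rng.normal(10.0, 2.0, size=30)
print("odds in favor of the null:", bayes_factor_for_null(a, b, max_effect=5.0))
```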

Perhaps a more concrete example would be helpful - here I'll summarize Gallistel's analysis of data from classical conditioning to show that, in contradiction to numerous models of the phenomenon, acquisition in classical conditioning occurs regardless of the number of trials (in colloquial terms, how many times Pavlov rang the bell before feeding his dog had no effect on the dog's salivation - at least, for any number of times greater than one). Gallistel reanalyzed data from a previous experiment showing that the block during which an animal first showed a conditioned response would be the same regardless of whether those blocks contained 4 or 32 training trials.

Step 1: assume a statistical model. The number of blocks to acquisition looked very normally distributed, so Gallistel assumes a Gaussian model.
Step 2: specify the null prior. Gallistel estimates a standard deviation on the pooled data (it can be pooled because Gallistel assumes equal variance between the two samples - the same as in the run-of-the-mill t-test in "frequentist" statistics).
Step 3: specify the largest plausible effect size. Gallistel specifies two alternative models - one in which trials matter linearly (such that it should take the animals in the 4-trials per block group 8 times as long as those in the 32-trials per block group), and one in which they matter sublinearly (it should take the animals in the 4-trials per block group somewhere between 1 and 8 times as long).
Steps 4 & 5: Calculate the likelihood of the data and multiply to get the posteriors; integrate to get the marginal likelihoods, and calculate their ratio to evaluate evidence for or against the null. Even before integrating, the answer was obvious: Gallistel finds that the posterior probability density function of the null hypothesis mimics the likelihood function of the data almost exactly. Nonetheless, after integrating and calculating the ratio of marginal likelihoods, the evidence is at least 32:1 in favor of the null over the alternative hypothesis that the number of trials has a sublinear effect on conditioning. When using the more restricted linear alternative, the odds in favor of the null increase to 216:1.
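
For those who like to see the moving parts, here's a hedged Python sketch of the structure of this comparison. The numbers are invented, the normal-approximation shortcuts from the sketch above are reused, and only the sublinear alternative (a uniform prior on the factor between 1 and 8) is included, so don't expect it to reproduce Gallistel's 32:1 or 216:1.

```python
# A hedged sketch of the comparison structure only (invented numbers, not the
# actual data or Gallistel's exact computation).  The null says both groups
# reach acquisition in the same number of blocks (factor = 1); the sublinear
# alternative spreads the factor uniformly between 1 and 8.
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(2)
# Hypothetical blocks-to-acquisition scores, for illustration only.
blocks_4_trials = rng.normal(6.0, 1.5, size=12)
blocks_32_trials = rng.normal(6.0, 1.5, size=12)

mu32 = blocks_32_trials.mean()
d_obs = blocks_4_trials.mean() - blocks_32_trials.mean()
se = np.sqrt(blocks_4_trials.var(ddof=1) / len(blocks_4_trials) +
             blocks_32_trials.var(ddof=1) / len(blocks_32_trials))

def like(factor):
    """Approximate likelihood of the observed mean difference if the 4-trial
    group truly needs `factor` times as many blocks as the 32-trial group."""
    return stats.norm.pdf(d_obs, loc=(factor - 1.0) * mu32, scale=se)

m_null = like(1.0)                                        # factor exactly 1
m_sublinear, _ = quad(lambda f: like(f) / 7.0, 1.0, 8.0)  # factor uniform on [1, 8]
print("odds on the null vs. the sublinear alternative:", m_null / m_sublinear)
```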

This highlights one of Gallistel's themes: the vagueness of the non-null hypothesis has a large influence on the odds favoring the null hypothesis. That is, very "vague" alternatives allow a wide range of effect sizes and therefore spread their predictions over many different outcomes; correspondingly, the probability they assign to any one outcome is lower, and the relative odds on the null (given some outcome) will increase, all else being equal. Given this "problem of the prior," Gallistel recommends calculating the odds on the null as a function of many different plausible effect sizes; in this way, one can incrementally reduce the largest plausible effect size to see if it ever produces a better fit than the null. In the case of Pavlov's dogs, Gallistel shows that "the null is literally unbeatable." (EDIT: again, if you didn't look yet, read Bob's comment on this method.)
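
Here's a rough Python sketch of that scan (my own simplification, not Gallistel's code). Notice that, in line with Bob's point below, the odds drift toward 1:1 as the largest allowed effect shrinks toward zero, since the two models become indistinguishable there.

```python
# A minimal sketch of scanning the largest plausible effect size: recompute
# the odds on the null for a range of maximum effect sizes and see whether
# any alternative ever beats it.  Data are hypothetical, with no true effect.
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=25)   # hypothetical data, no true difference
b = rng.normal(0.0, 1.0, size=25)

d_obs = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def odds_on_null(max_effect):
    m_null = stats.norm.pdf(d_obs, loc=0.0, scale=se)
    m_alt, _ = quad(lambda d: stats.norm.pdf(d_obs, loc=d, scale=se) / (2 * max_effect),
                    -max_effect, max_effect)
    return m_null / m_alt

for max_effect in (0.1, 0.25, 0.5, 1.0, 2.0):
    print(f"largest plausible effect {max_effect:4.2f}: odds on the null = {odds_on_null(max_effect):.2f}")
```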

For those who would protest this conclusion, Gallistel offers two additional examples of how this method can show odds against the null, or odds only weakly in favor of the null. To my eyes, the method seems very sound, and the implications profound. The implications for computational models of conditioning will be discussed in a future post. For now, however, it's enough to say that it's not necessary to "give up" on testing null hypotheses: Gallistel has provided a very clear Bayesian recipe for proving the null. Null hypotheses should no longer be considered a no-man's land in theory development.

Some say scientists fall into two categories: the "lumpers" and the "splitters." The lumpers prefer to gloss over what they see as unimportant distinctions and look at the emerging big picture in a particular field, whereas splitters tend to assume every measurable difference matters. In general, it seems to me that science is biased towards the splitters, not only because of science's traditionally reductionist method (as opposed to the more computationally-oriented "reconstructionist" method, which I've written about before) but also because our statistical tools could only be used to support hypotheses developed by the "splitters." That is, traditional statistics could only definitively tell us when two things are different, but not when they are the same. Maybe this kind of Bayesian method for "proving the null" could be used to achieve a better balance.

You can't prove any hypothesis, even with a Bayesian approach. You can at best give odds of one hypothesis over a specified set of alternatives, and show those odds are high. This only works within the space of hypotheses you've explicitly chosen, however, and says nothing about the probability of the hypotheses relative to alternate hypotheses you didn't include in your test.

For that matter, I've never understood the point of null hypothesis testing. The null hypothesis is virtually always false, and this can always be proven with enough data: real processes almost never follow any theoretical distribution exactly, and the departure between the two will always show up once your sample size is large enough.
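
A quick sketch of this point in Python (an illustration with made-up numbers, not anyone's actual analysis): give two groups a difference too small to care about, and the p-value will still collapse once the sample is big enough.

```python
# With a tiny but nonzero true difference, the "null" is eventually rejected
# as the sample size grows, even though the departure is practically trivial.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
tiny_effect = 0.02   # a difference nobody would care about in practice

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(tiny_effect, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    print(f"n per group = {n:>9,}: p = {p:.4g}")
```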

IMHO a Bayesian approach which is better than Bayes factors for hypothesis tests is just to compute the posterior probability that the effect has a given magnitude. Then you can integrate to find the probability that the true effect lies within a "practically small" distance of the null hypothesis, where "practically small" is defined relative to the question being asked (and could be extended to a formal decision-theoretic loss function), rather than pretending that the exact null could ever be true.
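
A minimal sketch of that approach in Python (an illustration of the idea only, assuming a flat prior and a normal approximation to the posterior for the difference in means; the "practically small" threshold below is an arbitrary placeholder):

```python
# Integrate the posterior mass for the difference in means that falls within
# a "practically small" region around zero, instead of testing an exact null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(10.0, 2.0, size=40)   # hypothetical data
b = rng.normal(10.1, 2.0, size=40)

d_obs = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# With a flat prior, the posterior for the true difference is approximately
# Normal(d_obs, se).  "Practically small" must be chosen for the question at
# hand; 0.5 here is an arbitrary placeholder.
practically_small = 0.5
posterior = stats.norm(loc=d_obs, scale=se)
p_near_zero = posterior.cdf(practically_small) - posterior.cdf(-practically_small)
print(f"posterior probability that |true difference| < {practically_small}: {p_near_zero:.3f}")
```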

By Ambitwistor (not verified) on 30 Jun 2009 #permalink

OMG. I wish people advocating for a Bayesian approach actually knew the first thing about what they are writing about. Gallistel doesn't even know what a likelihood is: footnote 1 on p3 is just clueless.

very cool. Thanks for bringing it to my attention. It'll take me a bit to digest it all, but it looks very promising.

Wow, and I expected any complaints to be about the conditioning data ;)

Bob, can you say whether my summary is an inaccurate representation of the Bayesian approach? I would like to appropriately qualify it if I'm summarizing someone who is incorrect.

Ambitwistor, I agree with most of what you've written, except I wouldn't underestimate the impact or utility of an approach to test the null which may, however infrequently, actually be correct. It seems important to have a method to detect these (by your account) infrequent and therefore interesting occurrences, although your points about the extent to which the "exact null" is true are well taken. In this particular example from conditioning, however, the exact null is theoretically (and apparently empirically) true.

Steps 2 and 3 are really about specifying the prior for the two models, and it would be better to present it that way. It's more consistent with how we do things in reality, and is easier to follow.

Setting a point prior (i.e. assuming that the exact value is known) for the difference is odd, and the method of putting a uniform distribution on the difference seems artificial.

Gallistel recommends calculating the odds on the null as a function of many different plausible effect sizes; in this way, one can incrementally reduce the largest plausible effect size to see if it ever produces a better fit than the null. In the case of Pavlov's dogs, Gallistel shows that "the null is literally unbeatable."

This is where he seriously goes off the rails. As the maximum effect size is reduced, the Bayes Factor must approach 1, because it is 1 in the limit. It has to do this monotonically (as long as there is some support from the likelihood on the parameter values that are being excluded). The shape of the curve is determined by the likelihood function.

The explanations in the article are a bit of a mess: some of the stuff is wrong (like not knowing what a likelihood is), and a lot of the rest is a mess. For the first two examples picking a slightly different prior leads to analytic solutions, so his comments about computation are amusing.

He really does nothing to get round the prior sensitivity of Bayes Factors (not surprising: if there was a simple solution, we would have found it already).

If you want an introduction to Bayesian methods, look at Mick McCarthy's book. He does a good job of presenting the ideas in a simple way.

Thanks Bob. I've updated the post with pointers to your obviously well-informed comment about Gallistel's methods. WRT your recommendation, I would prefer a book on Bayesian methods in psychology (or some more human-focused science) - unless, of course, you think that book is not overly specific to ecology. Any ideas?

Chris,

Take a look at Gelman et al. _Bayesian Data Analysis_ and its sequel, Gelman and Hill _Data Analysis Using Regression and Multilevel/Hierarchical Models_. (Gelman also has a blog.) These books are classic and oriented toward the social sciences - more sociology and political science than psychology, but the second book does have a dog shock avoidance experiment example among others. They have a pragmatic applied focus. And as you might guess from my previous comment, these books emphasize graphical predictive checks and effect estimates over formal hypothesis testing. I see hypothesis tests as somewhat useful for model/variable selection (pruning), but not so much for drawing scientific conclusions about reality.

By Ambitwistor (not verified) on 30 Jun 2009 #permalink

@Bob: could you give a little bit more constructive criticism instead of "bit of a mess...stuff is wrong"?

His explanation of the likelihood function (as an unnormalized function of the parameters) is correct. It is not a probability distribution.

The Bayes factor is by definition sensitive to the prior and (subjective) Bayesians are fine with that because you cannot do inference without making assumptions, which many Bayesians try to make explicit.

"the subjectivist (i.e. Bayesian) states his judgements, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science'' (I.J. Good)

Gallistel suggests trying different plausible effect sizes, not the whole continuum.

see also similar articles:

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225-237. preprint

Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E.-J. (in press). How to Quantify Support For and Against the Null Hypothesis: A Flexible WinBUGS Implementation of a Default Bayesian t-test. Psychonomic Bulletin & Review preprint

Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., and Grasman, R. (2009). Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey procedure. preprint

@Ambitwistor: for imprecise tests see

I. Verdinelli and L. A. Wasserman. (1996). Bayes Factors, Nuisance Parameters and Imprecise Tests. Bayesian Statistics 5, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith eds. 765-771. Clarendon Press, Oxford.

I don't know of any books specifically for psychology. Mick's book is only really specific in the examples he uses, so I think it is still good for a new general reader.

If you want to actually use Bayesian methods, Gelman & Hill is a good (if comprehensive) guide.

Incidentally, ignore my comments above about the curve behaving monotonically. That may also be total bollocks. Approaching 1 in the limit is still true, though.

wanders off muttering to himself about normalising constants...

loved the post. Bayesian statistics are of great interest to me, but I understand them less well than I'd like. I also would like a clear (and simple) exposition of Bayesian methods in psychology. Please comment if you have any ideas.

"Assume..."

"Specify..."

It would be simpler just to assume the answer you want and forget the intervening steps.

By umvue.blue (not verified) on 01 Jul 2009 #permalink

I guess my last comment never showed up. In short, for an applied Bayesian social science text, Gelman et al. _Bayesian Data Analysis_ and its sequel, Gelman and Hill (for multilevel modeling) are classics. More sociology and political science than psychology, though.

By Ambitwistor (not verified) on 02 Jul 2009 #permalink

Clearly I don't fully understand this method, but it seems quite impossible to give 32:1 odds for the null versus a very small effect size, e.g. 1 vs. 1.01. It seems that the assumption of sublinearity - an effect "somewhere between 1 and 8" - must boil down to a hypothesis of some mid-point of those values.

I think this Bayesian approach is conceptually (though not exactly) equivalent to the equivalence tests described over at lies and stats (thanks Chris for a pointer to Luk Arbuckle's very useful stats in plain English site)
http://liesandstats.wordpress.com/2008/11/07/but-you-can-show-equivalen…

For equivalence tests, you need to specify how equivalent is equivalent; you're not really testing the null of a zero effect, but rather a small range of effect sizes around zero.

As Luk explains, it's probably simplest and best to calculate a 95% confidence interval. If it's small and roughly centers on zero, you can argue with confidence that any effect is too small to be of interest - that the null is approximately true.
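
A minimal sketch of that check in Python (an illustration with made-up numbers; the equivalence bound is an arbitrary placeholder that would have to be chosen for the question at hand):

```python
# If the 95% CI for the difference falls entirely inside the pre-chosen
# equivalence bounds, the effect is "too small to be of interest."
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
a = rng.normal(10.0, 2.0, size=80)   # hypothetical data
b = rng.normal(10.0, 2.0, size=80)

d = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
df = len(a) + len(b) - 2
half_width = stats.t.ppf(0.975, df) * se
lo, hi = d - half_width, d + half_width

equivalence_bound = 1.0   # "how equivalent is equivalent" - question-specific
print(f"95% CI for the difference: [{lo:.2f}, {hi:.2f}]")
print("entirely within the equivalence bounds:", -equivalence_bound < lo and hi < equivalence_bound)
```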

Of course, you need a fair amount of power to really get that CI down to a small enough "near zero" size, but the same is true of the standard tests for differences if they're interpreted correctly, as explained in "Why most published research findings are false,"
http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.00201…

another fascinating ref from Chris on his latest post

I've long felt that the "lumpers" need a test to counteract the standard "splitter" statistics - but I've never before realized how simple it was to calculate the proper stats!

Thanks for a great thread, Chris.

By Seth Herd (not verified) on 29 Jul 2009 #permalink