Applied Statistics

Jeremy Miles pointed me to this article by Leonhard Held with what might seem like an appealing brew of classical, Bayesian, and graphical statistics:

P values are the most commonly used tool to measure evidence against a hypothesis. Several attempts have been made to transform P values to minimum Bayes factors and minimum posterior probabilities of the hypothesis under consideration. . . . I [Held] propose a graphical approach which easily translates any prior probability and P value to minimum posterior probabilities. The approach allows to visually inspect the dependence of the minimum posterior probability on the prior probability of the null hypothesis. . . . propose a graphical approach which easily translates any prior probability and P value to minimum posterior probabilities. The approach allows to visually inspect the dependence of the minimum posterior probability on the prior probability of the null hypothesis.

I think the author means well, and I believe that this tool might well be useful in his statistical practice (following the doctrine that it’s just about always a good idea to formalize what you’re already doing).

That said, I really don’t like this sort of thing. My problem with this approach, as indicated by my title above, is that it’s trying to make p-values do something they’re not good at. What a p-value is good at is summarizing the evidence regarding a particular misfit of model do data.

Rather than go on and on about the general point, I’ll focus on the example (which starts on page 6 of the paper). Here’s the punchline:

At the end of the trial a clinically important and statistically significant difference in
survival was found (9% improvement in 2 year survival, 95% CI: 3-15%.

Game, set, and match. If you want, feel free to combine this with prior information and get a posterior distribution. But please, please, parameterize this in terms of the treatment effect: put a prior on it, do what you want. Adding prior information can change your confidence interval, possibly shrink it toward zero–that’s fine. And if you want to do a decision analysis, you’ll want to summarize your inference not merely by an interval estimate but by a full probability distribution–that’s cool too. You might even be able to use hierarchical Bayes methods to embed this study into a larger analysis including other experimental data. Go for it.

But to summarize the current experiment, I’d say the classical confidence interval (or its Bayesian equivalent, the posterior interval based on a weakly informative prior) wins hands down. And, yes, the classical p-value is fine too. It is what it is, and its low value correctly conveys that a difference as large as observed in the data is highly unlikely to have occurred by chance.