Marilyn Mann pointed me to an interesting post by David Rind over at Evidence in Medicine (thanks!). It’s a follow-on to an earlier post of his about the importance of plausibility in interpreting medical literature, a subject that deserves a post of its own. In fact the piece at issue, “HIV Vaccines, p values, and Proof,” raises lots of interesting questions about different ways to view what we mean by probability (the nub of the frequentist/Bayesian debate), the difference between proof and evidence, the thorny, mental-cramp-producing question of multiple comparisons, and finally this observation:

I’ve been struck since the beginning of my medical training that biostatistics is an area of mathematics where people can have remarkably different opinions about what is or is not reasonable. I don’t usually think of “opinion” and “mathematics” falling in the same sentence, but perhaps that just shows my lack of expertise in theoretical math. (David Rind, Evidence in Medicine)

Rind gives an essentially correct version of what a p-value is, at least from the frequentist perspective. Well it’s close enough for practical statistics, anyway. Here’s what he says about a difference between HIV infections in a randomized, placebo-controlled trial of a vaccine that showed a difference in the two groups that had a p-value of .04 (4%):

The analysis that showed the statistically significant benefit (p = 0.04), was a reasonable one. There were 56 people who developed HIV in the vaccine arm, and 76 in the placebo arm, but seven patients were found to be infected with HIV at the time the vaccine was administered, and this analysis excluded those patients. Before we come back to that decision, we should keep in mind what that p value means — what it is that has a 4% chance of having happened?

Although this causes untold confusion for physicians at all levels of training, that p value says that (excluding any problems with the design or performance of the trial), if vaccine were no better than placebo we would expect to see a difference as large or larger than the one seen in this trial only 4 in 100 times. This is distinctly different from saying that there is a 96% chance that this result is correct, which is how many people wrongly interpret such a p value.

[NB: a reader says these numbers are for intention to treat and gives a different p-value, but the point was about p-values, not the vaccine paper, so just assume the .04 number as given in Rind's post is correct or change the p-value to .08; what the p-value means is the same].

It’s a little hard to say how Rind thinks of probability, since this is a distinctly frequentist formulation although at other points he talks like a Bayesian. You don’t have to follow that difference at this juncture, but I’d like to unpack his version a bit more so you can see some of the machinery under it.

p-values are usually associated with the frequentist perspective. They are the product of some statistical test (I’m not sure which one was used in the paper, but it doesn’t matter for purposes of this discussion). Statistical tests these days can be very sophisticated mathematically but most have the same kind of underlying idea. For a frequentist, probability is synonymous with the idea of the long run frequency of an event, hence the term “frequentist.” The classic example is flipping a coin and counting the frequency (number of times) it comes up heads or tails. If you flip a fair coin 10 times you might get 6 heads and 4 tails by chance. The frequency of heads is thus 60%, but we wouldn’t say from so little data this proves the coin was unfair, with a 60% probability of coming up heads. If we wanted a better idea of the coin’s “true” probability of coming up heads we’d flip it many more times, say 1000. Or 10,000. The more times we flip it the better fix we get on the true probability of it coming up heads (which for a coin that was truly “fair” would be 50%). If we imagine flipping a coin over and over and measuring the frequency of heads as we go along, it should approach the true probability as the number of flips gets very, very large. If we perform the mathematical magic of increasing the number of flips without limit and look to see what relative frequency is approached closer and closer, that would be the *definition* of true probability for a frequentist (Bayesians have a completely different notion of probability based on degree of belief, confidence or, if you want to spell it out, how much they are willing to bet on one outcome or the other as a measure of degree of belief).
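To make the long-run-frequency idea concrete, here is a minimal sketch that simulates flipping a coin and reports the fraction of heads for larger and larger numbers of flips. The function name `heads_frequency`, the fixed seed, and the particular flip counts are my own choices for illustration, and a pseudo-random number generator is only a stand-in for a real coin.

```python
import random

def heads_frequency(n_flips, p_heads=0.5, seed=0):
    """Simulate n_flips coin flips and return the fraction that came up heads.

    p_heads is the assumed "true" probability of heads (0.5 for a fair coin).
    The seed is fixed only so that repeated runs give the same numbers.
    """
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_flips) if rng.random() < p_heads)
    return heads / n_flips

# The observed frequency wanders for small n and settles near p_heads as n grows.
for n in (10, 100, 1000, 10_000, 100_000):
    print(f"{n:>7} flips: frequency of heads = {heads_frequency(n):.4f}")
```

With only 10 flips the frequency can easily be 40% or 60%; with 100,000 flips it should sit very close to 50%. The frequentist definition takes that idea to its limit.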

It’s a little hard to see what the frequentist perspective has to do with the vaccine trial at first. After all, we aren’t doing the trial over and over again. We are only doing it once. We have to imagine something like this. We have a giant (actually infinite) population of people at risk of contracting HIV. Let’s pull out 16,000 people (the number is from the vaccine trial) and give half the vaccine and leave half without vaccine. We assume that each of these people has a certain probability of contracting HIV in 5 years. So in 5 years we count up the number of HIV infected in each group (which in the case of the vaccine paper was 56 for the vaccinated group and 76 for the unvaccinated group). Imagine each person’s outcome as a biased coin flip with the chance of coming up heads equal to the chance of being infected. If the chance of getting HIV were exactly the same in both groups, you might still see a different number of infections by chance, just as you might see a different number of heads and tails by chance in two sets of 10 coin flips. The more people (the more times you flip the coin), the better the estimate of the true probability of infection in each group, which is why small clinical trials are less informative than large ones.
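Here is a rough sketch of that imagined experiment, run a handful of times under the assumption that the vaccine does nothing. The function name `simulate_null_trial` is hypothetical, the 8,000-per-arm split is just "half of 16,000," and the infection probability of 66/8000 is my own assumption (the pooled rate implied by 56 + 76 infections), not a figure from the paper.

```python
import random

def simulate_null_trial(n_per_arm=8000, p_infect=66 / 8000, rng=None):
    """Simulate one trial in which both arms share the same infection risk.

    Each person is a biased coin flip that comes up "infected" with
    probability p_infect. Returns (infections_in_vaccine_arm,
    infections_in_placebo_arm).
    """
    rng = rng or random.Random()
    vaccine = sum(rng.random() < p_infect for _ in range(n_per_arm))
    placebo = sum(rng.random() < p_infect for _ in range(n_per_arm))
    return vaccine, placebo

# Repeat the imaginary trial a few times; the arms differ by chance alone.
rng = random.Random(1)  # fixed seed so the run is repeatable
for trial in range(1, 6):
    v, p = simulate_null_trial(rng=rng)
    print(f"trial {trial}: vaccine arm {v}, placebo arm {p}, difference {p - v}")
```

Even though the two arms are identical by construction, the counts will rarely match exactly. The question a statistical test asks is how often chance alone produces a gap as large as the one actually observed.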

Before getting to the p-value, let me emphasize something Rind mentions parenthetically but is crucially important: we must exclude “any problems with the design or performance of the trial.” The statistical testing assumes an idealized situation, in particular one where there are no errors in how things were set up or the experiment performed (crudely put, you can’t use the equivalent of broken meters and expect sophisticated statistical testing to give you the right answer). The set-up needs to be as much like an idealized coin-tossing-like situation as possible, and departures from it will give incorrect answers. Of course the imagined idealization doesn’t have to be like coin tossing. It could be like picking different colored marbles from multiple urns or tossing a die with 6 sides (or 12 sides or 13 sides) or more elaborately, things that are based on underlying physical principles that tell you what kind of probabilities to expect. The idealized set-up is referred to as the underlying probability model.

With the help of an underlying probability model we can use mathematics to calculate how frequently we would get a difference at least as large as 56 infected in one group versus 76 in the other if the vaccine really did nothing. We express that frequency as a p-value. As Rind correctly puts it, a p value of .04 means that if we did this over and over again and the vaccine really did nothing we could still expect a difference as big or bigger than 56 vs 76 to occur only about 4% of the time. Since it is a frequentist probability, it requires imagining performing the trial over and over again. We don’t need to do this in reality if we have an underlying probability model because we can calculate what would happen if we *did* do it in reality. For example, if we assumed a coin was fair or say biased to come up heads 60% of the time or whatever we might choose, we could use mathematics to calculate how likely 6 heads out of 10 would be without actually doing the coin flipping. That’s what a statistical test is doing in the vaccine trial but it requires probability model assumptions so we don’t have to do the trial over and over again in the real world to get the probabilities in the two groups (remember what probability means to a frequentist).
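The coin calculation can be done in a few lines using the binomial formula. The second function below is my own simplified stand-in for a statistical test, not the test used in the paper (which I haven't identified): it assumes two equal-sized arms, so that under "the vaccine does nothing" each infection is equally likely to land in either arm. Both function names are hypothetical. Because it is a different calculation on the quoted counts, don't expect it to reproduce the published p-value.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# How likely is 6 heads out of 10, under two different assumed coins?
print(f"fair coin:       {binom_pmf(6, 10, 0.5):.3f}")  # about 0.205
print(f"60%-heads coin:  {binom_pmf(6, 10, 0.6):.3f}")  # about 0.251

def two_sided_p_equal_arms(a, b):
    """Simplified two-sided p-value for infection counts a and b.

    ASSUMPTION: equal-sized arms, so under the null hypothesis each of the
    a + b infections falls in either arm with probability 1/2. The p-value
    is the chance of a split at least as lopsided as the one observed.
    """
    n = a + b
    smaller = min(a, b)
    one_tail = sum(binom_pmf(k, n, 0.5) for k in range(smaller + 1))
    return min(1.0, 2 * one_tail)  # by symmetry the other tail is equal

print(f"56 vs 76 under this simplified model: p = {two_sided_p_equal_arms(56, 76):.3f}")
```

The point is not the particular number but that nothing here required running a trial repeatedly: once the probability model is assumed, the "over and over again" is done by arithmetic.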

Notice how many things are at play here. There is the correct identification of a population all of whom are at risk of developing HIV, with essentially the same risk for everyone (there are sophisticated ways to adjust for differences in people but let’s keep it simple here; while adjustments can be made, they require additional assumptions so you haven’t eliminated them, just used different ones). Said another way, you correctly identified the at-risk population and the underlying probability model that lets you make the test calculation. In addition, you didn’t make any systematic mistakes (something that affected one group more than the other) in whom you chose to study, which ones to get the vaccine, how you gave them the vaccine, how you diagnosed HIV, etc., etc. In other words, performance of the study had no systematic errors (if the errors were random you would just need more people for an equally valid study). Finally, you are assuming the (frequentist) notion of probability is the right one. There are some real intellectual battles here, not the least of which involves the idea of that infinite population you had to imagine. I’ll come back to these battles shortly.

The interpretation in this case is that the difference that was seen would be pretty uncommon if the probability of HIV infection were the same in both groups, where uncommon means only 4% of the time. It could happen but isn’t likely (a Bayesian would say he/she wouldn’t bet on it without some pretty good odds). But as Rind points out there are other complications here. There were those excluded people. This gives rise to all sorts of arguments over whether they should have been included in the analysis (called an “intention to treat” analysis because the intention was to vaccinate and count them until it was discovered they already were HIV positive) and whether doing several analyses where they were excluded or included means that you have to alter the calculation of the p-value because you are now making multiple comparisons and that changes things. For some of these problems there are no clear-cut answers, which is another reason, among many, why subjecting something to a randomized clinical trial (RCT) is not the answer to everything. RCTs are difficult to do and full of pitfalls in execution, analysis and interpretation.
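A small calculation shows why multiple comparisons "change things." It assumes the analyses are statistically independent, which is my simplification: analyses that include or exclude a few people from the same trial overlap heavily and are far from independent, so the real inflation would differ. The function name is hypothetical.

```python
def chance_of_any_false_alarm(n_tests, alpha=0.05):
    """Chance that at least one of n_tests comes out "significant" at level
    alpha when there is no real effect anywhere.

    ASSUMPTION: the tests are independent of one another.
    """
    return 1 - (1 - alpha) ** n_tests

for k in (1, 2, 3, 5, 10):
    print(f"{k:>2} analyses: {chance_of_any_false_alarm(k):.3f}")
```

With one analysis the chance of a false alarm is the advertised 5%; with three independent looks it is already about 14%. That is the worry behind the argument over whether the p-value needs adjusting.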

Still, the logic of these studies is fairly simple: either a particular hypothesis is false (e.g., that there is no difference between vaccinated and unvaccinated groups); OR, something unusual has occurred. p-values are used to standardize studies so we can compare different reports. p-values that are small signify unusual events, where the definition of “unusual” is left to convention. In medical studies 5% is the most common definition, although in physics 10% is also frequent. If the outcome of the experiment is considered unusual in this sense, we say it is “statistically significant.”

Unfortunately the word “significant” in “statistically significant” is frequently misunderstood in the more colloquial sense of “important.” It doesn’t mean that. It just means that something unusual occurred relative to what you thought would occur, given a particular hypothesis (e.g., that the vaccine doesn’t make a difference). It’s just the beginning, not the end of the process of putting the evidence from the study into some kind of context. That’s where much of Rind’s two posts concentrate their attention, and there is much to say about that, too, but this post is already too long. Still, I can’t refrain from making one more comment prompted by his excellent piece.

The fundamental difficulties created by things like the meaning of probability or what randomization does in an RCT (we didn’t talk about that and it is surprisingly controversial) create all sorts of confusions, some not even recognized by investigators adhering to different views or using statisticians who do. This comes as a surprise to most scientists. If you have studied statistics in textbooks you’d never know it, which is why Rind, too, was surprised:

“I don’t usually think of “opinion” and “mathematics” falling in the same sentence, but perhaps that just shows my lack of expertise in theoretical math.”

It’s not so much a matter of expertise as the fact that it is a well-kept secret that statistics is a discipline riven by faction. This isn’t just about frequentists and Bayesians, but Fisherians and Neyman-Pearsonians, logical probabilists, likelihood advocates and more. In his book, *Statistical Inference*, Michael Oakes says this (h/t D.O.):

It is a common complaint of the scientist that his subject is in a state of crisis, but it is comparatively rare to find an appreciation of the fact that the discipline of statistics is similarly strife-torn. The typical reader of statistics textbooks could be forgiven for thinking that the logic and role of statistical inference are unproblematic and that the acquisition of suitable significance-testing recipes is all that is required of him. (Oakes, Statistical Inference, Epidemiological Resources, Inc., Chestnut Hill 1990)

Oakes goes on to quote a book review of a statistics text in a technical journal (the reviewer is Dusoir):

“A more fundamental criticism is that the book, as almost all other elementary statistics texts, presents statistics as if it were a body of coherent technical knowledge, like the principles of oscilloscope operation. In fact statistics is a collection of warring factions, with deep disagreements over fundamentals, and it seems dishonest not to point this out.”

Both of Rind’s posts bring up a large number of interesting issues like this. Read them here and here (links repeated for your convenience). At some point I hope I’ll have time to take a further look at them.