Essential reading: Why prior probability is important in considering the results of clinical trials of so-called "complementary and alternative medicine"

I've become known as an advocate for evidence-based medicine (EBM) in the three years since I started this little bit of ego gratification known as Respectful Insolence™. One thing this exercise has taught me that I might never have learned before (and that I've only reluctantly begun to accept as true) is one huge problem with EBM. Not surprisingly, it has to do with how EBM is used to evaluate so-called "complementary and alternative medicine" (CAM) therapies, many of which are highly implausible on a scientific basis, to put it exceedingly generously.

Consider homeopathy, which on a strictly physical basis is about as implausible as it gets based on very well-established science. There's almost no doubt on a scientific basis that the concepts behind homeopathy are nothing more than magical thinking writ large and then justified with all sorts of pseudoscientific mumbo jumbo invoking everything from the "memory of water" to ridiculous torturings of quantum mechanics that make me wonder just what these people are smoking. However, it's not hard to find seemingly positive studies suggesting that homeopathy actually does something more than provide a bit of dihydrogen monoxide to the body. True, the better the study, the more likely it is to be negative, with no efficacy shown greater than placebo, but there are some seemingly well-designed studies that purport to show an effect. Granted, the effect is always small and I've yet to see any scientifically convincing reports of homeopathy curing cancer or other non-self-limited diseases, but that doesn't stop the homeopaths.

Or the other purveyors of antiscientific woo, for that matter.

John Ioannidis has, to a great extent, explained why there are so many positive trials of CAM therapies of exceedingly low scientific plausibility. Basically, it has to do with prior probability: the lower the prior probability that the hypothesis being tested in a clinical trial is true, the more likely there are to be false positive trials, far more than the expected 5% false positives that would be expected under ideal conditions using a p-value of 0.05 as the cutoff for statistical significance. Even being aware of this problem, we as advocates of science- and evidence-based medicine have a hard time swatting down individual studies. To the layperson, saying that we must evaluate the totality of the literature is an unsatisfying response to the seemingly positive studies that homeopaths, for example, routinely like to cherry pick.

As I've come to realize, the elephant in the room when it comes to EBM is that it relegates basic science and estimates of prior probability based on that science to one of the lowest forms of evidence, to be totally trumped by clinical evidence. This may be appropriate when the clinical evidence is very compelling and shows a very large effect; in such cases we may legitimately question whether the basic science is wrong. But such is not the case for homeopathy, where the basic science evidence is exceedingly strong against it and the clinical evidence, even from the "positive" studies, generally shows small effects. EBM, however, tells us that this weak clinical evidence must trump the very strong basic science, the problem most likely being that the originators of the EBM movement never saw CAM coming and simply assumed that supporters of EBM wouldn't waste their time investigating therapeutic modalities with an infinitesimally small prior probability of working. But CAM did infiltrate academic medicine, and investigators do investigate such highly unlikely claims. So what to do?

Dr. Kimball Atwood IV explains one proposed solution: the adoption of Bayesian inference for evaluating clinical trial data over the "frequentist" statistical evaluations of clinical trials that have been dominant throughout the careers of every physician now alive. Money quote:

If the prior probability of a hypothesis is small, it will require a large amount of credible, confirming data to convince us to take it seriously. If the prior probability is exceedingly small, it will require a massive influx of confirming data to convince us to take it seriously (yes, extraordinary claims really do require extraordinary evidence).

Which is what I've been saying all along with respect to homeopathy, reiki, distance healing, and many other CAM therapies that, on a physical, scientific basis, are exceedingly unlikely to work.
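To put some rough numbers on that idea, here's a minimal sketch (the 80% power and the p < 0.05 threshold are purely illustrative assumptions, not figures from Dr. Atwood's post) of how many consecutive "positive" trials it would take to drag a small prior probability up past 95%:

```python
# Illustrative only: how many consecutive "significant" trials does it take to
# push a small prior probability of a therapy working above 0.95?  The 80%
# power and alpha = 0.05 figures are assumptions for the sake of the example.

def trials_needed(prior, target=0.95, power=0.80, alpha=0.05):
    p, n = prior, 0
    while p < target:
        # Bayes' theorem: update after one more positive trial
        p = power * p / (power * p + alpha * (1 - p))
        n += 1
    return n

for prior in (0.5, 0.1, 0.01, 1e-4, 1e-8):
    print(f"prior {prior:g}: {trials_needed(prior)} consecutive positive trials needed")
```

Under those assumed numbers, each "positive" trial only multiplies the odds by power/alpha = 16, which is why a handful of marginally positive studies does essentially nothing to rescue a hypothesis whose prior probability is vanishingly small.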

I encourage skeptics and CAM advocates alike to read Dr. Atwood's post in its entirety and comment.

You've got your explanation about prior probability the wrong way round. The P-value is by definition the probability of a positive result if the null hypothesis were true. The probability of false positives can't be higher than this.

Rather, the point is that you could achieve a "significant" P-value yet still have the null hypothesis be the best explanation for the observed data, because no other plausible hypothesis explains the data very well either - either because all other hypotheses have low prior probabilities, or because all other hypotheses have low likelihoods.

By Sean Eddy (not verified) on 15 Feb 2008 #permalink

Aaah, the Angels rejoice as another soul sees the Bayesian light.

More seriously, Bayesian methods are now allowable for medical devices, but the NIH hasn't yet seen the light about clinical trials. In fairness, there are problems that need to be sorted out - and I think the NIH are right to be conservative - but there are practical advantages for a Bayesian approach, particularly in Phase I and Phase II trials. Sequential trials are also easier in a Bayesian context, and these can reduce the sample size and time needed.

Bob

Unfortunately, there are a lot of critics of Bayesian inference. While these critics are often wrong, they can sound very, very credible because statistics, especially Bayesian statistics, is (are?) not intuitive. It takes even mathematicians a while to get a handle on Bayesian statistics.

By David C. Brayton (not verified) on 15 Feb 2008 #permalink

At some point empiricism should trump theory (as it should have with hand-washing in the 19th century). But put simply, if there are 100 studies of some alternative therapy, then by chance 5 will seemingly have statistical significance at the 95% level.

It's actually much higher than 5% if prior probability is taken into account. See these three posts:

1. Why Most Published Research Findings are False
2. The cranks pile on John Ioannidis' work on the reliability of science
3. Are Most Medical Studies Wrong?

For studies looking at hypotheses with an estimated low prior probability of being true, noise predominates, and considerably more than 5% will provide "statistically significant" evidence that the hypothesis is true. That's under ideal circumstances. In the messy real world, where poor study design and/or bias can creep in, there are even more false positives.
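To spell that out with a minimal sketch (the 80% power figure is purely illustrative), the quantity at issue is the fraction of "significant" results that turn out to be false positives:

```python
# Fraction of "statistically significant" results that are false positives,
# as a function of the prior probability that the hypothesis being tested is
# true.  The 80% power figure is an illustrative assumption.

def false_positive_fraction(prior_true, power=0.80, alpha=0.05):
    true_pos = power * prior_true          # real effects correctly detected
    false_pos = alpha * (1 - prior_true)   # null hypotheses wrongly "confirmed"
    return false_pos / (true_pos + false_pos)

for prior in (0.5, 0.1, 0.01, 0.001):
    frac = false_positive_fraction(prior)
    print(f"prior {prior:g}: {frac:.0%} of significant results are false positives")
```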

No, the false positive probability per trial at a P-value of 0.05 is less than or equal to 5%, by definition.

If the null hypothesis is true (with prior probability 1.0), then at a P-value threshold of 0.05, you'll get a positive result 5% of the time. And all of them will be wrong; the false positive probability is therefore 5%.

If the null hypothesis has a prior of 0.5, and if we assume the test hypothesis has 100% sensitivity (no false negatives), then at a P-value of 0.05, we get a positive result 52.5% of the time: 50% true positives, 2.5% false positives, and 47.5% true negatives. As the null model's prior decreases from 1.0, the false positive probability per trial decreases from 0.05.
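To make that arithmetic concrete, a minimal sketch using the same assumed numbers (a 0.5 prior on the null, alpha = 0.05, 100% sensitivity):

```python
# Checking the arithmetic of the example above: a 0.5 prior on the null,
# alpha = 0.05, and (by assumption) 100% sensitivity for the alternative.
p_null, alpha = 0.5, 0.05

true_positives  = (1 - p_null) * 1.0        # alternative true, always detected
false_positives = p_null * alpha            # null true, but P < 0.05 anyway
true_negatives  = p_null * (1 - alpha)      # null true, correctly retained

print(true_positives, false_positives, true_negatives)  # 0.5 0.025 0.475
print(true_positives + false_positives)                 # 0.525 positive overall
```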

I think what you're thinking of is the case where we only report "significant" results. Conditional on having obtained a "significant" P-value, in the first case, all of our 5% positives are false, so just among them, we have 100% false positives. Yes, the false positive rate per significant trial can be as high as you want, depending on the priors and the likelihood of the competing hypotheses.

By Sean Eddy (not verified) on 15 Feb 2008 #permalink

It's not a question of Bayesian vs. frequentist statistics. It's a question of cost functions. A statistical procedure is a mapping from a space of possible data to a space of possible decisions/inferences. Our task is to choose a defensibly good procedure. Good in what sense? If our data comes from a probability distribution with some parameters A, then we define a function on A and the decisions. The cost of a procedure at A is the value of this function on A and the decision assigned to it by that procedure. But because function spaces are large, you can't get unique optimality from just a cost function. That's where criteria like maximum likelihood, Bayesian optimality, minimax, etc. come into play. The nicest description of all this is Kiefer's 'Introduction to Statistical Inference' --- a marvelous book except that his notation is Baroque.

The problem with Bayesian statistics is always calculating a prior in a justifiable manner. This is a show stopper in most cases. The classic example is if I am measuring a physical parameter, what prior do I assign to it? Uniform? What if my neighbor measures the logarithm of that parameter instead? If you are measuring where a fairly random spinning wheel stops, you may be able to invoke symmetry to use a uniform distribution for the angle at which it stops instead of the logarithm of that angle, but in general what do you do?

However, say you have a Bayesian prior with two possible outcomes, of probability p and (1-p). If the statistical procedure is monotonically increasing in p, that is, your posterior probability grows as p grows, then you may be able to find a justifiable upper bound on p even if you cannot correctly calculate it.

Bayesian procedures occupy a unique place in statistics which is often overlooked. When trying to select a procedure, we ignore all those which are uniformly worse (according to our cost function) than some other procedure. Those that are at least as good as any other procedure for some experimental outcome and underlying distribution are what we want. This class is difficult to characterize, but the class of all Bayesian procedures contains that class, is not much larger, and is much easier to manipulate since it's the same as the function space of priors. So in many cases you can attempt to construct a Bayesian procedure to satisfy other criteria. Then there is no question of justifying your prior: it's just a mathematical nicety.

To finally get to the point, classical statistical procedures estimating a parameter x from an estimator y on the data result from the cost function (x-y)^2. For the case of clinical trials this says that we want to be able to measure any difference equally well. Instead what we want in cases like this is a function more like (x-y)^2/x^2, which imposes a much larger cost on making mistakes around zero, but lets procedures be sloppier about measuring exactly how good a drug is if it really does make a difference. I don't propose that exact form for real use. Ideally you would use a procedure which is robust to cost functions of roughly that shape.
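A toy illustration of the difference between those two cost functions (the specific numbers are arbitrary, and, as noted, the exact form isn't proposed for real use):

```python
# Toy comparison of the two cost functions mentioned above: plain squared
# error (x - y)^2 versus the relative form (x - y)^2 / x^2.  The same absolute
# error of 0.1 is penalized far more heavily when the true effect x is near zero.
def squared_error(x, y):
    return (x - y) ** 2

def relative_squared_error(x, y):
    return (x - y) ** 2 / x ** 2

for x in (2.0, 1.0, 0.2, 0.05):
    y = x + 0.1   # estimate off by 0.1 in every case
    print(f"true effect {x:>4}: squared {squared_error(x, y):.4f}, "
          f"relative {relative_squared_error(x, y):.2f}")
```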

The 5% chance does not apply to studies really, but to outcome measures. So in a study with 100 outcome measures, you'd expect 5 false positives by mere chance. This was nicely illustrated by an incredibly complex CDC study on thimerosal and autism.
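A rough sketch of that arithmetic (assuming, for simplicity, 100 independent outcome measures and no true effects):

```python
# With 100 independent outcome measures and no true effects, how many false
# positives should we expect at alpha = 0.05, and how likely is at least one?
# (The independence assumption is a simplification.)
n_outcomes, alpha = 100, 0.05

expected_false_positives = n_outcomes * alpha
p_at_least_one = 1 - (1 - alpha) ** n_outcomes

print(f"expected false positives: {expected_false_positives:.0f}")            # 5
print(f"chance of at least one 'significant' outcome: {p_at_least_one:.3f}")  # ~0.994
```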

So I wonder not only to what extent unsuccessful research is never published, but also to what extent unexpected outcome measures are never brought forth in a given paper. The scientific literature must obviously have a lot more random noise than one would hope for.

Again trying to keep it simple: by definition, 1 in 20 studies of bogus therapies will show statistically significant results at the 95 percent level. The percentage of published studies with false positive results will be higher due to the selection bias of the publication process. Estimating this is where Bayes comes into play.

You're correct but that's not the point I'm trying to make.

When Orac says "far more than the expected 5% false positives that would be expected", he's apparently misunderstanding what a P-value means, and how a prior would be applied. If we're talking about the probability of false positives in studies that have achieved statistical significance at P=0.05, there is no reason to "expect" 5% false positives amongst only those significant studies, whether you're a Bayesian or not. The P-value of 0.05 was already applied; it's the probability of false positives over all studies, and it has essentially nothing to do with the probability of false positives amongst the positive studies.

Indeed, a careful frequentist statistician, if asked to state the probability of false positives among the positive studies, would say up to 100% of them (because he's trying to avoid having to state a subjective Bayesian prior!). That's why you want to choose a small P-value in the first place. The P-value is by definition the maximum false positive error you're willing to tolerate. You want it to be so small that you can convincingly exclude the null hypothesis, _regardless_ of what the priors might be.

The problem here isn't a problem with frequentist vs. Bayesian stats; it's a problem with folks not understanding what a "significant" P-value means, especially in a research environment where thousands of studies are being carried out, and only the "significant" ones published. P=0.05 is a really crappy choice for "significance", because 0.05 false positives per study * N studies is a lot of false positives.
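A toy simulation of that publication filter (the numbers are made up; every simulated therapy is assumed to do nothing):

```python
import random

# Toy simulation of the publication filter: 1,000 studies of therapies that
# truly do nothing, with only the "significant" ones getting written up.
# The numbers are made up purely for illustration.
random.seed(0)
alpha, n_studies = 0.05, 1000

# When the null hypothesis is true, the P-value is uniform on (0, 1).
p_values = [random.random() for _ in range(n_studies)]
published = [p for p in p_values if p < alpha]

print(f"{len(published)} of {n_studies} studies reach P < {alpha} and get published,")
print("and every single one of them is a false positive.")
```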

By Sean Eddy (not verified) on 16 Feb 2008 #permalink

I would agree. Once one understands that there is a selection bias in the publication of research, I do not know what additional value a Bayesian estimate of the likelihood that a given result is a false positive would add, as it relies on a subjective estimate of the selection bias and would only serve to add a veneer of calculation to what is a subjective judgement. At the end of the day, multiple studies and a good theoretical model are what overcome this problem, not Bayes' theorem.

Orac points to

Why most published research findings are false

This in turn points to Why Most Published Research Findings Are False, John P. A. Ioannidis, PLoS Med. 2005 August; 2(8): e124.

To which I say "Bah! Humbug!" The real title of this paper should have been "Why most microarray studies of gene involvement with disease are probably artefact, and why simplistic extrapolations of epidemiological studies are likely to come a cropper". Last time I looked, epidemiology plus microarray studies did not constitute the majority of medical (or biomedical) research. So, to start off with, the title is false.

Even if we relabel the title to cover the fields actually mentioned, are most of them actually false? Let's look at the epidemiology. Ioannidis states

Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1-3]

Let's look at one of the papers cited [2, Lawlor DA et al., Those confounded vitamins: What can we learn from the differences between observational versus randomised trial evidence? Lancet. 2004;363:1724-1727]. This paper summarises the fact that there are several epidemiological studies showing an association between heart disease and antioxidant vitamin levels. However, all intervention studies that have tried to reduce heart disease by increasing vitamin levels have had no effect (or even produced adverse outcomes; a similar effect has been seen with cancer prevention studies). Is the issue here one of hypothesis testing, with the clinical study false because it failed to take into account priors? Or is the mismatch telling us something valuable?

Consider the case of folate and cancer. Epidemiology showed that people with leukaemia had low folate levels. Folate was known to be important for health, so it was hypothesised that giving people with cancer folate supplements would increase their survival. In fact it made the patients die faster, as the folate was being used by the tumour cells. A whole class of anti-cancer therapies based on blocking folate utilization grew out of these findings. The problem was not that one of the observations was "wrong", or that the wrong Bayesian priors were being used. Both studies were correct; rather, the hypothesis linking low folate levels and cancer survival was wrong. This is something that would not be addressed by Bayesian priors at all.

Let's return to the antioxidants. The epidemiology shows that high levels of antioxidants are correlated with lower levels of cardiovascular disease, and a whole host of basic science supports this (cell culture, tissues and many whole-animal studies). Using Bayesian logic, as advocated by Ioannidis, we should adjust the probability of the "failing" clinical studies, some of which would become positive as a result (note this is exactly the situation described in Orac's article about CAM: we modify the prior probability of the clinical study in the light of basic science). However, the failing studies are almost certainly telling us something important about the complex role of antioxidant status in cardiac (or any other) disease. The association between antioxidant levels and cardiac disease may be an indicator of some other causative process that will not be affected by the disease; there may be a narrow "window" when intervention with antioxidants is possible, and so on.

Again, it is important to point out that using Bayesian priors to "correct" the probabilities of the antioxidant trials in the light of epidemiological and basic science evidence (and the basic science data is very compelling) would at best have no effect, with the controversy remaining, or at worst cover up an important data mismatch that tells us something of physiological importance (compare with the folate and cancer mismatch, which led to a new treatment modality). There is a basic confusion in Ioannidis's logic here. A mismatch between a randomised clinical trial and epidemiological and basic science does not imply that one or the other is false; both may be correct and uncover new, unsuspected biology.

Another example is the red wine antioxidant resveratrol. As an antioxidant, it should have no effect on things like cardiovascular disease, as its in vivo concentrations are too low with respect to its observed epidemiological effects. Using Bayesian priors in early research, on the same basis that has been proposed to exclude CAM effects, would have led us to exclude resveratrol. Yet resveratrol appears to have real effects, mediated by mechanisms that were not previously clear (steroid receptors, sirtuins, etc.).

I am not for a moment claiming CAM has any evidential basis, but I am pointing out that Bayesian analysis is not the panacea it has been claimed to be, and the claims for Bayesian analysis are somewhat exaggerated. I'd like to see someone do a real Bayesian analysis of the trial data for antioxidants, in the light of the epidemiological and basic science evidence, not the toy analyses that are done to promote the use of Bayesian statistics, and explain how a Bayesian analysis helps (and how you objectively convert the in vivo and in vitro animal and tissue data into a sensible prior).

As for the claim that

... false findings may be the majority or even the vast majority of published research claims [6-8]

I am quite open to the idea that genetic associations, be they generated by microarrays or otherwise, are largely wrong. However, they are most emphatically not "most modern research". Again, I suggest that here too the issues lie with mundane things like the reliability of the arrays, population sampling and other methodological issues that have no bearing on the Bayesian vs frequentist debate.

Bayesian analysis may, or may not, help us understand the occasional "positive" result in testing CAM modalities, but given the hype, and the frankly incorrect claims that some Bayesian promoters make, it will be hard to tell until someone comes up with a robust methodology applicable to real-world examples. Patently ridiculous claims that "most research findings are false", and confused thinking about what it means for a study to be "false", cannot help this debate.