A couple of weeks ago, Jonah Lehrer wrote about the Decline Effect, wherein support for a scientific claim tends to decrease or even disappear over time (ZOMG! TEH SCIENTISMZ R FALSE!). There’s been a lot of discussion explaining why we see this effect and how TEH SCIENTISMZ are doing ok: PZ Myers gives a good overview, and David Gorski and Steven Novella also address this topic. Monday, a long-time reader asked me about the article, and I raised another issue that hadn’t–or so I thought–been broached:
There’s also a sample size issue: if the effect is weak and sample sizes aren’t that large, any statistically significant result will likely be spurious rather than real (i.e., the result you see is probably a statistical fluke, not the biology, since the biology would produce a much weaker difference than you could statistically detect). The way around this is either retrospective power tests (i.e., figuring out what effect you could actually detect and asking whether your result makes any sense biologically) or a Bayesian approach (i.e., given a prediction of a weak effect, I would expect to see outcome X; the only problem here is that your prior needs to be appropriate).
Fortunately, by way of a follow-up piece by Lehrer, I stumbled across a similar explanation by Andrew Gelman. Unfortunately, Lehrer doesn’t really describe what Gelman is getting at.
Gelman (and he has some good slides over at his post) is claiming, correctly, that if the effect is weak and you don’t have enough samples (e.g., subjects enrolled in the study), any statistically significant result will be so much greater than what the biology would provide that it’s probably spurious. You might get lucky and have a spurious result that points in the same direction as the real phenomenon, but that’s just luck.
Let me give an example. If I flip an evenly-weighted coin 100 times, 95% of the time (96.5%, to be precise) I will see 40-60 heads. So if I flip the coin 100 times, and get 55 heads and 45 tails, I can’t conclude that my coin is biased. Now suppose I nick the coin to give it a slight bias: 51% heads, 49% tails. Let’s say this time I also wind up with 55 heads and 45 tails. I still can’t conclude that the coin is biased*. If I flipped the coin eleventy gajillion times (i.e., lots), and wound up with a ratio of 55 heads:45 tails, I could conclude that the coin is biased. But in the first case of 100 flips of a slightly biased coin, there is a ~3% chance that I could wind up with 61 heads or more**, a result that also happens to be statistically significant. But that’s a fluke, and one that grossly overstates the true 51/49 bias: most of the time, the coin will show only a weak excess of heads, about which I will be unable to conclude anything (other than that my thumb, after flipping a coin a bunch of times, gets very tired).
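If you want to see this for yourself, here’s a quick simulation (a stdlib-only Python sketch; the seed, batch count, and choice of an exact binomial test are mine, not anything from Lehrer or Gelman):

```python
# Flip a 51/49 coin in batches of 100 and apply an exact two-sided
# binomial test of fairness, at the 5% level, to each batch.
import random
from math import comb

N = 100
PMF = [comb(N, k) * 0.5**N for k in range(N + 1)]  # null (fair-coin) distribution

def pvalue(heads):
    """Exact two-sided test: total probability, under a fair coin,
    of outcomes no more likely than the one observed."""
    return sum(p for p in PMF if p <= PMF[heads])

random.seed(1)
runs = 10_000
batches = (sum(random.random() < 0.51 for _ in range(N)) for _ in range(runs))
significant = [h for h in batches if pvalue(h) < 0.05]

frac_sig = len(significant) / runs
frac_wrong = sum(h < 50 for h in significant) / len(significant)
right_way = [h for h in significant if h > 50]
print(f"batches reaching significance:        {frac_sig:.1%}")
print(f"...that point in the wrong direction: {frac_wrong:.0%}")
print(f"mean heads among the rest:            {sum(right_way)/len(right_way):.1f}")
```

Across many runs, only about 4% of batches reach significance at all; the ones that do imply a bias of at least 61/39–more than ten times the coin’s true one-point edge–and over a quarter of them point the wrong way entirely.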
In the coin-flipping example above, I know a priori it’s a weak effect. But the problem with real-life experimentation is that we often don’t have any idea what the outcome should look like. Do I have the Bestest Cancer Drug EVAH!, or simply one that has a small, but beneficial, effect? If you throw in a desire, not always careerist or greedy (cancer does suck), to find or overestimate a large effect, the healthy habit of skepticism is sometimes observed in the breach. Worse, if you’re ‘data-mining’, you often have no a priori assumptions at all!
Note that this is not a multiple-comparisons or ‘p-value’ issue–the point isn’t that sometimes you’ll get a significant result by chance. The problem has to do with detection: with inadequate study sizes and weak effects, anything you can detect is spurious, albeit sometimes fortuitously pointed in the right direction.
So what someone will do is report the statistically significant result (since we tend not to report the insignificant ones). But further experiments, which often aren’t well designed either, fail to pick up an effect. The experiments that are well designed and have a large sample size will either identify a very weak real effect, leading to a consensus in the field of “Meh”, or correctly fail to find a non-existent effect.
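The same coin can mimic this whole cycle (again a toy Python sketch with made-up sample sizes, not anyone’s real data): ‘publish’ only the batches of flips that show a significant excess of heads, then replicate each published result at the same inadequate sample size.

```python
# Toy model of the significance filter: report only batches of 100 flips
# of a 51/49 coin that show a significant excess of heads, then rerun
# each reported study at the same (inadequate) sample size.
import random
from math import comb

N = 100
PMF = [comb(N, k) * 0.5**N for k in range(N + 1)]  # null (fair-coin) distribution

def pvalue(heads):
    """Exact two-sided binomial test against a fair coin."""
    return sum(p for p in PMF if p <= PMF[heads])

def study(p=0.51):
    return sum(random.random() < p for _ in range(N))

random.seed(2)
published, replications = [], []
while len(published) < 200:
    heads = study()
    if heads > 50 and pvalue(heads) < 0.05:  # the result that gets reported
        published.append(heads)
        replications.append(study())         # the follow-up study

mean_pub = sum(published) / len(published)
mean_rep = sum(replications) / len(replications)
rep_sig = sum(h > 50 and pvalue(h) < 0.05 for h in replications) / len(replications)
print(f"published effect:   {mean_pub:.1f} heads per 100")
print(f"replication effect: {mean_rep:.1f} heads per 100")
print(f"replications significant in the same direction: {rep_sig:.0%}")
```

On a typical run, the published studies average around 61-62 heads per 100, the replications drift back to about 51, and only a few percent of them reach significance again. Sounds like a decline.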
Sounds like the Decline Effect to me.
Do your power calculations, kids….
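For the coin example, that calculation is sobering (a back-of-the-envelope, normal-approximation sketch; the 80% power target and the function name are my choices):

```python
# How many flips would I need for an 80% chance of detecting a 51/49
# coin at the two-sided 5% level? Standard normal-approximation power
# formula for a one-sample proportion test, using only the stdlib.
from math import sqrt
from statistics import NormalDist

def flips_needed(p1, p0=0.5, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return (num / (p1 - p0)) ** 2

print(round(flips_needed(0.51)))  # roughly 20,000 flips
```

Nearly 20,000 flips to reliably catch a 51/49 coin–and that’s with the luxury of knowing the effect size in advance.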
*Likelihood and Bayesian methods will also fail in this particular example.
**There is also a ~1% chance that I will observe 39 or fewer heads. This means that roughly four percent of the time, I will conclude that the coin is biased, but, in those cases, more than a quarter of the time, I will think the coin is biased in the wrong direction.