In typical fashion, no sooner do I declare a quasi-hiatus than somebody writes an article that I want to say something about. For weeks, coming up with blog posts was like pulling teeth, but now that I’m not trying to do it, it’s easy…
Anyway, that’s why there’s the “quasi-” in “quasi-hiatus,” and having been reasonably productive in the early bit of the weekend, I have a few moments to comment on this column by Ben Goldacre about bad statistics in neuroscience. It seems lots of researchers are not properly assessing the significance of their results when reporting differences between measured quantities:
How often? Nieuwenhuis looked at 513 papers published in five prestigious neuroscience journals over two years. In half the 157 studies where this error could have been made, it was. They broadened their search to 120 cellular and molecular articles in Nature Neuroscience, during 2009 and 2010: they found 25 studies committing this fallacy, and not one single paper analysed differences in effect sizes correctly.
These errors are appearing throughout the most prestigious journals for the field of neuroscience. How can we explain that? Analysing data correctly, to identify a “difference in differences”, is a little tricksy, so thinking generously, we might suggest that researchers worry it’s too longwinded for a paper, or too difficult for readers. Alternatively, less generously, we might decide it’s too tricky for the researchers themselves.
But the darkest thought of all is this: analysing a “difference in differences” properly is much less likely to give you a statistically significant result, and so it’s much less likely to produce the kind of positive finding you need to look good on your CV, get claps at conferences, and feel good in your belly. Seriously: I hope this is all just incompetence.
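For anyone who wants the error spelled out: as I understand the study, the mistake is concluding that two effects differ because one of them is "significant" and the other is not, without ever testing the difference directly. Here is a minimal sketch of how that can go wrong. The effect sizes and standard errors are made-up numbers for illustration only, and the calculation assumes independent, normally distributed estimates.

```python
from math import erfc, sqrt

def two_sided_p(estimate: float, std_err: float) -> float:
    """Two-sided p-value for an estimate, using a normal approximation."""
    z = abs(estimate) / std_err
    return erfc(z / sqrt(2.0))

# Hypothetical numbers, not taken from any real study.
effect_a, se_a = 25.0, 10.0   # effect measured in group A
effect_b, se_b = 10.0, 10.0   # effect measured in group B

p_a = two_sided_p(effect_a, se_a)
p_b = two_sided_p(effect_b, se_b)

# The correct comparison tests the difference itself.
# Standard errors add in quadrature, assuming the groups are independent.
diff = effect_a - effect_b
se_diff = sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff, se_diff)

print(f"group A:    p = {p_a:.3f}")
print(f"group B:    p = {p_b:.3f}")
print(f"difference: p = {p_diff:.3f}")
```

With these numbers, group A clears the 0.05 bar (p of roughly 0.01), group B does not (roughly 0.32), and the difference between them is nowhere near significant (roughly 0.29). Saying "A responded and B didn't, so A and B differ" is exactly the claim the data don't support.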
That’s fairly bad, but the details of this aren’t what I want to talk about. What struck me about this was that it’s really an example of a larger problem, namely the fetishization of “statistical significance.” Which is a bad thing, because in the end, statistical significance is a wholly arbitrary matter of convention.
Goldacre’s column, and the study he’s writing about, take issue with researchers claiming findings are “statistically significant” when they haven’t done the proper tests to show that. For those not in the business, “statistically significant” is a term of art meaning that the measured effect meets a particular mathematical threshold. It varies a little among fields, but in the life sciences, this is generally defined as a p-value less than 0.05, meaning that if there were no real effect, a result at least this extreme would turn up by chance less than 5% of the time.
As you can guess from the clause “it varies a little among fields,” though, this threshold is entirely a matter of convention. The p-value is a continuous variable, and can take on absolutely any value between 0 and 1. The value of 0.05 is just a convenient choice of a dividing line between “significant” and “not significant” results.
And that’s really my gripe with this whole business. Because far too many people writing about this sort of stuff talk about statistical significance as if there were a qualitative difference between results that meet the standard and results that don’t. Which is just nonsense– a result with a p-value of 0.04 is not dramatically better than one with a p-value of 0.06, but one of those is conventionally deemed significant and the other is not.
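To put numbers on how close those two results are, here is a small sketch converting each p-value into the equivalent number of standard deviations. It assumes a two-sided test on a normally distributed quantity, which is my simplification, and the function name is just something I made up for the illustration.

```python
from statistics import NormalDist

def z_for_two_sided_p(p: float) -> float:
    """Number of standard deviations matching a two-sided p-value (normal assumption)."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)

z_04 = z_for_two_sided_p(0.04)
z_06 = z_for_two_sided_p(0.06)

print(f"p = 0.04 -> {z_04:.2f} standard deviations")
print(f"p = 0.06 -> {z_06:.2f} standard deviations")
```

That works out to about 2.05 standard deviations versus about 1.88. The measurements behind those two numbers are nearly indistinguishable, yet one gets a label and the other doesn't.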
Significance is a continuum, and whatever threshold you apply to it is going to be arbitrary. And using an arbitrary threshold as your sole criterion of worth to the point where people are, consciously or subconsciously, tempted to fudge their statistics to meet it is kind of crazy.
What makes the whole thing especially absurd, though, is that I strongly suspect all of these values are almost completely worthless. I say this because of particle physics, where the conventional threshold for claiming a significant result is much higher– the usual language for talking about this is in terms of the number of standard deviations a result is away from the expected value. In those terms, the 5% threshold used in the life sciences is about 2 standard deviations. The standard for claiming detection of a new particle is 5 standard deviations, or a probability of around 0.0000006 that the result occurred by chance. Anybody trying to make a claim based on a 2-standard-deviation measurement would be laughed out of the room. (Yet more evidence of the arbitrariness of the whole business– when the standard for significance varies by five orders of magnitude between fields, something’s fishy.)
The particle physics standard is where it is for good reason, because as we’ve seen several times over the last couple of years, results at the 3-standard-deviation level go away remarkably frequently. In principle, that level of statistical certainty means that only 0.27% of such results should turn out to be chance fluctuations, but in practice, it seems to be more like 60% of them that are bogus.
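If you want to check the conversions between standard deviations and probabilities quoted above, they can be sketched in a few lines. This assumes an ideal Gaussian distribution and counts fluctuations in either direction (two-sided), which is the convention that reproduces the figures I've been quoting.

```python
from math import erfc, sqrt

def two_sided_p(n_sigma: float) -> float:
    """Chance of a Gaussian fluctuation at least n_sigma from the mean, in either direction."""
    return erfc(n_sigma / sqrt(2.0))

for n_sigma in (2, 3, 5):
    print(f"{n_sigma} sigma -> p = {two_sided_p(n_sigma):.2e}")

# How far apart the life-science and particle-physics conventions are.
ratio = two_sided_p(2) / two_sided_p(5)
print(f"2-sigma / 5-sigma ratio: {ratio:.0f}")
```

That gives roughly 4.6% at 2 standard deviations, 0.27% at 3, and about 6 in ten million at 5, with a ratio between the 2-sigma and 5-sigma thresholds in the tens of thousands. Particle physicists often quote the one-sided figure instead, which is half of each of these, but that doesn't change the picture.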
This happens because the effects being studied are very subtle, and analyzing the data requires making some models and assumptions about the operation of the frighteningly complex detectors required in the business. Those models and assumptions are just shaky enough that the level of certainty you would naively expect based on statistical measures doesn’t really work out. Nobody’s being dishonest– the uncertainties they assign are the best they can manage– but at some level they’re dealing with Rumsfeldian “unknown unknowns.” If they had a better idea of what was wrong, they would be able to account for it. As it is, though, some things are beyond their control, and that leads to three-sigma results going *poof* with some regularity.
But as fearsomely complicated as particle physics experiments are, I don’t for a second believe that they’re less well controlled than life-science experiments. And, indeed, you see regular bouts of hand-wringing about how often significant findings go *poof*– see, for example, Derek Lowe on irreproducible drug discovery candidates, or even Jonah Lehrer’s much-derided “decline effect” piece from last year. The simplest explanation for this is probably that, just as in particle physics, the numerical tests of “significance” aren’t really all that meaningful. A lot of results that appear significant in a single experiment turn out to be chance, despite the fact that the uncertainty analysis for that experiment suggested it was a solid result.
As a result, arguments about whether a given result is just above or just below an arbitrary and conventional threshold seem foolish. Doing the calculations wrong is still a major mistake, but whether they’re done correctly or not, we should stop pretending that “statistically significant” is some kind of magic guarantee of quality.