Statistical Significance Is an Arbitrary Convention

By drorzel on September 12, 2011.

In typical fashion, no sooner do I declare a quasi-hiatus than somebody writes an article that I want to say something about. For weeks, coming up with blog posts was like pulling teeth, but now I'm not trying to do it, it's easy...

Anyway, that's why there's the "quasi-" in "quasi-hiatus," and having been reasonably productive in the early bit of the weekend, I have a few moments to comment on this column by Ben Goldacre about bad statistics in neuroscience. It seems lots of researchers are not properly assessing the significance of their results when reporting differences between measured quantities:

How often? Nieuwenhuis looked at 513 papers published in five prestigious neuroscience journals over two years. In half the 157 studies where this error could have been made, it was. They broadened their search to 120 cellular and molecular articles in Nature Neuroscience, during 2009 and 2010: they found 25 studies committing this fallacy, and not one single paper analysed differences in effect sizes correctly.

These errors are appearing throughout the most prestigious journals for the field of neuroscience. How can we explain that? Analysing data correctly, to identify a "difference in differences", is a little tricksy, so thinking generously, we might suggest that researchers worry it's too longwinded for a paper, or too difficult for readers. Alternatively, less generously, we might decide it's too tricky for the researchers themselves.

But the darkest thought of all is this: analysing a "difference in differences" properly is much less likely to give you a statistically significant result, and so it's much less likely to produce the kind of positive finding you need to look good on your CV, get claps at conferences, and feel good in your belly. Seriously: I hope this is all just incompetence.
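To make the "difference in differences" error concrete, here is a minimal Python sketch. The effect sizes and standard errors are made-up illustrative numbers, not values from any of the surveyed papers, and the calculation assumes independent groups and normally distributed estimates.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical measurements (assumed for illustration only).
effect_a, se_a = 25.0, 10.0   # change observed in group A
effect_b, se_b = 10.0, 10.0   # change observed in group B

# Each effect tested on its own against zero.
p_a = two_sided_p(effect_a / se_a)
p_b = two_sided_p(effect_b / se_b)

# The comparison that actually matters: is A different from B?
diff = effect_a - effect_b
se_diff = math.sqrt(se_a**2 + se_b**2)  # assumes independent groups
p_diff = two_sided_p(diff / se_diff)

print(f"A alone:   p = {p_a:.3f}")
print(f"B alone:   p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

With these numbers, A clears the 0.05 bar and B does not, yet the direct test of A against B comes nowhere near it. Reporting "A is significant, B is not, therefore A and B differ" is the fallacy the survey was counting.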

That's fairly bad, but the details of this aren't what I want to talk about. What struck me about this was that it's really an example of a larger problem, namely the fetishization of "statistical significance." Which is a bad thing, because in the end, statistical significance is a wholly arbitrary matter of convention.

Goldacre's column, and the study he's writing about, take issue with researchers claiming findings are "statistically significant" when they haven't done the proper tests to show that. For those not in the business, "statistically significant" is a term of art meaning that the measured effect meets a particular mathematical threshold. It varies a little among fields, but in the life sciences, this is generally defined as a p-value less than 0.05, meaning a less than 5% probability that the result would've occurred by chance.

As you can guess from the clause "it varies a little among fields," though, this threshold is entirely a matter of convention. The p-value is a continuous variable, and can take on absolutely any value between 0 and 1. The value of 0.05 is just a convenient choice of a dividing line between "significant" and "not significant" results.

And that's really my gripe with this whole business. Because far too many people writing about this sort of stuff talk about statistical significance as if there were a qualitative difference between results that meet the standard and results that don't. Which is just nonsense-- a result with a p-value of 0.04 is not dramatically better than one with a p-value of 0.06, but one of those is conventionally deemed significant and the other is not.

Significance is a continuum, and whatever threshold you apply to it is going to be arbitrary. And using an arbitrary threshold as your sole criterion of worth to the point where people are, consciously or subconsciously, tempted to fudge their statistics to meet it is kind of crazy.

What makes the whole thing especially absurd, though, is that I strongly suspect all of these values are almost completely worthless. I say this because of particle physics, where the conventional threshold for claiming a significant result is much higher-- the usual language for talking about this is in terms of the number of standard deviations a result is away from the expected value. In those terms, the 5% threshold used in the life sciences is about 2 standard deviations. The standard for claiming detection of a new particle is 5 standard deviations, or a probability of around 0.0000006 that the result occurred by chance. Anybody trying to make a claim based on a 2-standard-deviation measurement would be laughed out of the room. (Yet more evidence of the arbitrariness of the whole business-- when the standard for significance varies by five orders of magnitude between fields, something's fishy.)
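For anyone who wants to check the conversion between standard deviations and probabilities, a short Python sketch follows. It assumes a two-sided test on a normal distribution, which is the convention that gives the numbers quoted above; the function name is just illustrative.

```python
import math

def two_sided_p(n_sigma):
    """Probability of a normal fluctuation at least n_sigma from the mean,
    in either direction."""
    return math.erfc(n_sigma / math.sqrt(2))

for n in (2, 3, 5):
    print(f"{n} sigma -> p = {two_sided_p(n):.2e}")

# Ratio between the life-science and particle-physics conventions.
print(f"0.05 / p(5 sigma) = {0.05 / two_sided_p(5):.0f}")
```

Two standard deviations comes out a little under 0.05, three at about 0.0027, and five at roughly 0.0000006, so the two conventions sit close to five orders of magnitude apart.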

The particle physics standard is where it is for good reason, because as we've seen several times over the last couple of years, results at the 3-standard-deviation level go away remarkably frequently. In principle, that level of statistical certainty means that only 0.27% of such results should turn out to be chance fluctuations, but in practice, it seems to be more like 60% of them that are bogus.

This happens because the effects being studied are very subtle, and analyzing the data requires making some models and assumptions about the operation of the frighteningly complex detectors required in the business. Those models and assumptions are just shaky enough that the level of certainty you would naively expect based on statistical measures doesn't really work out. Nobody's being dishonest-- the uncertainties they assign are the best they can manage-- but at some level they're dealing with Rumsfeldian "unknown unknowns." If they had a better idea of what was wrong, they would be able to account for it. As it is, though, some things are beyond their control, and that leads to three-sigma results going *poof* with some regularity.

But as fearsomely complicated as particle physics experiments are, I don't for a second believe that they're less well controlled than life-science experiments. And, indeed, you see regular bouts of hand-wringing about how often significant findings go *poof*-- see, for example, Derek Lowe on irreproducible drug discovery candidates, or even Jonah Lehrer's much-derided "decline effect" piece from last year. The simplest explanation for this is probably that, just as in particle physics, the numerical tests of "significance" aren't really all that meaningful. A lot of results that appear significant in a single experiment turn out to be chance, despite the fact that the uncertainty analysis for that experiment suggested it was a solid result.

As a result, arguments about whether a given result is just above or just below an arbitrary and conventional threshold seem foolish. Doing the calculations wrong is still a major mistake, but whether they're done correctly or not, we should stop pretending that "statistically significant" is some kind of magic guarantee of quality.


Do you have a methodological alternative to suggest?

A methodological alternative? No. This is purely a sociological sort of complaint-- that collectively, we're putting a little too much weight on what are, ultimately, arbitrary category distinctions. A measurement with a p-value of 0.06 is not worthless compared to one with a p-value of 0.04, but we too often act as if it is.

A nice writeup, and I very much like calling out the arbitrariness of "significance", but I take a little bit of issue with the claim that:

> The particle physics standard is where it is for good reason, because as we've seen several times over the last couple of years, results at the 3-standard-deviation level go away remarkably frequently. In principle, that level of statistical certainty means that only 0.27% of such results should turn out to be chance fluctuations, but in practice, it seems to be more like 60% of them that are bogus.

To get the 0.27% number, one has to assume some sort of knowledge about things that we don't have any knowledge of. Ferinstance, if we're talking about a search for new particles, and there are no new particles, then 100% of the results will be false positives from chance fluctuations. If the statistics are being done correctly, that'll be one false positive for every 300-or-so experiments (the 0.27% number), but the fact that 100% are bogus won't be an indication that anything is amiss.

Or are you saying that 60% of high-energy physics experiments are producing false positives at the 3-sigma level? I'm not in a position to dispute that number, but offhand that strikes me as implausibly high. There are a lot of high-energy experiments going on, and not that many new particle claims.

In ecology and population genetics, the situation is even worse. Everybody uses statistical significance, and the actual magnitudes of the measures are often secondary and are often not interpreted. This causes biologists to cheapen their criteria about what is an adequate measure of the effect they study. Any measure will do for these people, as long as it shows statistically significant results. As examples, I could mention Gst and Fst in population genetics, and many measures of diversity and differentiation in ecology. Many of these measures have deep mathematical flaws, but who cares, we can get p-values out of them!! I can give some particularly grim examples of this if anyone is interested.

In nature, it is never interesting to know only that there is a statistically significant difference between groups. Groups always differ in any measure worth studying, if only in the tenth decimal point. That difference, however small, can be detected and reported at whatever level of statistical significance you want, if sample size is large enough. This kind of science is a meaningless game that can always be won if you have the resources to take a large enough sample. It demeans science.

The correct approach is almost always to use a measure whose magnitude is interpretable, and determine confidence intervals around the measure as the expression of statistical uncertainty.
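Both points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a two-sample z-test with a known common standard deviation, an observed difference taken to equal the true difference, and made-up numbers for the difference and the sample sizes.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def summarize(diff, sd, n_per_group):
    """Return (p-value, 95% confidence interval) for a mean difference."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    p = two_sided_p(diff / se)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return p, ci

# Hypothetical tiny difference: one hundredth of a standard deviation.
tiny_diff, sd = 0.01, 1.0
for n in (100, 10_000, 1_000_000):
    p, (lo, hi) = summarize(tiny_diff, sd, n)
    print(f"n = {n:>9,}: p = {p:.3g}, 95% CI = ({lo:+.4f}, {hi:+.4f})")
```

The same tiny difference goes from nowhere near significant to overwhelmingly significant purely because the sample grew. The confidence interval, by contrast, keeps reporting how small the effect is, which is the number a reader actually needs.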

The only times that p-values are really a useful final answer are cases in which the mere existence of an effect is surprising. For example, if someone were to measure the speed of light in a vacuum using photometers moving at different velocities, it would be interesting and newsworthy if someone rejected the null hypothesis that there was no difference between treatments at p<0.001. Such situations are rare in the life sciences.

There is a great blog post by mathematician/biologist Tom Leinster about this, with many references. The title used the same word as you used about this problem: "Fetishizing p-values".

http://golem.ph.utexas.edu/category/2010/09/fetishizing_pvalues.html

See my website or Google for my papers on how bad the standard measures of population genetics and ecology can be; the explanation is partly this cheapening of meaning by enshrining the p-values and ignoring the magnitudes of the measures.

Evidence for Higgs bosons is another case like the speed of light.

But Chad, this is much more than a sociological problem. This is a deep problem at the root of much bad science, at least in biology and many other fields.

I agree with your overall point (0.05 is an arbitrary cutoff, and 0.06 and 0.04 really aren't much different), with one caveat.

If you have a p-value of 0.003 and run an experiment 1000 times, you expect to find about 3 false-positives. If there's no real effect, those will be the only positives you get, and 100% of them will be wrong. Since there's no effect, moving out to 5-sigma doesn't change the percentage of positive results that are wrong, it's still 100%. All it means is you have fewer wrong results to sort through.

To calculate the chances of a measured positive being wrong, you need to know the chances of getting a false-negative. Let's say your false-negative rate is 0.1% and you have a p-value of 0.003 (i.e., a false-positive rate of 0.3%). You would expect, out of 1000 measurements, to get about 3 false positives and about 1 true result. In other words, the chance of your result being wrong is not 0.3%, it is 75%.

The difference in cut-offs for p-values is probably because particle physics involves higher false-negative rates than many biology experiments.

It is instructive to imagine what a science would look like if all its ideas were rubbish, but it was very careful about statistical significance.

Given that modern AI is heavily influenced by probabilistic methods, it is not surprising that AI research has its own variant of these.

Human nature being what it is, it is also not surprising that we have our fetishists, and our share of people who have a hard time grasping that it's either an artificial threshold, or a threshold finely tuned through experiment and experience... or both.

Whoops - my example should have a false-negative rate of 99.9% to get the 1 true positive in 1000 measurements.
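The corrected example can be sketched in Python. The split into 1000 measurements where nothing is there and 1000 where a real effect exists is an assumption made here to reproduce the "3 false positives and 1 true result" arithmetic; the function and parameter names are illustrative.

```python
def fraction_of_positives_wrong(n_null, n_real, alpha, power):
    """Expected share of positive results that are false.

    n_null: measurements where no real effect exists
    n_real: measurements where a real effect exists
    alpha:  false-positive rate (the p-value threshold)
    power:  1 minus the false-negative rate
    """
    false_pos = alpha * n_null
    true_pos = power * n_real
    return false_pos / (false_pos + true_pos)

# No real effect anywhere: every positive is wrong, at 3 sigma or 5 sigma.
print(fraction_of_positives_wrong(1000, 0, alpha=0.003, power=0.0))
print(fraction_of_positives_wrong(1000, 0, alpha=5.7e-7, power=0.0))

# Corrected example: false-negative rate 99.9%, so power = 0.001.
print(fraction_of_positives_wrong(1000, 1000, alpha=0.003, power=0.001))
```

The last line gives about 0.75: three expected false positives against one expected true positive. The threshold alone never tells you what share of your positives are real; you also need the power and how often an effect exists in the first place.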

The reason for the low significance threshold in particle physics is that you are looking for effects with a tiny probability. This means you need an extremely long run of data and that the probability of obtaining a result by chance is a lot higher. In contrast, an effect that small in say medicine would be so insignificant that it would not be worth bothering with, e.g. if 1% of people vs. 1.00000001% of people recover from an illness. Since the life sciences are concerned with larger effects and they typically have much smaller data sets, it is not unreasonable for them to use a higher significance threshold.

However, your main point that significance thresholds are always arbitrary is still valid. The solution to this is to use Bayesian statistics, which still contains some arbitrariness, but at least that arbitrariness is made explicit. Personally, I would prefer an estimate of the probability that a hypothesis is true to a fallaciously clear cut significant vs. non significant claim. Sadly, it will be a long time before we see this, as the Fisher significance methodology is set in stone by regulatory bodies like the FDA.

In principle, that [3σ] level of statistical certainty means that only 0.27% of such results should turn out to be chance fluctuations, but in practice, it seems to be more like 60% of them that are bogus.

As others have pointed out, you are mistaking the chance of getting a false positive in an arbitrary experiment for the chance that, given you have a positive result, it is a false positive. Generally, when you run 1000 experiments and three of them appear to show something interesting, you are not going to follow up the other 997 to see whether they should have shown something interesting but didn't. That's especially true in particle physics, where the need to limit data collection rates leads to triaging out events that are deemed "uninteresting" by the people who designed the data collection software. So you look at the three "interesting" experiments, and you find that in two of them the result does not hold up (there is some instrumental or physical effect that you did not consider, or perhaps it really was a statistical fluke).

As for why particle physics uses a 5σ threshold for significance, part of the answer is, "Because they can." A particle physics experiment is designed to repeatedly collide a particle of known energy into a target, and they can accumulate the millions of shots needed to confirm a putative phenomenon to 5σ of significance. Most fields, even within physics, don't have that kind of luxury. The experiment that confirmed the GZK cutoff, to take one example, had detectors spread out over a large area of the Argentine pampas because they were looking for particles with fluxes of the order of 1 per km^2 per year, or less. You will never get a 5σ result with a detector that is that big because it needs to be.

The 60% figure was an excessively deadpan joke. I probably should've said "60% of 3-sigma results reported by experiments whose funding is up for renewal turn out to be bogus."

I think the question of false positives is a little murky for particle physics experiments, in that the figures reported are the number of detections above the expected background. In that sense, a "known" false positive rate is already built into the important measurements. This is the usual source of the problems, though, as determining that background appears to be something of a Black Art. Since they're looking for really unlikely events in the first place, small errors in determining that background can wipe out otherwise solid results.

Actually, may I point out that you do not expect there to be 0.27% false positive from a 3 sigma result?

3 sigma means that 0.27% of the noise is at or above that strength, not that 99.73% of the events at that significance or above are not noise.

Thus, to properly evaluate the probability of a result being a false positive, one would have to integrate the total signal at the desired significance and above, and likewise integrate the noise at that significance and above. The probability of a false positive is then p = 1 - (S-N)/S = N/S, where S is the signal integral and N is the noise model integral. Note that (S-N)/S is the true positive probability.

This is not the same thing as the error function integral that gives the classic results for n-sigma significance. Just having an event at n-sigma means nothing at all if the signal distribution is consistent with pure noise.
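A rough Python sketch of that calculation, under assumptions that are not in the comment itself: noise is a unit Gaussian, real events are a unit Gaussian shifted by `signal_mean`, a fraction `signal_fraction` of all events are real, and only the upper tail is counted.

```python
import math

def upper_tail(x):
    """P(Z >= x) for a standard normal variable."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def false_positive_fraction(threshold, signal_fraction, signal_mean):
    """Share of events above `threshold` (in sigma) that are pure noise."""
    # N: noise-model integral above the threshold.
    noise_above = (1.0 - signal_fraction) * upper_tail(threshold)
    # Real events above the threshold (assumed Gaussian, shifted by signal_mean).
    real_above = signal_fraction * upper_tail(threshold - signal_mean)
    # S: everything observed above the threshold.
    total_above = noise_above + real_above
    return noise_above / total_above  # equals 1 - (S - N) / S

# Pure noise: every 3-sigma event is a false positive.
print(false_positive_fraction(3.0, signal_fraction=0.0, signal_mean=4.0))

# A rare real effect: a large share of 3-sigma events can still be noise.
print(false_positive_fraction(3.0, signal_fraction=0.001, signal_mean=4.0))
```

The point survives the simplifications: the answer depends on how much real signal there is above the threshold, not just on how unlikely the noise is to reach it.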

Just a friendly reality-check from your friendly neighborhood astronomer.

I remember one of my thesis advisors insisting that results that were not significant at the 5% level could not be discussed, while anything that was significant had to be explained. He really did not get the basic concept of significance, yet other faculty deferred to him for his statistical knowledge.

Part of the reason for the varying standards used in different fields is the consequences of saying there is no difference when in fact there is. In medicine, it is generally more important to avoid missing a genuinely better treatment than to wrongly think a treatment is better when in fact it is not. Therefore, P = 0.10 is often used as the cut-off value. Expectations are also important. You would not use P = 0.10 for investigations into the paranormal.

An alternative procedure to just using the probability of falsely obtaining a difference is to use the Type I / Type II error ratio as the criterion, but it is harder to use and harder for most people to visualise and does not seem to have caught on.

The obligatory XKCD link.

One of the fundamental concepts I attempt to drill into my students' heads!

I generally lead them by the hand starting with trees and conditional probability, then introduce some (semi) real-world examples to motivate the correct choice. Being in the undergrad sequence, usually this is a pregnancy test with some contrived numbers so that I can then get them to appreciate being wrong one way is in some sense worse than being wrong another way, no matter what the probabilities (or the costs or the actual incidence) might suggest. I find that constantly referring back to these simple models as we go through hypothesis testing, p-values, paired samples, independent samples, factorial design, etc. really grounds them when it comes to designing a procedure that's more prone to type I or type II errors as the case may be. It's also instructive as a class exercise to compare the different β's different students wish to assign to the pregnancy test :-)

Counterintuitively, this seems to be something that the business types seem to get better than the science types, possibly because probability (and statistics) are treated more formally in the math-heavy track.

I would agree with the poster above who noted that statistical ignorance seems to be part and parcel of all the sciences, physicists as much as psychologists, btw. It's just that physicists happen to have the luxury of working in a field where the collection of large data sets tends to mitigate that ignorance. Hopefully this will be one of the big game-changers of the 21st century, when everyone in every discipline will have access to large, cheap data-sets courtesy of our modern consumer habits.

Part of the reason for the varying standards used in different fields is the consequences of saying there is no difference when in fact there is. In medicine, it is generally more important to avoid missing a genuinely better treatment than to wrongly think a treatment is better when in fact it is not. Therefore, P = 0.10 is often used as the cut-off value. Expectations are also important. You would not use P = 0.10 for investigations into the paranormal.

Got it in one, and as I just wrote, something I try to make really clear to my students. I think what Chad's criticism comes down to is not that these p-values are arbitrary (and Bayesian self-promotion to the contrary, they don't deal with it any better either; having that critical number come from some formula doesn't make it any less arbitrary), but that there isn't enough discussion in the papers as to why a particular value was chosen - or whether or not it was chosen beforehand ;-) There's a rather significant corpus of work on the origins of the canonical "p=0.05", btw. Here's one such meditation:

There are many theories and stories to account for the use of P=0.05 to denote statistical significance. All of them trace the practice back to the influence of R.A. Fisher. In 1914, Karl Pearson published his Tables for Statisticians & Biometricians. For each distribution, Pearson gave the value of P for a series of values of the random variable. When Fisher published Statistical Methods for Research Workers (SMRW) in 1925, he included tables that gave the value of the random variable for specially selected values of P. SMRW was a major influence through the 1950s. The same approach was taken for Fisher's Statistical Tables for Biological, Agricultural, and Medical Research, published in 1938 with Frank Yates. Even today, Fisher's tables are widely reproduced in standard statistical texts.

Fisher's tables were compact. Where Pearson described a distribution in detail, Fisher summarized it in a single line in one of his tables making them more suitable for inclusion in standard reference works*. However, Fisher's tables would change the way the information could be used. While Pearson's tables provide probabilities for a wide range of values of a statistic, Fisher's tables only bracket the probabilities between coarse bounds.

The impact of Fisher's tables was profound. Through the 1960s, it was standard practice in many fields to report summaries with one star attached to indicate P ≤ 0.05 and two stars to indicate P ≤ 0.01. Occasionally, three stars were used to indicate P ≤ 0.001.

Still, why should the value 0.05 be adopted as the universally accepted value for statistical significance? Why has this approach to hypothesis testing not been supplanted in the intervening three-quarters of a century?

It was Fisher who suggested giving 0.05 its special status. Page 44 of the 13th edition of SMRW, describing the standard normal distribution, states

The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.

Which is pretty much in accord with the apocrypha I'm familiar with. That is, the current obsession with the 0.05 number is a historical artifact traced back to a very few influential sources that the current crop of senior researchers were exposed to en masse back in the 60's and was in any case an explicitly acknowledged rule of thumb from the beginning.[1] Perhaps as the older generations die off younger researchers will be more statistically savvy . . . in part because modern computing makes such savviness (at least in the mechanics) so easy.

[1] There was a certain yellow hardcover dating back to at least the 30's, I believe, and which seemed to be the first canonical text after differential equations for budding physics majors back in the 70's and 80's. Does anybody perchance recall what it was? Very plain illustrations, as I recall, with a coarse yellow binding. But in any event, it seemed to be a book that everyone of a certain age who was a physics major back then seemed to be familiar with.
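Fisher's figures in the passage quoted above (1.96 for P=0.05, and one false indication in 22 trials at twice the standard deviation) can be checked with the Python standard library. This is just a sanity check, assuming a two-sided test on a standard normal distribution; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Deviation at which the two-sided probability is exactly 0.05.
z_for_p05 = std_normal.inv_cdf(1 - 0.05 / 2)

# Two-sided probability of exceeding twice the standard deviation.
p_for_2sd = 2 * (1 - std_normal.cdf(2.0))

# How many trials, on average, per false indication at that cutoff.
trials_per_false_alarm = 1 / p_for_2sd

print(f"z for P = 0.05: {z_for_p05:.3f}")
print(f"P beyond 2 sd:  {p_for_2sd:.4f}")
print(f"about one false indication in {trials_per_false_alarm:.0f} trials")
```

So the "1 in 20" and "once in 22" figures are both right: the first belongs to 1.96 standard deviations, the second to the rounded value of 2.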

One nit pick:

The standard for particle physics is due to experiences going back a half century. I believe the first editorial about "bump hunting" standards (and hence what papers would be rejected without review) goes back to when "Letters" were short columns in the Physical Review or the first year or two of Physical Review Letters.

The problem, by the way, is mostly with background events and how you distinguish background fluctuations from a real signal. The best trick was smoothing the data until fluctuations turned into bumps.

"In principle, that level of statistical certainty means that only 0.27% of such results should turn out to be chance fluctuations, but in practice, it seems to be more like 60% of them that are bogus."

No, it doesn't mean that at all. Learn the difference between p-values and posterior probabilities please.

@17:

I'm guessing you mean the first edition of the data analysis book by Bevington.

"meaning a less than 5% probability that the result would've occurred by chance."

Ah not really. And I agree with #17. A regular 0.05 p value means that if you take a specific model with a given distribution and take samples from that particular model 5% of the time you will get something that matched the sample you have in some properties.

If you notice there is no mention of reality in that description at all. The hidden assumption that reality is modeled accurately by normal or other distributions means P values are useful heuristics sometimes, but should never be used to give odds or probabilities. Or compared ... or ...

One thing I'm surprised no one has mentioned (well, not surprised exactly - you'll never lack for people correcting your statistics in blog comments!) is the word "significance", which after all has a standard English language meaning as well as a statistical one. You are obviously correct about the arbitrary nature of the P-value, and anyone who has worked with an even slightly complex system understands the allure of fudging the stats (throw out the high and low outliers, or use a different test until you get the P-value you need). A big part of science is resisting that temptation - it helps to think about the poor grad student or undergrad who tries to use your methods for their own work! But the English meaning of "significant" is a big deal here - how would you like me to say your work is insignificant? I certainly would not like that. But not every P less than 0.05 effect in every system is "significant" in the colloquial sense. So we get in trouble when we slide down the slippery slope from finding out if an effect exists to finding out if it is "significant". Especially if the grant renewal depends on it...

Many years ago I had the privilege of taking classes with the statistician Jonckheere. His revelations on the abuse of statistics made my jaw drop. The worst of course being that nearly all data sets are simply assumed to have all the properties that are needed to make a particular test of significance valid in the first place. He had us in stitches showing how simple sleights of hand could pretty much magic what you wanted - hiding the outlier with one hand while moving the wand with the other.

As to p = 0.05, his question was always 'would you get on a plane if those were the odds?'

In the end I felt really that research and statistics should be done by different people, and that being able to be quick on the data pad with your stats package was far too often assumed to mean you knew what you were doing.

Found a citation for the "bump hunting" editorial I remember seeing on someone's wall, which was from 1970:

http://prl.aps.org/abstract/PRL/v25/i12/p783_1

However, my vague memory was that there were earlier editorial policies about bumps that might be cited on those yellowing pages.

Something else that stunned me when I first came across it is the business of 'postselection' that is frequently played in the life sciences. Basically, after you've collected the data you look through subsamples with as many categories as you like till you've found some statistical deviation. Needless to say, you should be adjusting your statistical significance to your random sampling of subgroups. (See above xkcd link.) But what's more interesting, in physics, collaborations go to great lengths to make sure they've decided on a procedure to evaluate data a priori rather than a posteriori to avoid exactly this pitfall. (Of course once the data is public, somebody else will go and find a 'signal' in noise.) One wonders why sound procedures don't spread.
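The size of the postselection problem is easy to estimate. A minimal Python sketch, assuming the subgroup tests are independent and there is no real effect anywhere; the Bonferroni correction shown is one standard adjustment, not necessarily the one any given collaboration uses.

```python
def chance_of_spurious_hit(n_subgroups, alpha=0.05):
    """Probability that at least one of n independent null tests
    comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** n_subgroups

def bonferroni_threshold(n_subgroups, alpha=0.05):
    """Per-test threshold that keeps the overall false-alarm rate near alpha."""
    return alpha / n_subgroups

for k in (1, 5, 20):
    raw = chance_of_spurious_hit(k)
    corrected = chance_of_spurious_hit(k, bonferroni_threshold(k))
    print(f"{k:>2} subgroups: {raw:.2f} uncorrected, {corrected:.3f} corrected")
```

Slice the data twenty ways and the odds of finding at least one "significant" subgroup by chance are better than even. Dividing the threshold by the number of looks brings the overall rate back to about 5%.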

I remember one of my thesis advisors insisting that results that were not significant at the 5% level could not be discussed, while anything that was significant had to be explained.

This is the sort of thing that first got me thinking about this business. Several years ago now, I heard a student talk on something econ-ish (might have been psychology) where the student introduced a model with about nine different elements that they hypothesized would change in different ways under their experimental conditions, but only explained the results for three of them for each of two subgroups (not the same three, either-- he talked about the changes in three parameters for one sub-group, then the changes in an entirely different set of three parameters for the other sub-group). I asked about the other six in the Q&A period, and was told "Oh, we measured those, too, and they changed in the direction we expected, but the differences weren't statistically significant." He had clearly been told that anything that wasn't significant didn't count, despite the fact that even non-significant shifts in the right direction on six more parameters should count as further evidence supporting their model.
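As a back-of-the-envelope check on that last claim, here is a simple sign-test sketch in Python. It assumes the six parameters are independent and that, if the model were wrong, each would be equally likely to shift in either direction. Both assumptions are mine, not something the student's talk established.

```python
from math import comb

def prob_at_least_k_in_expected_direction(n, k):
    """Chance that k or more of n independent 50/50 shifts
    land in the predicted direction purely by luck."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# Six parameters, all six shifting the way the model predicted.
p_all_six = prob_at_least_k_in_expected_direction(6, 6)
print(f"P(6 of 6 by chance) = {p_all_six:.4f}")
```

That works out to 1 in 64, under 2%. Six individually "non-significant" shifts in the predicted direction would, under these assumptions, be fairly strong evidence taken together.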

Again, in biology, psychology, economics, and many other fields, the real problem is not the mere misjudging of statistical significance. There is a much deeper problem, which nobody here is touching. The whole enterprise of testing a null hypothesis of "no effect" is usually misguided. Most of these problems are problems of parameter estimation (for which confidence intervals are the proper expression of statistical uncertainty), not null hypothesis testing with its accompanying p-values.

This is the point of McCloskey and Ziliak's famous book, The Cult of Statistical Significance (an odd book but one which makes an important point).

Indeed. Not only is statistical significance arbitrary (a matter of degree, no "real" boundary where e.g. 0.95 significance is different in kind), but statistical claims are not strictly "falsifiable" in Popperian terms. Think: No particular experiment can prove or falsify a statistical claim such as "the chance of X is 0.7" (or even a claim it's within a range.) You do the run and get 0.702, but the "real" chance could be 0.6 and this was just the run you got. If you do get 0.57, it could just be a chance of getting that from "real P" of 0.7 at work, and so on. (And remember, a number of trials is "a" experiment because your knocking off throwing dice etc. for awhile and doing more later is not a real boundary either.) The small chance of getting an off result is not the point, that's just more example of the same "demarcation" problem.

This also brings up issues, if there are infinite other worlds etc. how should they *define* "laws" if all possible statistical results happen somewhere? If somewhere the statistics of decay for Co-60 imply a "real" half-life of 3 years instead of 5, is nuclear physics "really different" there or should physicists in that world comprised of unlikely runs use their theoretical acumen to decide what "should" have happened and decide they are indeed in a world that is a rare statistical anomaly?

This also has some implications for glib notions of MWI: if all the outcomes occur, then the "normalcy" of results that actually illustrate e.g. the Born Rule is in question. Sure, there are more such worlds by combinatorics, but even then there are branching statistics and not the actual BR distribution. (The flaws of the BR in MWI were harshly exposed in the big discussion earlier.) The statistics apparently should be based on "how many" branches; how else to get frequentist statistics? BTW I do not think apologists got far with defending MWI in that thread or in my blog post about it either. Really, claiming that with an infinite number ... but in trials there are specific outcomes and also infinite sets are incommensurable anyway in that proportional way.

Bravo! "Statistical significance" is not only arbitrary, but its pervasive use in the interpretation of results is hindering innovative thinking and detrimental to the development of science. The cutpoint of 0.05, subjectively introduced by Neyman and Pearson as an acceptable probability for a type 1 error (though in some instances, a type 2 error may be more severe), eliminates the need for researchers to think critically about their results. It provides an "objective" "yes" or "no" response when few questions in science are so simple or parsimonious. Furthermore, the answer to why we emphasize random error rather than systematic error and biases in science remains elusive to me. Shouldn't we be more concerned about human error? In epidemiology we tend towards the use of confidence intervals; useful indicators of effect estimate precision. To no avail, however, because these mathematical tools are now used to conduct tests for statistical significance at that oh-so-arbitrary level of 5%. I commend those who are looking beyond the p-value and recommend that we bring intelligent thought back into the interpretation of our results.

ScentOfViolets,

"Bayesian self-promotion to the contrary, they don't deal with it any better either; having that critical number come from some formula doesn't make it any less arbitrary"

A Bayesian typically wouldn't choose a critical number at all (although there is such a thing as a Bayesian p-value). Rather, they'd report the posterior distribution for the parameters or predictions of interest, or a Bayes factor or odds ratio. If they did choose a threshold, they'd probably tie it to some loss function through decision theory.

But then, frequentists don't have to use arbitrary "significance" thresholds either. They can report confidence intervals or effect sizes instead of p-values.

First, using a p-value cutoff of 0.05 is not completely arbitrary. Rather, it represents a reasonable compromise between type I and type II errors.

Having said that, the difference of 0.049 and 0.051 is of course small and should not warrant opposite conclusions.
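
To put a number on how small that gap is, here is a minimal sketch that converts two-sided p-values into the z-scores that would produce them. It assumes a normally distributed test statistic, and the helper name is invented for this example.

```python
from statistics import NormalDist

def z_for_two_sided_p(p: float) -> float:
    """z-score whose two-sided normal tail probability equals p
    (assumes a normally distributed test statistic)."""
    return NormalDist().inv_cdf(1 - p / 2)

for p in (0.049, 0.050, 0.051):
    print(f"p = {p:.3f}  ->  z = {z_for_two_sided_p(p):.3f}")
```

The two results on either side of the cutoff differ by less than 0.02 standard errors, far smaller than the noise in almost any real measurement.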

The use of a hard cutoff comes from the Popperian philosophy of either rejecting or not rejecting a certain hypothesis (aka model). Often this binary approach is unreasonable in biology.

More and more life scientists do realize that effect sizes are of more interest than p-values alone, especially in non-experimental biology. What use is it to show that the Japanese tsunami had a "significant" effect on local wildlife, without specifying by how much it was altered?

The last paragraph of Andreas's comment (#31) hits on the real problem with p-values. P-values, even if interpreted correctly, don't tell us what we want to know. Biologically useful inferences demand a different statistical model. Not null-hypothesis testing and p-values but parameter estimation and confidence intervals.

All conventions are, by definition, arbitrary.

Statistical significance as a methodological convention is as arbitrary as both real-world requirements and human intuition. It is a tool to detect truth according to a large set of assumptions, not an arbiter of the holistic truth. [Ad infinitum about how statistical significance is about much, much more than the p-value.]

Two and three standard deviations are the convention in many sciences because they lie at easy-to-calculate and easy-to-understand intersections in the formulas for our tiny brains, something very useful prior to the advent of cheap computers. They are also at easy-to-translate ratios for explanatory purposes, 1-in-20 and 1-in-100, where many of the other numbers generally line up well for most (but not all) purposes (and which still should be looked at to detect anomalies).
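
For reference, the tail probabilities behind those round numbers can be checked directly. This sketch assumes a normal distribution and two-sided tails, and the function name is made up for the example. Under those assumptions 2 SD is closer to 1-in-22 and 3 SD to 1-in-370, while exactly 1-in-20 and 1-in-100 sit near 1.96 and 2.58 SD.

```python
from statistics import NormalDist

def two_sided_tail(k: float) -> float:
    """Probability that a normal variable lands more than k standard
    deviations from its mean, in either direction."""
    return 2 * (1 - NormalDist().cdf(k))

for k in (1.96, 2.0, 2.576, 3.0):
    p = two_sided_tail(k)
    print(f"{k:5.3f} SD: p = {p:.4f}  (about 1 in {1 / p:.0f})")
```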

Of course it's arbitrary, but it's a very useful arbitrary.

"a p-value less than 0.05, meaning a less than 5% probability that the result would've occurred by chance."
Actually, a P value of less than 0.05 means that differences in outcome that large will arise by chance alone less than 5 percent of the time. It says nothing at all about reality, about the probability of the result having occurred by chance.

I'm a statistician. As to why there are so many bogus three-sigma results, bad experiments are the most likely answer. The other possibility is that, given enough experiments, there will eventually be a false positive.
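
The "given enough experiments" effect is easy to quantify. The sketch below assumes the experiments are independent and that no real effect exists in any of them; the function name and the experiment counts are illustrative only.

```python
def chance_of_any_false_positive(alpha: float, n_experiments: int) -> float:
    """Probability that at least one of n independent null experiments
    crosses a significance threshold with false-positive rate alpha."""
    return 1 - (1 - alpha) ** n_experiments

# 0.0027 is roughly the two-sided three-sigma tail of a normal distribution.
for alpha in (0.05, 0.0027):
    for n in (1, 20, 100, 1000):
        p = chance_of_any_false_positive(alpha, n)
        print(f"alpha = {alpha:<6} n = {n:<5} P(any false positive) = {p:.3f}")
```

With a threshold of 0.05, twenty null experiments already give about a 64% chance of at least one "significant" result.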

As to different branches of science using different levels of significance: well, if it is arbitrary, then why is this a problem? It isn't.

>> This also brings up issues, if there are infinite other worlds etc. how should they *define* "laws" if all possible statistical results happen somewhere? If somewhere the statistics of decay for Co-60 imply a "real" half-life of 3 years instead of 5, is nuclear physics "really different" there or should physicists in that world comprised of unlikely runs use their theoretical acumen to decide what "should" have happened and decide they are indeed in a world that is a rare statistical anomaly?

I guess that would be the world where coin flips always come up heads. So no one would believe in probability, and rightly so. Anyone who said he had seen a coin flip come up tails would be considered a crackpot.

More seriously, it is almost certain that in our world some bogus results have been accepted out of pure luck. They aren't significant results. If they were, someone would try to use that result, and would get something that didn't work. If the statistical anomaly went on forever, well, that's less likely than anything you can imagine that is at all possible. Much less likely than, say, a W Bush fifth term.

Patrick, you made a good point - and this means indeed that if we accept a 95% confidence rate, then 5% of supposed "true" results are indeed bogus! Maybe this accounts for some of the odd reversals and confusion we hear about. Maybe people should concentrate more on just reporting what the statistical correlation was, and let people draw their own conclusions, instead of worshiping at the 0.95 altar.

As for the other points: remember, if all the worlds exist, then there *are* ones where weird stuff happens, so talking about "the chance" is almost pointless (and again, what happens with infinite sets?) If determinism is true after all, it isn't the case "there is still a 50/50 chance after a big run of heads that you will get tails next time." Uh, no: there is a preordained outcome and it is whatever is the configuration of the world you happen to be in. I suggest studying my points above some more, check out my blog etc.

Uh, sorry, I can't imagine I made the mistake of putting the replied-to name in the Name field but I did, sorry.

"check out my blog"

OK, I did, and replied at length in the "Philosopher" thread. Enjoy.

The corollary I hate may not have gotten enough attention: the authors get p=.064 or such (perhaps after a long struggle of dumbing down the data or test), and write in abstract and discussion that there is no association. Sometimes their result is "no association" and they provide no statistics whatsoever - we must guess they mean p was bigger than .05, but we have no idea by how much.
And they may actually think they have done something like prove there is no association. But it's so simple: just collect small and crappy data. Dichotomizing the continuous variables is also useful. Oh, and don't take logs.
(My experience is with biomedical research, where the training of many biology PhDs in the design and analysis of experiments is abysmal. In 2 factor crossed experiments, they sometimes don't get when it's the interaction that should be tested - even after you explain it three different ways.)

I did realize sometime later that my last sentence in #39 is exactly the "difference of difference" thing the original articles were talking about.
For me it can go control and real knockdown shRNAs, with and without drug. I see some people first show gene A goes up with the drug in the knockdown cells, but feeling that this is somehow not enough ("why did I run those control cells?") then show that the knockdown with drug has gene A at greater expression than control shRNA with drug. I tell them they aren't testing the right hypothesis. They respond "but we got a p-value, and it was small". I say "actually, did you notice that you did two tests rather than one", and try to draw pictures of two lines that are or are not parallel.
There are biologists with beautiful minds who apprehend these things with astonishing speed, and never forget them, but then there are the others, whose minds close the minute I write the model down - gotta remember to not do that so much.
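
The "two tests rather than one" problem can be sketched with summary numbers. Everything below is hypothetical: the effect sizes and standard errors are invented, the two cell lines are assumed to be measured independently, and a normal approximation stands in for whatever test a real analysis would use.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(estimate: float, se: float) -> float:
    """Two-sided p-value for an estimate with standard error se
    (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(estimate) / se))

def difference_of_differences(effect_a, se_a, effect_b, se_b):
    """Test whether two independently estimated effects differ from each
    other, which is the interaction, rather than testing each against zero."""
    diff = effect_a - effect_b
    se = sqrt(se_a ** 2 + se_b ** 2)
    return diff, se, two_sided_p(diff, se)

# Hypothetical drug effects on gene A expression (arbitrary units).
knockdown_effect, knockdown_se = 25.0, 10.0
control_effect, control_se = 10.0, 10.0

print("knockdown vs zero:", round(two_sided_p(knockdown_effect, knockdown_se), 3))
print("control vs zero:  ", round(two_sided_p(control_effect, control_se), 3))
diff, se, p = difference_of_differences(
    knockdown_effect, knockdown_se, control_effect, control_se
)
print(f"interaction: {diff:.1f} +/- {se:.1f}, p = {p:.3f}")
```

With these made-up numbers one effect is "significant" and the other is not, yet the test that actually addresses the hypothesis, the difference between the two effects, is nowhere near significant.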

5 out of 4 people don't understand statistics.

I can't help but be reminded of a conversation that sounds the same on what might seem to be a different subject, but one which concerns me greatly as a pedagogue.

I speak, of course, of the arcane and dreaded procedure of assigning grades at the end of the semester :-)

Now, if I had my druthers, I wouldn't assign letter grades at all. I'm not sure what I'd replace them with, but something like a form with the level of proficiency in various categories with radio buttons like "has attained mastery" or "understands the basic concepts" ticked off wouldn't be any worse than what I'm doing now. Of course, in the end, I would have to give a firm yes or no as to whether a student has passed the class or needs to retake it. And the administration would expect me to be able to justify my decision by referring to an impartial process that gives an unambiguous answer.

Sound familiar? :-)

Replace the pass/fail cut-off with a p-value and pretend that what are really subjective (though accurate) judgement calls in either context were arrived at by turning a reliable crank, and I think that we're really talking in the end about the same thing.

Sure, I'd love to give a more detailed evaluation of my students' progress rather than a single-valued parameter. But at the end of the day, no one is really interested in a set of sliding scales or reporting a range; they want just one number. The same, alas, seems to be true when reporting experimental results as well.

One irony is that statistical significance turns falsifiability on its head. Instead of attempting to falsify one's proposed hypothesis, one looks for evidence that a different hypothesis (the Null Hypothesis) is false. Having rejected the Null Hypothesis at some level of statistical significance, one then claims that that rejection supports one's proposed hypothesis. Indeed it does, but it supports any hypothesis except the Null! With regard to one's proposed hypothesis, statistical significance is actually confirmatory, and, as we know, confirmation is weak.

Min, it is even worse than that. As I explained above, the null hypothesis in sciences like biology and psychology is usually KNOWN TO BE FALSE IN ADVANCE. Rejecting such a null hypothesis provides no confirmation of other hypotheses at all, since we didn't even need to do the experiment to reject that kind of null hypothesis. Yet the literature is full of this nonsense, and many ecologists can't see the problem.

As to p < 0.05, his question was always 'would you get on a plane if those were the odds?' Which is clearly purely rhetorical.

A better question would be: would you fly airline x or airline y, given crashes for x > crashes for y and p < 0.05?
And I guess your choice would also depend on the absolute effect size of the crash frequencies.

Hmmm ... I don't think things are as bad as that, because one can presumably rationally assume that if a controlled factor "X" is present compared to when it isn't, disconfirmation of the null hypothesis "X doesn't have an effect" would be supportive that "X" is the reason for correlation, not other things, no? Also I suggest at least perusing my comments about physical law in general, QM and MWI if you have any interest in larger implications. BTW I previously referred to 0.95 confidence, which is simply the other half of the 0.05 for disconfirmation.

Neil, you said "if a controlled factor "X" is present compared to when it isn't, disconfirmation of the null hypothesis "X doesn't have an effect" would be supportive that "X" is the reason for correlation, not other things, no?"

No, I don't think so. Disconfirmation of an always-false null hypothesis provides no information about the world. One starts to get real info when there are estimates of the magnitude of the deviation from the null hypothesis. This is not provided by the p-value (which is confounded with sample size).

Neil @46: Your post assumes that you are in a lab setting where you can control factor X. Many of us work in fields (such as astrophysics, geophysics, or environmental science) where you have to take whatever data you can get, and your formulation leads to the correlation/causation issue.

Generally speaking, Lou is right that what you are really interested in is the magnitude of the effect. Typically you are working with some kind of model which already tells you what correlations to expect. In this case p > 0.05 where you were expecting an effect is evidence that your model is nonsense, but p <= 0.05 doesn't tell you anything you didn't already know. The one exception would be if you really don't have a model and are taking shots in the dark (this scenario doesn't arise often in physical sciences, but there may be situations in the humanities or some branches of medicine where it does). In this case p-values might suggest what kind of model you need to construct. But it's still up to you to construct that model.

Well, Lou's critique is structurally a vague-context thing: sure, *if* you have the control, that is almost by definition that a correlation is with a specific factor. But yes I know, what if e.g. smokers also don't get enough of some nutrient which is really why they get cancer, so if they got more of that the smoking wouldn't kill them etc. So yes it's hard to prove a connection. So now what, how much of science can we trust?

Neil, you are missing my point. I am not saying we should throw up our hands, I am pointing out that there is a correct way to do these analyses, and p-values are not involved.

Take the hypothesis that an oil spill reduced the biodiversity of a beach. The null hypothesis is that there is no difference in diversity between the oiled beach and a neighboring clean beach. So we measure the diversities of the beaches and do a statistical test on the samples to see whether they differ in diversity. Say the p-value is highly significant, meaning that the data are extremely unlikely if the null hypothesis of "no differences" were true. We would reject the null hypothesis. But what can we conclude from that? Nothing, because no two natural beaches are ever identical, so the null hypothesis ("no differences") will always be rejected for any two beaches, if sample sizes are large enough. Furthermore, if the effect of the oil is to reduce diversity by a tiny fraction, this too will produce a p-value as significant as we would like, if sample size is large enough.
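
The dependence on sample size can be made concrete with a toy calculation. The sketch assumes a fixed, tiny true difference in a diversity index between the two beaches, a known common standard deviation, and a two-sample z-test that takes the observed difference as exactly equal to the true one; all the numbers and the function name are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def p_value_for_difference(delta: float, sd: float, n_per_beach: int) -> float:
    """Two-sided p-value of a two-sample z-test when the observed mean
    difference equals delta, each beach has n_per_beach samples, and both
    share a known standard deviation sd (all illustrative assumptions)."""
    se = sd * sqrt(2 / n_per_beach)
    return 2 * (1 - NormalDist().cdf(abs(delta) / se))

tiny_difference = 0.02   # hypothetical, biologically negligible
spread = 1.0             # hypothetical within-beach standard deviation
for n in (100, 10_000, 100_000, 1_000_000):
    p = p_value_for_difference(tiny_difference, spread, n)
    print(f"n per beach = {n:>9,}  p = {p:.2e}")
```

The difference never changes, but the p-value goes from unremarkable to overwhelmingly "significant" purely because n grows.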

The question we really should be asking is not "Is there a difference in biodiversity between these two beaches?" but rather "How big is the difference?" We answer that question by making sure that we use a diversity measure whose magnitude is easily interpretable, and using confidence intervals to express the uncertainty in our measurements. Then we can compare the magnitude of the difference in diversity between the two beaches, and see if it is biologically important.
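
A minimal sketch of that estimation approach follows. The diversity values are made up, each number is imagined as one sampling plot, and the interval uses a normal approximation; with samples this small a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def difference_with_ci(sample_a, sample_b, level=0.95):
    """Difference in means (a minus b) with an approximate confidence
    interval based on the normal distribution."""
    diff = mean(sample_a) - mean(sample_b)
    se = sqrt(stdev(sample_a) ** 2 / len(sample_a)
              + stdev(sample_b) ** 2 / len(sample_b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se

# Hypothetical diversity index values, one per sampling plot.
clean_beach = [3.1, 2.9, 3.4, 3.0, 3.2, 3.3]
oiled_beach = [2.6, 2.8, 2.5, 2.9, 2.7, 2.4]

diff, low, high = difference_with_ci(clean_beach, oiled_beach)
print(f"difference = {diff:.2f}, 95% CI ({low:.2f}, {high:.2f})")
```

The output is a magnitude with an uncertainty range, which can then be judged against what size of difference would matter biologically.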

Null-hypothesis-testing asks the wrong question, and p-values should not be used in these kinds of studies. As I said earlier, null-hypothesis-testing and p-values ARE appropriate when physicists analyze the Higgs data, because there the mere existence of an effect is the interesting question.

I should add that in any real study of oil pollution, we would have to use multiple beaches as controls, not just one.

So, how was the 5SD physics standard established?
