# Live by statistics, die by statistics

There is a magic and arbitrary line in ordinary statistical testing: the p level of 0.05. What that basically means is that if the p value of a comparison between two distributions is less than 0.05, then a difference at least that large would turn up by accident less than 5% of the time if there were no real effect. We'll often say that having p<0.05 means your result is statistically significant. Note that there's nothing really special about 0.05; it's just a commonly chosen dividing line.

Now a paper has come out that ought to make some psychologists, who use that p value criterion a lot in their work, feel a little concerned. The researchers analyzed the distribution of reported p values in three well-regarded journals in experimental psychology and described the pattern.

Here's one figure from the paper.

The solid line represents the expected distribution of p values. This was calculated from some theoretical statistical work.

…some theoretical papers offer insight into a likely distribution. Sellke, Bayarri, and Berger (2001) simulated p value distributions for various hypothetical effects and found that smaller p values were more likely than larger ones. Cumming (2008) likewise simulated large numbers of experiments so as to observe the various expected distributions of p.

The circles represent the actual distribution of p values in the published papers. Remember, 0.05 is the arbitrarily determined standard for significance; you don't get accepted for publication if your observations don't rise to that level.

Notice that unusual and gigantic hump in the distribution just below 0.05? Uh-oh.

I repeat, uh-oh. That looks like about half the papers that report p values just under 0.05 may have benefited from a little 'adjustment'.

What that implies is that investigators whose work reaches only marginal statistical significance are scrambling to nudge their numbers below the 0.05 level. It's not necessarily likely that they're actually making up data, but there could be a sneakier bias: oh, we almost meet the criterion, let's add a few more subjects and see if we can get it there. Oh, those data points are weird outliers, let's throw them out. Oh, our initial parameter of interest didn't meet the criterion, but this other incidental observation did, so let's report one and not bother with the other.
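To see why "let's add a few more subjects" matters, here is a minimal simulation sketch. Everything in it is an assumption made for illustration, not something taken from the paper: a two-group comparison with no true difference, known unit variance (so a simple z-test applies), 20 subjects per group to start, and a single top-up of 10 more per group whenever the first p value lands between 0.05 and 0.10.

```python
import math
import random


def z_test_p(a, b):
    """Two-sided p value for a difference in means, assuming known unit variance."""
    n_a, n_b = len(a), len(b)
    diff = sum(a) / n_a - sum(b) / n_b
    z = diff / math.sqrt(1.0 / n_a + 1.0 / n_b)
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_experiments=20000, n_start=20, n_extra=10, seed=1):
    """Return (honest_rate, topped_up_rate) of p < 0.05 when the null is true."""
    rng = random.Random(seed)
    honest = topped_up = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: there is no real effect.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_start)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_start)]
        p = z_test_p(a, b)
        if p < 0.05:
            honest += 1
            topped_up += 1
        elif p < 0.10:
            # "We almost meet the criterion, let's add a few more subjects."
            a += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            b += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            if z_test_p(a, b) < 0.05:
                topped_up += 1
    return honest / n_experiments, topped_up / n_experiments


honest_rate, topped_up_rate = simulate()
print(f"fixed sample size: {honest_rate:.3f}")
print(f"with one top-up:   {topped_up_rate:.3f}")
```

The first number should sit near the nominal 5%. The second can only be equal or higher, because every topped-up experiment gets a second chance to cross the line, and no data were made up along the way.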

But what it really means is that you should not trust published studies that only have marginal statistical significance. They may have been tweaked just a little bit to make them publishable. And that means that publication standards may be biasing the data.

Masicampo EJ, Lalande DR (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology. PMID: 22853650

" Note that there’s nothing really special about 0.05; it’s just a commonly chosen dividing line" ....

"But what it really means is that you should not trust published studies that only have marginal statistical significance."

You may want to reconsider the use of "arbitrary" in the first sentence if you're going to follow it up with such pedantry. This is a problem of the human brain preferring black-and-white decisions. Being slightly on one side or the other of the 0.05 cutoff should not determine the value or trustworthiness of the study.

@Luke: I think you've missed the point.

The use of the specific value of P<0.05 to define "statistically significant" *IS* arbitrary. It is a nice round number, and just as arbitrary as the "5-sigma" (equivalent to P<0.0000005733) threshold used in my field of particle physics.
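As a quick check on the arithmetic, here is a small sketch that converts a sigma level into a p value. It assumes a normal distribution and a two-sided tail, which is the convention that reproduces the figure quoted above; a one-sided convention would give half that value. The function name is made up for this example.

```python
import math


def two_sided_p(n_sigma):
    """Probability that a normal variable lands more than n_sigma from its mean, in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))


print(f"5 sigma    -> p = {two_sided_p(5.0):.4e}")
print(f"1.96 sigma -> p = {two_sided_p(1.96):.4f}")
```

On the same scale, the familiar 0.05 cutoff corresponds to roughly 1.96 sigma, which shows how much stricter the particle physics convention is.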

If the literature as a whole contained a distribution of reported P values which was consistent with what you would expect from true randomness (i.e., our measurements are always honest estimators of some underlying true value, and the P values reported reflect those estimators), then your own last statement would be true. Any arbitrary cut would be equivalent to any other arbitrary cut. That theoretical distribution is what is shown by the solid curve in the figure.

But PZ Meyer's point is that we do *NOT* see a distribution consistent with honest estimators. Rather, what we see is a very narrow, and not statistically consistent, pileup of P values just at (in fact, just below, so on the "good side" of) the _arbitrary_ cut-off used by the scientific community.

Since, as you agree, the P<0.05 value of cutoff to define "significant" is arbitrary, that non-statistical pileup of reported P-values cannot be a reflection of the actual behaviour of measurements in reality. Instead, it is an indication of bias. I mean, as PZ Meyers means, "bias" in the proper statistical sense, that the reported measurement may not be an "honest estimator" of an underlying true value.

There is, obviously, no way to know in any particular case, whether a given report with "P<0.049" or "P<0.043" is biased or not. But the pileup of values just on the edge of what is considered in the community as "interesting enough to publish" should lead the interested reader to study the paper with a bit more scrutiny.

I would point out, as a particle physicist, that it would not surprise me in the least to find a similar pileup of papers reporting "5.1" or "5.4" sigma results, given the similarly arbitrary, and similarly important standard of "5-sigma" as defining a "discovery" (as opposed to just "evidence for"). I don't know whether such a meta-analysis has been done.

By Michael Kelsey (not verified) on 13 Aug 2012 #permalink

Michael @2052: I would say, based on that chart, that a reported p = 0.043 is probably honest (in the sense that if it is in error, it is an honest error and not a reflection of experimenter bias). The circles in the chart are showing p values in bins of width 0.005, and it is only the 0.045 < p < 0.05 bin that shows a statistically significant excess (the other points look like reasonable scatter about the expected curve). But some of the results that fall into that bin are indeed suspect. I agree that checking for an excess of particle physics results with significance just above 5σ would be interesting. Particle physics uses the more rigorous threshold in part because they can (millions of events to work with), while psychology experiments that satisfy more rigorous tests than p < 0.05 are hard to design and execute (sample size being one limiting factor).

Two other remarks:
(1) From the lack of excess points just above p = 0.05, I see no evidence that people are trying to "disprove" hypotheses by shading p-values upward. You can get a publication from disproving an established hypothesis as well as by establishing a hypothesis.
(2) Remember that even if the p value is honestly obtained, it may still be a false positive. When there should be no correlation, you have a 5% chance of getting p < 0.05.

By Eric Lund (not verified) on 13 Aug 2012 #permalink

#1: " Being slightly on one side or the other of the 0.05 cutoff should not determine the value or trustworthiness of the study."

It doesn't completely determine it but it certainly is evidence. Honest, sound research should deliver roughly the same number of results at p=0.049 as at 0.051. If we find that 'just under' values are much more common than 'just over', then it suggests the existence of a mechanism that generates spurious 'just under' results; applying Bayes' Theorem then tells us that a 'just under' result is quite likely to be dodgy.

#3: Yes, in theory you can get a publication by disproving an established hypothesis, but in general there's a well-known bias towards publication of positive results. Finishing a paper is a lot of work, and researchers and publishers are only human; it's hard to maintain enthusiasm for writing up a negative finding when you could be out exploring things that might give a positive.

By Geoffrey Brent (not verified) on 13 Aug 2012 #permalink

Oh come on now, people? Isn't this obvious? You are far more likely to write a paper as soon as you devise an experimental set-up capable of producing p<0.05 results. If the number is above 0.05, you are unlikely to publish. Later research might refine your experimental technique to bring p lower.

The bias is not in the experiments, but in the biased selection criteria used in this study for selection of results, namely, that those results have been published.

I conclude that my hypothesis is true with a probability of 95%. :)

By Paul Holt (not verified) on 13 Aug 2012 #permalink

What this observation doesn't account for is that widely reported statistical feature, the "trend towards statistical significance", which is conference-speak for "not statistically significant"...

I very much doubt the problem is confined to the fluffy world of psychology :-)

I agree with Holt: many studies with results over 0.05 will simply not get published (not worth the trouble), so we should expect more studies under 0.05, or just under 0.05, than in the rest of the curve.

Only once that factor is accounted for can we evaluate whether we have more 0.049 results than we ought to have.

I'm tempted to quote "the closer you get to humans, the worse the science gets", but I second the call for a meta-analysis of results in particle physics and the like.

Meyer's

Myers'.

When there should be no correlation, you have a 5% chance of getting p < 0.05.

In other words, if you throw 100 tests at such a situation, on average 5 of them will find p < 0.05. Corrections need to be done for multiple statistical tests of the same hypothesis; such corrections exist, but some authors and some reviewers may not know that.
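The point about multiple tests can be put in numbers with a short sketch. It assumes the tests are independent and that no real effect exists in any of them. The Bonferroni correction shown here is only one of the corrections available, chosen because it is the simplest; the comment above does not say which one it has in mind. Both function names are made up for this example.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


for n in (1, 12, 100):
    print(n, round(familywise_error(n), 3), bonferroni_threshold(n))
```

With a dozen independent tests the chance of at least one spurious "significant" result is already close to a coin flip, and with 100 tests it is a near certainty.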

By David Marjanović (not verified) on 14 Aug 2012 #permalink

It is commonplace, in quack research, to test a dozen variables during the same experiment, and if ANY of the variables reaches the magik p<0.05, then you claim that your nostrum has an effect.
For example, you were testing a Royal Jelly and eucalyptus preparation for efficacy against asthma, and you measure that it improved toenail growth with p<0.05. Success! Your asthma medication WORKS!

@Michael_Kelsey I get the point, I was simply cautioning PZ not to throw out the baby with bathwater as he is sometimes wont to do.

We should question every result, but unless you're willing to assume that marginal significance equates with fraud and therefore completely undermines the integrity of the data, I just don't see how this study in question suddenly means that "you should not trust published studies that only have marginal statistical significance".

If someone "massages" their dataset to move a marginally significant result to one side or the other, it is still roughly the same result. Why? Because the p-value cutoff is arbitrary!

This is a serious systematic problem, but the only real solution I see is to NOT reject studies for publication simply because of marginal significance. Why not? Because the p-value cutoff is arbitrary!

It's an arbitrary distinction that has resulted in a non-arbitrary publication standard -- and as we see here also results in publication bias or at worst fraud.

For some of you the solution seems to be to just push the arbitrary line even further by discrediting studies with marginal significance!

It's no wonder that we only have two political parties -- anything with more than two choices is too complex and time-consuming to deal with, so we find a way to reduce it to a binary choice problem.

And no, I am not some disgruntled academic with marginally significant, unpublished results. Seriously. Thanks for asking though.

We should question every result, but unless you’re willing to assume that marginal significance equates with fraud and therefore completely undermines the integrity of the data, I just don’t see how this study in question suddenly means that “you should not trust published studies that only have marginal statistical significance”.

You don't need to assume it's fraud; you only need to assume it's bias.

By David Marjanović (not verified) on 15 Aug 2012 #permalink

@David You're still missing my point. Marginally significant results are not worthless. The line is arbitrary, being on one side of it or the other does not tell you the practical significance of the results. Therefore, it comes down to the degree of bias you want to assume for these "biased" results -- which is basically unknowable. If you think it's a slight bias then I don't think it changes much -- it's an arbitrary line anyways. If you think that this suggests greater fraud, then as always with fraud we're basically screwed.

_Arthur...
Doing multiple comparisons/experiments and reporting only the significant ones is a big no-no. I'm not saying it doesn't happen, but we do have well established methods which adjust downward what p-value is required to be "significant" when doing multiple comparisons.

I did some work with a DNA tiling array... literally millions of multiple comparisons. Fortunately, there are pretty cool methods for determining a 'q-value' (minimum false discovery rate).
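For readers who have not met false discovery rate methods, here is a minimal sketch of the Benjamini-Hochberg adjustment, a simpler relative of the q-value. It is offered as an illustration of the general idea and is not necessarily the method used on the tiling array data; the input p values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so each adjusted value never
    # exceeds the adjusted value of any larger p value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values, purely for illustration.
example = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example))
```

Calling a result significant when its adjusted value is below 0.05 aims to keep the expected share of false discoveries among the reported hits at about 5%, rather than controlling the chance of any false positive at all.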

I've also seen cases where a P-value of 0.9 is reached and the results are dismissed as if there's no correlation at all.

@Jay:
A p-value of .9 implies that the results depart from null by chance. Why wouldn't such results be dismissed?

I should have said "most likely depart from null by chance."

PZ: On what was the theoretical curve based? Was it an estimate of the likelihood of publication at a particular p-value? In short, is the monotonically decreasing curve an estimate of the file-drawer problem effect? Or is it an estimate of the increased likelihood of a study being performed when the likelihood of finding an effect increases?

What's interesting is the lack of a compensating dip just before .05, which one might predict if a displacing shift were occurring.

@ReasJack:

On what was the theoretical curve based?

Contrary to what PZ wrote, the solid line in the curve was not calculated from theoretical work. Rather, the paper clearly states that it is a fitted exponential regression curve. The authors of the original article state that the fitted curve is consistent with theoretical papers that have shown that the distribution of p-values should be exponential.
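To make "fitted exponential regression curve" concrete, here is a rough sketch of how such a fit and the resulting excess could be computed. It is not the authors' procedure: the counts are invented, the bin width of 0.005 is borrowed from an earlier comment, and leaving the suspect bin out of the fit is my own assumption. The bump is set to a doubling only to mirror the "about half" remark in the post.

```python
import math


def fit_exponential(xs, ys):
    """Least-squares fit of y = a * exp(b * x), done as a straight line in log(y)."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_l - b * mean_x)
    return a, b


# Made-up bin midpoints and counts (NOT the paper's data): a smooth
# exponential decline plus an inflated count in the 0.045-0.050 bin.
midpoints = [0.0125 + 0.005 * k for k in range(8)]  # 0.0125 ... 0.0475
counts = [300.0 * math.exp(-25.0 * x) for x in midpoints]
counts[-1] *= 2.0  # hypothetical bump just under .05

# Fit on every bin except the suspect one, then extrapolate into it.
a, b = fit_exponential(midpoints[:-1], counts[:-1])
expected = a * math.exp(b * midpoints[-1])
excess_fraction = (counts[-1] - expected) / counts[-1]
print(f"expected {expected:.1f}, observed {counts[-1]:.1f}, excess {excess_fraction:.0%}")
```

If the observed count in the last bin is twice what the curve predicts, then half of the results in that bin are surplus relative to the fit, although the fit alone cannot say which ones.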

The consistency is strange. Experimental psychology journals have institutionalized publication bias in that they generally require the p-value to be less than .05 to accept a paper. Therefore, the p-values above .05 in the figure must have primarily been reported in papers which also reported a p-value below .05. In contrast, no such publication bias was assumed in the theoretical papers. It is unclear, then, why the shape of the distribution in the present paper would be consistent with the theoretical distribution.

What’s interesting is the lack of a compensating dip just before .05, which one might predict if a displacing shift were occurring.

Since the p-values above .05 in the figure must have come primarily from papers that also reported one or more of the p-values below .05 in the figure, it's unclear (to me, anyway) why the curve has the shape it does. In the absence of the .05 cutoff for publication, I agree that we should see a dip just above .05. However, the figure only shows p-values up to .10, and the entire interval from .05 to .10 is likely to appear "just above .05" to a researcher. P-values in this interval, which are often termed "marginally significant" or "trending toward significance," could motivate a researcher to take statistically questionable steps to produce a smaller p-value.

That said, I'm not convinced that the excess of p-values just below .05 is due to mere nudging of borderline findings over the line. As Simmons et al (2011) showed, by exercising "researcher degrees of freedom," researchers can produce a significant p-value for any false hypothesis. And Bem (2011), researching pre-cognition, a hypothesis that is almost certainly false, was able to come up with something on the order of 20 significant p-values across nine experiments. Thus it seems possible that the source of the observed excess significant p-values could be studies whose p-values should have been any value greater than .05.
