Pharyngula

Live by statistics, die by statistics

There is a magic and arbitrary line in ordinary statistical testing: the p level of 0.05. What that basically means is that if the p level of a comparison between two distributions is less than 0.05, then results at least as extreme as yours would turn up by accident less than 5% of the time if there were no real difference. We’ll often say that having p<0.05 means your result is statistically significant. Note that there’s nothing really special about 0.05; it’s just a commonly chosen dividing line.

Now a paper has come out that ought to make some psychologists, who use that p value criterion a lot in their work, feel a little concerned. The researchers analyzed the distribution of reported p values in 3 well-regarded journals in experimental psychology, and described the pattern.

Here’s one figure from the paper.

The solid line represents the expected distribution of p values. This was calculated from some theoretical statistical work.

…some theoretical papers offer insight into a likely distribution. Sellke, Bayarri, and Berger (2001) simulated p value distributions for various hypothetical effects and found that smaller p values were more likely than larger ones. Cumming (2008) likewise simulated large numbers of experiments so as to observe the various expected distributions of p.
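To get a feel for what such simulations show, here is a minimal sketch, not the method used in either cited paper. It assumes a two-sided z-test on normally distributed data with known standard deviation 1, and the function names, sample size (20 subjects) and effect size (0.5) are all made up for illustration. It prints the share of p values falling in each 0.01-wide bin below 0.05, with and without a real effect.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p value for H0: mean == 0, assuming sigma is known."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

def simulate_p_values(true_mean, n_subjects=20, n_experiments=20000, seed=1):
    """Run many simulated experiments and collect one p value from each."""
    rng = random.Random(seed)
    return [
        two_sided_p([rng.gauss(true_mean, 1.0) for _ in range(n_subjects)])
        for _ in range(n_experiments)
    ]

def fraction_in(p_values, lo, hi):
    """Share of p values that fall in the half-open interval [lo, hi)."""
    return sum(lo <= p < hi for p in p_values) / len(p_values)

null_p = simulate_p_values(true_mean=0.0)    # no real effect (assumed)
effect_p = simulate_p_values(true_mean=0.5)  # hypothetical real effect

for label, ps in (("no effect", null_p), ("real effect", effect_p)):
    bins = [fraction_in(ps, k * 0.01, (k + 1) * 0.01) for k in range(5)]
    print(label, [round(b, 3) for b in bins])
```

With no effect the bins should come out roughly equal, since p values are uniformly distributed under the null. With a real effect the smallest bin should dominate and the shares should fall off smoothly toward 0.05, with no hump at the end.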

The circles represent the actual distribution of p values in the published papers. Remember, 0.05 is the arbitrarily determined standard for significance; you don’t get accepted for publication if your observations don’t rise to that level.

Notice that unusual and gigantic hump in the distribution just below 0.05? Uh-oh.

I repeat, uh-oh. That looks like about half the papers that report p values just under 0.05 may have benefited from a little ‘adjustment’.

What that implies is that investigators whose work reaches only marginal statistical significance are scrambling to nudge their numbers below the 0.05 level. It’s not necessarily likely that they’re actually making up data, but there could be a sneakier bias: oh, we almost meet the criterion, let’s add a few more subjects and see if we can get it there. Oh, those data points are weird outliers, let’s throw them out. Oh, our initial parameter of interest didn’t meet the criterion, but this other incidental observation did, so let’s report one and not bother with the other.
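The "let's add a few more subjects" habit is easy to simulate. The sketch below is an illustration under stated assumptions, not a reconstruction of anyone's actual practice: the true effect is exactly zero, the test is a two-sided z-test with known standard deviation 1, and the starting sample (20), step (5) and cap (50) are invented. It compares the false positive rate of a fixed-size study with one that keeps adding subjects and re-testing until p drops below 0.05.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_from_sum(total, n):
    """Two-sided z-test p value for H0: mean == 0, given the sum of n unit-variance draws."""
    z = total / n ** 0.5
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

def run_study(rng, start_n=20, step=5, max_n=50, peek=True):
    """Return True if the study ends with p < 0.05 even though the true effect is zero."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(start_n))
    n = start_n
    while True:
        if p_from_sum(total, n) < 0.05:
            return True   # "significant": stop and write it up
        if not peek or n >= max_n:
            return False  # honest design, or out of subjects
        # Almost there? Add a few more subjects and test again.
        total += sum(rng.gauss(0.0, 1.0) for _ in range(step))
        n += step

def false_positive_rate(peek, n_studies=20000, seed=2):
    rng = random.Random(seed)
    return sum(run_study(rng, peek=peek) for _ in range(n_studies)) / n_studies

honest = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print(f"fixed sample size:           {honest:.3f}")
print(f"add subjects until p < 0.05: {peeking:.3f}")
```

The fixed-size design should land near the nominal 5%, while the peeking design should come out noticeably higher, even though no data point was invented or altered.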

But what it really means is that you should not trust published studies that only have marginal statistical significance. They may have been tweaked just a little bit to make them publishable. And that means that publication standards may be biasing the data.


Masicampo EJ, Lalande DR (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology. PMID: 22853650

Comments

  1. #1 Luke
    August 13, 2012

    “Note that there’s nothing really special about 0.05; it’s just a commonly chosen dividing line” …

    “But what it really means is that you should not trust published studies that only have marginal statistical significance.”

    You may want to reconsider the use of arbitrary in the first sentence if you’re going to follow it up with such pedantry. This is a problem of the human brain preferentially processing black and white decisions. Being slightly on one side or the other of the 0.05 cutoff should not determine the value or trustworthiness of the study.

  2. #2 Michael Kelsey
    SLAC National Accelerator Laboratory
    August 13, 2012

    @Luke: I think you’ve missed the point.

    The use of the specific value of P<0.05 to define "statistically significant" *IS* arbitrary. It is a nice round number, and just as arbitrary as the "5-sigma" (equivalent to P<0.0000005733) threshold used in my field of particle physics.

    If the literature as a whole contained a distribution of reported P values which was consistent with what you would expect from true randomness (i.e., our measurements are always honest estimators of some underlying true value, and the P values reported reflect those estimators), then your own last statement would be true. Any arbitrary cut would be equivalent to any other arbitrary cut. That theoretical distribution is what is shown by the solid curve in the figure.

    But, PZ Meyer's point is that we do *NOT* see a distribution consistent with honest estimators. Rather, what we see is a very narrow, and not statistically consistent, pileup of P values just at (in fact, just below, so on the "good side" of) the _arbitrary_ cut-off used by the scientific community.

    Since, as you agree, the P<0.05 value of cutoff to define "significant" is arbitrary, that non-statistical pileup of reported P-values cannot be a reflection of the actual behaviour of measurements in reality. Instead, it is an indication of bias. I mean, as PZ Meyers means, "bias" in the proper statistical sense, that the reported measurement may not be an "honest estimator" of an underlying true value.

    There is, obviously, no way to know in any particular case, whether a given report with "P<0.049" or "P<0.043" is biased or not. But the pileup of values just on the edge of what is considered in the community as "interesting enough to publish" should lead the interested reader to study the paper with a bit more scrutiny.

    I would point out, as a particle physicist, that it would not surprise me in the least to find a similar pileup of papers reporting "5.1" or "5.4" sigma results, given the similarly arbitrary, and similarly important standard of "5-sigma" as defining a "discovery" (as opposed to just "evidence for"). I don't know whether such a meta-analysis has been done.
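For readers who want to check the "5-sigma" figure quoted in this comment, here is a small sketch of the conversion from a number of standard deviations to a tail probability of the normal distribution. The function names are made up for illustration. The value given in the comment corresponds to the two-sided tail; the one-sided value is half of it.

```python
import math

def two_sided_p_from_sigma(n_sigma):
    """Probability that a standard normal value lies more than n_sigma from zero, in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))

def one_sided_p_from_sigma(n_sigma):
    """Probability that a standard normal value exceeds +n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

print(f"5 sigma, two-sided: {two_sided_p_from_sigma(5.0):.4e}")
print(f"5 sigma, one-sided: {one_sided_p_from_sigma(5.0):.4e}")
print(f"1.96 sigma, two-sided: {two_sided_p_from_sigma(1.96):.4f}")
```

The last line shows the same conversion for the familiar 0.05 threshold, which sits at about 1.96 sigma on a two-sided test.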

  3. #3 Eric Lund
    August 13, 2012

    Michael @2: I would say, based on that chart, that a reported p = 0.043 is probably honest (in the sense that if it is in error, it is an honest error and not a reflection of experimenter bias). The circles in the chart are showing p values in bins of width 0.005, and it is only the 0.045 < p < 0.05 bin that shows a statistically significant excess (the other points look like reasonable scatter about the expected curve). But some of the results that fall into that bin are indeed suspect. I agree that checking for an excess of particle physics results with significance just above 5σ would be interesting. Particle physics uses the more rigorous threshold in part because they can (millions of events to work with), while psychology experiments that satisfy more rigorous tests than p < 0.05 are hard to design and execute (sample size being one limiting factor).

    Two other remarks:
    (1) From the lack of excess points just above p = 0.05, I see no evidence that people are trying to “disprove” hypotheses by shading p-values upward. You can get a publication from disproving an established hypothesis as well as by establishing a hypothesis.
    (2) Remember that even if the p value is honestly obtained, it may still be a false positive. When there should be no correlation, you have a 5% chance of getting p < 0.05.

  4. #4 Geoffrey Brent
    August 14, 2012

    #1: “Being slightly on one side or the other of the 0.05 cutoff should not determine the value or trustworthiness of the study.”

    It doesn’t completely determine it but it certainly is evidence. Honest, sound research should deliver roughly the same number of results at p=0.049 as at 0.051. If we find that ‘just under’ values are much more common than ‘just over’, then it suggests the existence of a mechanism that generates spurious ‘just under’ results; applying Bayes’ Theorem then tells us that a ‘just under’ result is quite likely to be dodgy.

    #3: Yes, in theory you can get a publication by disproving an established hypothesis, but in general there’s a well-known bias towards publication of positive results. Finishing a paper is a lot of work, and researchers and publishers are only human; it’s hard to maintain enthusiasm for writing up a negative finding when you could be out exploring things that might give a positive.

  5. #5 Paul Holt
    Australia
    August 14, 2012

    Oh come on now, people? Isn’t this obvious? You are far more likely to write a paper as soon as you devise an experimental set-up capable of producing p<0.05 results. If the number is above 0.05, you are unlikely to publish. Later research might refine your experimental technique to bring p lower.

    The bias is not in the experiments, but in the biased selection criteria used in this study for selection of results, namely, that those results have been published.

    I conclude that my hypothesis is true with a probability of 95%. :)

  6. #6 Shane
    Belfast
    August 14, 2012

    What this observation doesn’t account for is that widely reported statistical feature, the “trend towards statistical significance”, which is conference-speak for “not statistically significant”…

    I very much doubt the problem is confined to the fluffy world of psychology :-)

  7. #7 _Arthur
    August 14, 2012

    I agree with Holt, many studies with results over 0.05 will simply not get published (not worth the trouble), so it is to be expected to have more studies under 0.05 or just under 0.05 than in the rest of the curve.

    Only once that factor is accounted for can we evaluate whether we have more 0.049 results than we ought to have.

  8. #8 David Marjanović
    Museum für Naturkunde, Berlin
    August 14, 2012

    I’m tempted to quote “the closer you get to humans, the worse the science gets”, but I second the call for a meta-analysis of results in particle physics and the like.

    Meyer’s

    Myers’.

    When there should be no correlation, you have a 5% chance of getting p < 0.05.

    In other words, if you throw 100 tests at such a situation, on average 5 of them will find p < 0.05. Corrections need to be done for multiple statistical tests of the same hypothesis; such corrections exist, but some authors and some reviewers may not know that.

  9. #9 _Arthur
    August 14, 2012

    It is commonplace, in quack research, to test a dozen variables during the same experiment, and if ANY of the variables reaches the magik p<0.05, then you claim that your nostrum has an effect.
    For example, you were testing a Royal Jelly and eucalyptus preparation for efficacy against asthma, and you measure that it improved toenail growth with p<0.05. Success! Your asthma medication WORKS!
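The arithmetic behind the "dozen variables" problem, and behind the 100-tests remark in the previous comment, can be sketched in a few lines. This assumes the tests are independent and that no real effect exists for any of them, which is a simplification. The function names are invented for illustration, and Bonferroni is only one of several corrections in use.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def expected_false_positives(n_tests, alpha=0.05):
    """Average number of null tests that give p < alpha."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests

for m in (1, 12, 100):
    print(
        f"{m:>3} tests: P(at least one p < 0.05) = {familywise_error_rate(m):.3f}, "
        f"expected false positives = {expected_false_positives(m):.1f}, "
        f"Bonferroni cutoff = {bonferroni_threshold(m):.5f}"
    )
```

With a dozen outcome variables and a useless nostrum, the chance of at least one "significant" result is close to a coin flip.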

  10. #10 Luke
    August 14, 2012

    @Michael_Kelsey I get the point, I was simply cautioning PZ not to throw out the baby with bathwater as he is sometimes wont to do.

    We should question every result, but unless you’re willing to assume that marginal significance equates with fraud and therefore completely undermines the integrity of the data, I just don’t see how this study in question suddenly means that “you should not trust published studies that only have marginal statistical significance”.

    If someone “massages” their dataset to move a marginally significant result to one side or the other, it is still roughly the same result. Why? Because the p-value cutoff is arbitrary!

    This is a serious systematic problem, but the only real solution I see is to NOT reject studies for publication simply because of marginal significance. Why not? Because the p-value cutoff is arbitrary!

    It’s an arbitrary distinction that has resulted in a non-arbitrary publication standard — and as we see here also results in publication bias or at worst fraud.

    For some of you the solution seems to be to just push the arbitrary line even further by discrediting studies with marginal significance!

    It’s no wonder that we only have two political parties — anything with more than two choices is too complex and time-consuming to deal with so we find a way to reduce it to binary choice problem.

    And no, I am not some disgruntled academic with marginally significant, unpublished results. Seriously. Thanks for asking though.

  11. #11 rditmars
    August 14, 2012

    Here is a review of this subject, worth having around for the title alone.

    http://ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf

    The practical significance of a measured effect size is more important than its statistical significance.

  12. #12 David Marjanović
    Museum für Naturkunde, Berlin
    August 15, 2012

    We should question every result, but unless you’re willing to assume that marginal significance equates with fraud and therefore completely undermines the integrity of the data, I just don’t see how this study in question suddenly means that “you should not trust published studies that only have marginal statistical significance”.

    You don’t need to assume it’s fraud; you only need to assume it’s bias.

  13. #13 Luke
    August 16, 2012

    @David You’re still missing my point. Marginally significant results are not worthless. The line is arbitrary, being on one side of it or the other does not tell you the practical significance of the results. Therefore, it comes down to the degree of bias you want to assume for these “biased” results — which is basically unknowable. If you think it’s a slight bias then I don’t think it changes much — it’s an arbitrary line anyways. If you think that this suggests greater fraud, then as always with fraud we’re basically screwed.

  14. #14 travc
    August 19, 2012

    _Arthur…
    Doing multiple comparisons/experiments and reporting only the significant ones is a big no-no. I’m not saying it doesn’t happen, but we do have well established methods which adjust downward what p-value is required to be “significant” when doing multiple comparisons.

    I did some work with a DNA tiling array… literally millions of multiple comparisons. Fortunately, there are pretty cool methods for determining a ‘q-value’ (minimum false discovery rate).
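As a rough illustration of the false discovery rate idea, here is a sketch of the Benjamini-Hochberg adjustment, one well-known procedure of this kind. It is not necessarily the method this commenter used, and the adjusted values it produces are related to, but not identical with, every definition of a q-value. The function name and the four example p values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the original order."""
    m = len(p_values)
    # Indices of the p values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

example = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p values
adjusted = benjamini_hochberg(example)
for raw, adj in zip(example, adjusted):
    print(f"raw p = {raw:.3f}  adjusted = {adj:.3f}")
```

Calling a result significant when its adjusted value is below 0.05 aims to keep the expected share of false discoveries among the reported hits at or below 5%, rather than guarding against any single false positive.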

  15. #15 Jay
    August 19, 2012

    I’ve also seen cases where a P-value of 0.9 is reached and the results are dismissed as if there’s no correlation at all.

  16. #16 jt512 
    August 19, 2012

    @Jay:
    A p-value of .9 implies that the results depart from null by chance. Why wouldn’t such results be dismissed?

  17. #17 jt512 
    August 19, 2012

    I should have said “most likely depart from null by chance.”

  18. #18 ReasJack
    Cleveland
    August 21, 2012

    PZ. On what was the theoretical curve based? Was it an estimate of likelihood of publication at a particular p-value? In short is the monotonically decreasing curve an estimate of the file-drawer problem effect? Or is it an estimate of the increased likelihood of a study being performed when the likelihood of finding an effect increases?

    What’s interesting is the lack of a compensating dip just before .05, which one might predict if a displacing shift were occurring.

  19. #19 jt512
    August 21, 2012

    @ReasJack:

    On what was the theoretical curve based?

    Contrary to what PZ wrote, the solid line in the curve was not calculated from theoretical work. Rather, the paper clearly states that it is a fitted exponential regression curve. The authors of the original article state that the fitted curve is consistent with theoretical papers that have shown that the distribution of p-values should be exponential.

    The consistency is strange. Experimental psychology journals have institutionalized publication bias in that they generally require the p-value to be less than .05 to accept a paper. Therefore, the p-values above .05 in the figure must have primarily been reported in papers which also reported a p-value below .05. In contrast, no such publication bias was assumed in the theoretical papers. It is unclear, then, why the shape of the distribution in the present paper would be consistent with the theoretical distribution.

    What’s interesting is the lack of a compensating dip just before .05, which one might predict if a displacing shift were occurring.

    Since the p-values above .05 in the figure must have come primarily from papers that reported one or more of the p-values below .05 in the figure, it’s unclear (to me, anyway) why the curve has the shape it does. In the absence of the .05 cutoff for publication, I agree that we should see a dip just above .05. However, the figure only shows p-values up to .10, and the entire interval from .05 to .10 is likely to appear “just above .05” to a researcher. P-values in this interval, which are often termed “marginally significant” or “trending toward significance,” could motivate a researcher to take statistically questionable steps to produce a smaller p-value.

    That said, I’m not convinced that the excess of p-values just below .05 is due to mere nudging of borderline findings over the line. As Simmons et al (2011) showed, by exercising “researcher degrees of freedom,” researchers can produce a significant p-value for any false hypothesis. And Bem (2011), researching pre-cognition, a hypothesis that is almost certainly false, was able to come up with something on the order of 20 significant p-values across nine experiments. Thus it seems possible that the source of the observed excess significant p-values could be studies whose p-values should have been any value greater than .05.
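The fitted exponential curve mentioned earlier in this comment can be illustrated with a small sketch. Everything here is an assumption made for the example: the counts are synthetic (generated from an exact exponential and then doubled in one bin), the bin layout simply copies the 0.005-wide bins described in the thread, and leaving the suspect bin out of the fit is a choice of this sketch, not something the paper is known to have done.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = ln(scale) + slope * x; returns (scale, slope)."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    slope = (
        sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    scale = math.exp(mean_log - slope * mean_x)
    return scale, slope

# Synthetic data: bin centres from 0.0125 to 0.0975 and made-up counts.
centres = [0.0125 + 0.005 * k for k in range(18)]
counts = [round(400 * math.exp(-15 * x)) for x in centres]
suspect = 7            # the 0.045 to 0.050 bin
counts[suspect] *= 2   # plant an artificial hump just below .05

# Fit the curve to every bin except the suspect one, then compare.
keep = [k for k in range(len(centres)) if k != suspect]
scale, slope = fit_exponential([centres[k] for k in keep], [counts[k] for k in keep])
expected = scale * math.exp(slope * centres[suspect])
excess_ratio = counts[suspect] / expected

print(f"fitted slope: {slope:.2f}")
print(f"observed {counts[suspect]} vs expected {expected:.1f} (ratio {excess_ratio:.2f})")
```

A ratio near 2 in the planted bin is what "about half the results in that bin are surplus" would look like. On real data the conclusion depends on how well an exponential describes the other bins in the first place.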
