A new statistic begins to appear in journals: What the heck is a p-rep?


What is "significant" research? In most psychology journals, "significant" results are those measuring up to a difficult-to-understand statistical standard called a null-hypothesis significance test. This test, which seems embedded and timeless, actually has its origins in theoretical arguments less than a century old.

Today's gold standard of statistical significance is the p value, formalized by Ronald Fisher in the 1920s. Many people, even many active researchers, don't understand much about the p value other than that when it's less than .05, the research is usually considered significant. But what does p = .05 really mean? It doesn't mean there's a 5 percent error rate in our data, or that our results are 95 percent likely to be true. It means that if the null hypothesis is true, then repeating the study should get the result we found, or a more extreme result, 5 percent of the time.

Huh?

Understanding what all that means requires an understanding of the concept of a "null hypothesis," of experimental design, and of probability. And it's still expressing the results of the study in a negative way. Suppose my study finds that men choose to date women with attractive faces more often than they choose women with the traits they say they like (sense of humor, intelligence, and so on), and that these two statistics (the number of men choosing attractive women and the number of men choosing women with other desired traits) are found to be different, with a p value of .05.

What I'd really like to know is how likely it is that my results accurately represent the preferences of all men. But what the p value tells me is how often a study like mine would come up with my (incorrect) result if those two statistics really were not different in the general population. It's either a double-negative or a triple-negative; I'm not sure.
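To make that concrete, here's a quick simulation sketch of what the p value is actually counting. The numbers are invented for illustration (100 men, 60 of whom pick the attractive face, and a coin-flip null hypothesis); it is not meant to reproduce any real study.

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented numbers: 100 men each make one choice. Under the null
    # hypothesis, "attractive face" and "other desired traits" are
    # equally likely, so any observed gap is pure sampling noise.
    n_men = 100
    observed_attractive = 60                              # hypothetical result: 60 vs. 40
    observed_gap = abs(2 * observed_attractive - n_men)   # 20

    n_sims = 100_000
    chose_attractive = rng.binomial(n_men, 0.5, size=n_sims)
    gaps = np.abs(2 * chose_attractive - n_men)

    # The two-tailed p value is the fraction of null-hypothesis "worlds"
    # that produce a gap at least as large as the one we observed.
    p_value = np.mean(gaps >= observed_gap)
    print(f"simulated p = {p_value:.3f}")   # comes out near .06 -- close to the .05 in the example

The point of the sketch is only that the p value describes the behavior of imaginary repeated studies in which the null hypothesis is true; it says nothing directly about how likely my hypothesis itself is.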

Surely there's a more intuitive way to demonstrate the significance of a study. Peter Killeen believes he's found it, and the Association for Psychological Science (the APS -- not to be confused with the APA, the American Psychological Association) has adopted this new measure of significance in its highly respected journals. The measure is denoted p-rep, for the probability of replicating an effect.

Depending on certain characteristics of the group being studied, Fisher's p values can have very different implications. In some cases, especially when an effect is small, "significant" p values can be difficult to reproduce in follow-up studies. Killeen's p-rep avoids this problem by taking into account the statistical power of a study.

How does p-rep work? It's difficult to explain how the number is generated, but it's easy to argue that the resulting value is intuitively simpler to grasp than a traditional p value. Very roughly, p-rep gives an approximation of the probability that a particular result, repeated on a new sample, would be observed again.

Let's return to my original example. If my study finds that more men choose attractive women than choose intelligent women, with a p-rep of .917, I can say that more than 9 times out of 10, if I repeated the same study, more men would still choose attractive women. (Actually, that's overgeneralizing. p-rep is an average probability of replication, so in practice the actual probability might be a little different -- but it should be very close to p-rep.)

Killeen's p-rep is now beginning to appear in APS journal articles. It hasn't yet been adopted by other journals, but if you understand p values and you'd like to have a general sense of how p-rep works, this table, created by Geoff Cumming, may help.

[Table: Geoff Cumming's approximate correspondence between p values and p-rep values]

So given a typical population, a p of .05 is about like a p-rep of .917.
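For the curious, the correspondence in Cumming's table can be reproduced with a few lines of code. This is only a sketch of the standard conversion, assuming the test statistic is approximately normal and the imagined replication uses the same sample size; the function name p_to_prep is mine, not anything from the papers.

    from scipy.stats import norm

    def p_to_prep(p_two_tailed):
        """Approximate probability that an exact, same-size replication
        finds an effect in the same direction (Killeen-style estimate)."""
        z = norm.ppf(1 - p_two_tailed / 2)   # z score implied by the two-tailed p
        return norm.cdf(z / 2 ** 0.5)        # sqrt(2): sampling noise from both studies

    for p in (0.10, 0.05, 0.01, 0.001):
        print(f"p = {p}  ->  p-rep ~ {p_to_prep(p):.3f}")
    # p = .05 comes out at about .917, the figure quoted above.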

One problem with this new requirement for APS journals is that there is no longer a set standard for significance like p < .05: this is now a judgment call for peer reviewers to make. We'll be reviewing some articles that use this new standard in the coming weeks, so it will be interesting to see how p-rep is implemented in practice.

Killeen, P.R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16(5), 345-353.

Cumming, G. (2005). Understanding the average probability of replication. Psychological Science, 16(12), 1002-1004.


Actually you've misrepresented the meaning of a p value: it's the probability of finding the current result or a more extreme result if the null hypothesis is true. Or, to paraphrase your words:
It means that if the null hypothesis is true, then repeating the study should get the result we found OR A MORE EXTREME RESULT 5 percent of the time.

Please correct your article to avoid contributing to the confusion.

It's probably wise to add, "Given no bias" since statistical tests like this don't evaluate bias, which is systematic error, not random error.

I'm still a little confused about what p-rep is (and is not). If it incorporates estimates of effect size, then how can one establish a correspondence between p-values and p-rep values, as suggested by Cumming's table? The existence of such a relationship suggests that either effect size has little to do with the calculation of p-rep, its influence is relatively small, or effect size influences the p-value which in turn influences the value of p-rep. Whatever, it begs the question, "What does p-rep buy us, over-and-above what we know on the basis of a p-value?"

"How does prep work? It's difficult to explain how the number is generated, but it's easy to argue that the resulting value is intuitively simpler to grasp than a traditional p value. Very roughly, a prep gives an approximation of the probability that a particular result, repeated on a new sample, would be observed again."

This is a completely bogus argument! Also, I think that stating that the p-value is difficult to understand only speaks to your level of research and statistical knowledge! Then again, on the internet anyone with a blog can state anything and sound like a published author ...

the internet anyone with a blog can state anything and sound like a published author

You can, too, in the comments section.

I, however, am referring to the article by Killeen, where he makes essentially the same argument. That said, regardless of your publication record, I'd love to hear your case for why p-values are easier to understand than p-reps. I've made my case, and I don't see anything in your comment to refute it other than ad-hominem attacks.

Mmmm ... Alex's rather vitriolic post has me thinking. Is it the case that 'prep' indicates the probability of obtaining an effect in the same direction as that observed were the experiment to be replicated? That is, the likelihood of getting an effect in the observed direction (e.g. A > B), albeit of lesser or greater magnitude in terms of the actual effect size (i.e. the difference between A and B may be much smaller or much larger, but will be in the same direction)? That may help explain why it is possible to link p-values and prep values, although a tight correspondence suggests that using 'prep' over 'p' results in a benefit in terms of one's intuitions about the meaning of the value, rather than any statistical benefit per se.

I'd always just interpreted the p-value as meaning "the chance that the thing we observed is due to chance" or "the chance that this is just a statistical fluke".

Not quite as rigorous as the definitions given above, but I think they amount to the same thing, and people I speak with seem to find it intuitively satisfying.

The underlying logic of p-rep seems to be the same as that of meta-analysis, with the primary difference being that the former involves a priori replication analysis while the latter involves post-hoc replication analyses.

Specifically, it seems that the purpose of p-rep is to determine the chances that one will replicate the observed p value from a particular study given the study's parameters (e.g., sample size, effect size and variability). This is different from the p-value from any one study because it only tells you that for the particular sample used in the study, the research hypothesis tested was or was not supported based on an arbitrarily chosen probability cutoff (i.e., p = .05).

Meta-analysis seems to involve the same principle, only the analyses are undertaken AFTER studies have been conducted. Therefore, it seems that the p-rep statistic might be important either: (a) as a useful parameter for conducting meta-analyses; and/or (b) might make meta-analyses irrelevant if one can determine a priori what research hypotheses are practically (and not statistically) significant (i.e., they have a high probability of producing replicable, statistically significant results). Killeen states it best when he writes:

When replicability becomes the criterion, researchers can gauge the risks they face in pursuing a line of study: An assistant professor may choose paradigms in which p-rep is typically greater than .8, whereas a tenured risk taker may hope to reduce σ_δ² (the standard error of estimate of effect size) in a line of research having p-reps around .6. When replicability becomes the criterion, significance, shorn of its statistical duty, can once again become a synonym for the importance of a result, not for its improbability. (p. 351)

By Tony Jeremiah on 06 Sep 2007

TomR, I think this is what makes Prep potentially very useful. I'm willing to bet a lot of working scientists have the same mental approximation of the meaning of p as you (I know I do). The benefit of Prep over p is that the intuitive definition is actually correct for Prep rather than an approximation.

If Prep comes into wide use I would expect .917 to be the benchmark number for the foreseeable future, if nothing else since it'd make it easily comparable with papers using p at 0.05. Or it'd get rounded off, to get 0.9 which is close to the usual p value and intuitively easy (one in ten chance).

I had to look up the original paper to find a statement of the definition that made any sense. It turns out, if I'm not mistaken, that p_rep is the probability of replicating the result (i.e., getting a result in the same direction) *conditional on the effect size in the population being the same as the measured effect size*. That means Janne's comment above is mistaken: p_rep (or rather 1-p_rep), like p, is *not* the probability that the result is a statistical fluke, or the probability that the measured result is due to chance. To find *that* probability, you have to translate p-values to Bayes factors.

I don't know. I thought we were on our way beyond p's of any kind. What about effect size measures?

In psychology, we will always find a difference. It's only a matter of how accurately and how often you measure. If you do it right, you will get a ridiculously small p (or large p_rep). That tells us a lot about the replicability of those data and how much professionalism was involved in collecting them, but it still tells us very little about their significance (without the qualifier "statistical").

So yes, the p_rep concept may be easier to grasp, but will it make it easier to understand scientific data?

I'm eager to read the original article and the other responses to it in the December '05 issue of PsySci (not just Cumming's) .

Whenever thinking about Null-Hypothesis testing I have to remember my old Stats prof when he taught us about p-values, effect size, statistical power etc. He basically said that we should remember one thing: the Null-hypothesis (i.e. there is NO difference between the conditions) is never really true. No matter what manipulation you do in your independent variable, it will have some effect on your dependent variable, only your sample may not have the statistical power to distinguish such a small difference from randomness, thus emphasizing the importance of effect sizes and power analysis.
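The commenter's point is easy to demonstrate with a short simulation sketch. The 0.02-standard-deviation "effect" below is invented purely for illustration: with a large enough sample, a real but practically meaningless difference produces a vanishingly small p.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)

    # A true difference of 0.02 standard deviations: nonzero, but trivial.
    n = 200_000
    group_a = rng.normal(0.00, 1.0, n)
    group_b = rng.normal(0.02, 1.0, n)

    t_stat, p_val = ttest_ind(group_b, group_a)
    print(f"t = {t_stat:.2f}, p = {p_val:.2g}")   # p is tiny despite the trivial effect
    print(f"observed difference = {group_b.mean() - group_a.mean():.4f} SDs")

Which is exactly why effect sizes (and, arguably, p-rep's use of them) matter more than the p value on its own.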

I've been discussing this with a postdoc over lunch, and the more I think about it the less sense it makes to me.

(1) "an approximation of the probability that a particular result, repeated on a new sample, would be observed again." This cannot be as stated, because it is highly unlikely that any _particular_ value would be observed again, one needs ranges around it, and their size is an arbitrary "hidden" quantity.

(2) To replace a particular measure by another one with a one-to-one mapping (even if non-linear) seems rather ridiculous; it's nearly the same as Celsius vs. Fahrenheit (though a replacement there would make sense to me ;-)).

(3) The reasoning "no one understands p-values" seems particularly weak for scientists. And the real problems, like the multiple-testing fallacy and the need to look at effect sizes, are not solved at all.

To me it looks now like a ridiculous modernisation, like PC speak, e.g. subjects vs. participants...

Michael:

Good points. To respond:

1: By "result" we mean an observable difference -- e.g., more men choose attractive women than intelligent women. How many more we observe, of course, can vary from sample to sample.

2: Sorry, I may have been unclear about that table. It's not a one-to-one mapping. p-rep is a different statistic. It is only when certain characteristics of the population and sample are held constant that they can map on to each other.

3: You are right that many researchers understand p-values. However, it is true that many people, including many researchers, do not. The power of p-rep is both that it is more intuitive and that it *does* incorporate effect sizes.

I believe researchers should be able to understand the mathematical concepts that underlie their experiments.
Maybe some psychologists don't really know what they are doing, or why they are doing it exactly this way, especially when it comes to statistics, but then maybe they should not do research, and should instead apply their "knowledge" to real-world cases.
The p_rep sounds like something people would like to know, but it is impossible to calculate without a lot more assumptions and estimates than are needed for the traditional value.
Like something rather useless from a knowledge/science perspective. Why not stick to p for research and use an arbitrary p_rep value around 1-p as needed for the TV audience?

TomR: "I'd always just interpreted the p-value as meaning "the chance that the thing we observed is due to chance" or "the chance that this is just a statistical fluke".

Not quite as rigorous as the definitions given above, but I think they amount to the same thing, and people I speak with seem to find it intuitively satisfying."

Not only not quite as rigorous but COMPLETELY WRONG!!! Unfortunately also quite a common misinterpretation. Here is an example. You flip a coin three times and it comes up heads three times. The chance of this happening if it is a fair coin? .125 (.5 cubed). But does this mean it is a fair coin? (i.e., the three heads was "due to chance"?) NO. What if it was a coin that had heads on both sides (perfectly consistent with the evidence)?

The p-value is the probability of the result if the null was true and there was no bias. It doesn't say a thing about the case where the null is false or there is systematic error (bias).

This is a very, very serious mistake. What a statistically significant result does is make chance an unlikely explanation for the result you see. Lack of significance says NOTHING about whether the null hypothesis is actually true.

Revere's comments seem somewhat overstated. It is not "COMPLETELY WRONG" or "a very, very serious mistake" to consider a p-value as TomR does, in the context of an experiment involving null hypothesis significance testing.

Thinking about a p-value as "the chance that the thing we observed is due to chance" or "the chance that this is just a statistical fluke" is a convenient shortcut. Not rigorous, no, but close enough for an intuitive paraphrase. The standard probability example that revere gives of coin tossing isn't an appropriate example here because it's not in an experimental context.

Say we want to test the hypothesis that "this coin is a biased coin" (i.e., not fair, with a greater probability of either heads or tails on each toss).
* We toss the coin a certain number of times and record the results. 3 times (as in revere's example) is not a large enough sample, but we'll run with it for now with the result HHH.
* We then analyse the results using an appropriate statistical test. In this case the binomial test will give p=.25 as a 2-tailed value (which is what we need rather than the 1-tailed p=.125 because our hypothesis is not directional).
* This p-value (p=.25) is formally the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true (or the probability of a Type I error: mistakenly thinking the coin is biased when it is in fact fair).
* Assuming an alpha of .05, we would not reject the null hypothesis with a p=.25 and could not conclude the coin was biased. It is perfectly reasonable to paraphrase this by saying there is a 1-in-4 chance that the result is a fluke (i.e., that it is not due to a biased coin after all).

Let's take a larger sample.
* 20 coin tosses result in 20 heads. A binomial test gives a 2-tailed p=.000002.
* We would now reject the null hypothesis, and conclude that the coin is biased. Again, it would be perfectly ok to paraphrase the meaning of this p-value by saying that there is only a tiny chance (1 in 500,000) that the result came about through fluke rather than a genuinely biased coin.

I realise this is completely separate to the p-rep argument, but it would be unfair to mislead TomR, Janne (and anyone else who uses this intuitive shortcut to conceptualise p-values) into thinking their understanding is flawed!
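For anyone who wants to check lcon's arithmetic, both p values fall out of a standard two-tailed binomial test. A minimal sketch using scipy's binomtest (available in scipy 1.7 and later):

    from scipy.stats import binomtest

    # 3 heads in 3 tosses of a supposedly fair coin
    print(binomtest(3, n=3, p=0.5).pvalue)    # 0.25 (two-sided by default)

    # 20 heads in 20 tosses
    print(binomtest(20, n=20, p=0.5).pvalue)  # about 0.000002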

It appears that this p-rep would be a great aid to those of us who don't publish constantly, but read extensively in many psychological areas. It will probably take time for the statisticians to work out all the unhelpful implications. That seems to be the case for the entire field of statistics.

By James E. de Ja… on 13 Sep 2007

lcon, you (along with a large number of scientists) are completely wrong on this, there is no question about it. Your probabilistic statement is intrinsically Bayesian in nature, P(H|D) - the probability of a hypothesis given the data. However, NHST (and most other classical stats including this new p-rep) makes statements of the form P(D|H).

There is no automatic or objective transform between these statements. Well, there is Bayes Theorem, but it requires another input - the prior P(H). The probability of a coin being fair, given 3 heads in a row, necessarily depends on the probability of it being fair, before you started the experiment. And (IMO) that would depend on whether it was randomly picked from your pocket, or shown to you by a suspicious man on the street who offers to bet against you...

It is very worrying to see not only that people get this matter wrong, but that many of them are unable to understand the problem, or even acknowledge that it exists, when it is pointed out to them in simple terms. I've even found errors of this type in stats textbooks and peer-reviewed literature on statistical methods (and most recently documentation of some widely-used stats software libraries), but at least in these cases the authors did usually admit their error when challenged.
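James's pocket-versus-street-hustler point can be made concrete with Bayes' theorem, reusing revere's double-headed-coin scenario. The priors below are invented purely for illustration; the calculation is only a sketch of how P(H|D) depends on them.

    def prob_fair_given_heads(prior_fair, n_heads):
        """P(coin is fair | n heads in a row), when the only alternative
        considered is a double-headed coin that always shows heads."""
        like_fair = 0.5 ** n_heads              # P(data | fair)
        like_two_headed = 1.0                   # P(data | double-headed)
        prior_two_headed = 1 - prior_fair
        return (like_fair * prior_fair) / (
            like_fair * prior_fair + like_two_headed * prior_two_headed)

    # Coin pulled from your own pocket: double-headed coins are very rare.
    print(f"{prob_fair_given_heads(prior_fair=0.999, n_heads=3):.3f}")   # ~0.992

    # Coin offered by a suspicious man on the street.
    print(f"{prob_fair_given_heads(prior_fair=0.5, n_heads=3):.3f}")     # ~0.111

Same data, same P(D|H), very different P(H|D) -- which is the whole point.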

James,

I take your point that there is a logical distinction between P(H|D) and P(D|H), and I certainly don't deny that many scientists' statistical understanding falls short of the ideal, but my point is that it's not an end-of-the-world problem. The common paraphrase of a p-value as "the chance that the thing we observed is due to chance" is, from a practical rather than a theoretical standpoint, not wholly wrong.

It's a hack, an heuristic, an approximation that tends to set statisticians' teeth on edge. But does it lead people into erroneous interpretations of the meaning and impact of experiments? Does it render their interpretations of their own results (which, if published, are most likely significant) meaningless? I would argue no. If that's the paraphrase you know and use intuitively, then stick with it... just bear in mind that it's not statistically accurate.

P-rep will undoubtedly be interpreted with the shorthand "probability of replication", which at least has the advantage of being closer to the intended meaning of the statistic.

I don't usually post on these discussions but I have to agree with lcon on this one.

James, I am a statistician whose background is logic and maths. I also teach statistics to doctoral students across the sciences. After several years of trying to make them understand the finer points of statistical theory, I eventually realised that the 'hacky' understanding many of them held about p values still enabled them to carry out and interpret sophisticated experimental design and analyses.

As one student put it, he did not have to know the electrochemical makeup of a neuron to interpret an fMRI experiment: thinking of it in a 'hacky' way as neurons firing and using oxygen does the job.

In other words, the statistician in me gets very uncomfortable when I hear students vocalise their personal interpretations of p values, but the scientist in me can't find fault with their interpretations of the stats when it comes to experimental research.

I'm all for p rep, by the way. It should be easier to explain to students when it hits the mainstream!

Well, it's good to see that we all agree on the basic facts. There's a bunch of climate scientists who have spent the last several years trying to redefine probability theory in order to validate their incorrect understanding of this point (which is why I am getting so dogmatic on the subject, not that I claim any great expertise).

As for does it matter, well I agree in practice it generally does not matter greatly, especially in cases where there is a reasonable prior belief that an effect of reasonable magnitude exists (and if not, the experiment probably wouldn't have been performed), which is probably why the incorrect interpretation has been able to survive and indeed flourish. But then on the occasion when the error does matter, as it undoubtedly does in some aspects of climate science, one risks having one's papers rejected on the grounds that everyone does it that way (yes really), and reams of incorrect literature is quoted back in defence of the error.

Is there actually a valid reason why scientists should not be taught correctly?

I'm not sure really what the point of p-rep is other than to enable frequentists to continue to misrepresent their results as something other than what they are really capable of assessing. Maybe it could be taken as an encouraging sign that they acknowledge that the p-value is routinely abused.