What is “significant” research? In most psychology journals, “significant” results are those measuring up to a difficult-to-understand statistical standard called a null-hypothesis significance test. This test, which seems embedded and timeless, actually has its origins in theoretical arguments less than a century old.
Today’s gold standard of statistical significance is the p value, popularized by Ronald Fisher in the 1920s. Many people, even many active researchers, don’t understand much about the p value other than that when it’s less than .05, the research is usually considered significant. But what does p = .05 really mean? It doesn’t mean there’s a 5 percent error rate in our data, or that our results are 95 percent likely to be true. It means that if the null hypothesis is true, then repeating the study should get the result we found or a more extreme result 5 percent of the time.
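To make that last sentence concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration, not something from Fisher or Killeen: the null hypothesis is "the true mean is zero," each imaginary study draws 20 scores from a normal distribution with a known standard deviation of 1, and the function names are made up. If the null hypothesis really is true, roughly 5 percent of these studies should still come out "significant" at p ≤ .05.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def two_tailed_p(sample):
    """Two-tailed p value for a z test of 'true mean = 0', assuming sd = 1."""
    z = mean(sample) * sqrt(len(sample))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_alarm_rate(n_studies=20_000, n_per_study=20, alpha=0.05, seed=1):
    """Fraction of simulated studies reaching p <= alpha when the null is true."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_studies):
        # The null hypothesis holds by construction: the true mean is 0.
        sample = [rng.gauss(0, 1) for _ in range(n_per_study)]
        if two_tailed_p(sample) <= alpha:
            significant += 1
    return significant / n_studies


print(f"Share of null studies with p <= .05: {false_alarm_rate():.3f}")
```

Note what the simulation does not tell us: it says nothing about how likely the null hypothesis is given our data, which is the question most of us actually want answered.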
Understanding what all that means requires an understanding of the concept of “null hypothesis,” of experimental design, and of probability. Even then, the p value still expresses the results of the study in a negative way. Suppose my study finds that men choose to date women with attractive faces more often than they choose women with the traits they say they like (sense of humor, intelligence, and so on), and that these two statistics (number of men choosing attractive women and number of men choosing women with other desired traits) are found to be different, with a p value of .05.
What I’d really like to know is how likely it is that my results accurately represent the preferences of all men. But what the p value actually tells me is how often a study like mine would come up with my (incorrect) result if those two statistics really were no different in the general population. It’s either a double-negative or a triple-negative; I’m not sure.
Surely there’s a more intuitive way to demonstrate the significance of a study. Peter Killeen believes he’s found it, and the Association for Psychological Science (the APS, not to be confused with the APA, the American Psychological Association) has adopted this new measure of significance in its highly respected journals. The measure is denoted as prep, or the probability of replicating an effect.
Depending on certain characteristics of the group being studied, Fisher’s p values can have very different implications. In some cases, especially when an effect is small, “significant” p values can be difficult to reproduce in follow-up studies. Killeen’s prep avoids this problem by taking into account the statistical power of a study.
How does prep work? It’s difficult to explain how the number is generated, but it’s easy to argue that the resulting value is intuitively simpler to grasp than a traditional p value. Very roughly, a prep gives an approximation of the probability that a particular result, repeated on a new sample, would be observed again.
Let’s return to my original example. If my study finds that more men choose attractive women than choose intelligent women, with a prep of .917, I can say that more than 9 times out of 10, if I repeated the same study, more men would still choose attractive women. (Actually, that’s overgeneralizing: prep is an average probability of replication, so in practice the actual probability might be a little different, but it should be very close to prep.)
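The "average" in that caveat can be illustrated with a rough simulation. This is only a sketch under assumptions of my own choosing, not Killeen's actual procedure: effects are measured in standard-error units, the original study observed an effect 1.96 standard errors above zero, we know nothing else about the true effect (so it is treated as normally distributed around the observed value), and a "replication" simply means a new study whose effect points in the same direction. The function name is hypothetical.

```python
import random


def simulated_replication_rate(z_obs, n_sims=200_000, seed=7):
    """Share of simulated replications whose effect has the same sign as z_obs.

    Assumes z_obs > 0 and that all effects are expressed in standard-error units.
    """
    rng = random.Random(seed)
    same_direction = 0
    for _ in range(n_sims):
        # Step 1: we don't know the true effect, only that our observation
        # was one noisy draw from it, so sample a plausible true effect.
        true_effect = rng.gauss(z_obs, 1.0)
        # Step 2: a new study adds its own sampling error to that true effect.
        replication = rng.gauss(true_effect, 1.0)
        if replication > 0:
            same_direction += 1
    return same_direction / n_sims


print(f"Simulated replication rate: {simulated_replication_rate(1.96):.3f}")
```

Because the result is averaged over many possible true effects, any single true effect could give a higher or lower replication rate than the figure printed here.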
Killeen’s prep is now beginning to appear in APS journal articles. It hasn’t yet been adopted by other journals, but if you understand p values and you’d like to have a general sense of how prep works, this table, created by Geoff Cumming, may help.
So given a typical population, a p of .05 corresponds to a prep of about .917.
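For readers who want to check that correspondence themselves, here is a small Python sketch of the conversion. It rests on assumptions worth stating plainly: the p value is two-tailed, the effect is normally distributed, and nothing else is known about the true effect. Under those conditions prep works out to the normal probability of z divided by the square root of 2, where z is the z score matching the p value; the square root of 2 appears because both the original study and the replication carry sampling error. The function name is my own.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_rep_from_p(p_two_tailed):
    """Approximate prep for a two-tailed p value, under normal-theory assumptions."""
    # z score of the observed effect implied by the two-tailed p value.
    z = STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)
    # Two sources of sampling error (original + replication) double the
    # variance, which shrinks the z score by a factor of sqrt(2).
    return STD_NORMAL.cdf(z / sqrt(2))


for p in (0.10, 0.05, 0.01, 0.001):
    print(f"p = {p:<5}  prep is roughly {p_rep_from_p(p):.3f}")
```

The row for p = .05 should land near the .917 quoted above. Because the conversion moves in only one direction, a smaller p always gives a larger prep.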
One problem with this new requirement for APS journals is that there is no longer a set standard for significance like p < .05: this is now a judgment call for peer reviewers to make. We’ll be reviewing some articles that use this new standard in the coming weeks, so it will be interesting to see how prep is implemented in practice.
Killeen, P.R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16(5), 345-353.
Cumming, G. (2005). Understanding the average probability of replication. Psychological Science, 16(12), 1002-1004.