[This post was originally published in March 2007]
Earlier today I posted a poll [and I republished that poll yesterday] challenging Cognitive Daily readers to show me that they understand error bars, those little I-shaped indicators of variability and uncertainty you sometimes see on graphs. I was quite confident that they wouldn’t succeed. Why was I so sure? Because in 2005, a team led by Sarah Belia conducted a study of hundreds of researchers who had published articles in top psychology, neuroscience, and medical journals. Only a small portion of them could demonstrate accurate knowledge of how error bars relate to significance. If published researchers can’t do it, should we expect casual blog readers to?
First off, we need to know the correct answer to the problem, which requires a bit of explanation. The concept of a confidence interval comes from the fact that very few studies actually measure an entire population. We might measure reaction times of 50 women in order to make generalizations about reaction times of all the women in the world. The true mean reaction time for all women is unknowable, but when we speak of a 95 percent confidence interval around our mean for the 50 women we happened to test, we are saying that if we repeatedly studied a different random sample of 50 women and computed a confidence interval from each sample, about 95 percent of those intervals would contain the true mean for all women.
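To make the "repeated samples" idea concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the population is invented (normally distributed reaction times with a mean of 300 ms and a standard deviation of 40 ms), and the critical value 2.0096 is the standard two-sided 95 percent value of Student's t for 49 degrees of freedom, which only applies to samples of exactly 50.

```python
import random
import statistics

N = 50                # sample size used in the example above
T_CRIT_DF49 = 2.0096  # two-sided 95% critical t value for N - 1 = 49 df

def ci95(sample):
    """Return (low, high) of the 95% confidence interval for the mean."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - T_CRIT_DF49 * se, m + T_CRIT_DF49 * se

def coverage(true_mean=300.0, true_sd=40.0, trials=2000, seed=1):
    """Fraction of simulated samples whose interval contains the true mean."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Hypothetical population: normal reaction times in milliseconds.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(N)]
        low, high = ci95(sample)
        hits += low <= true_mean <= high
    return hits / trials

print(coverage())  # should land close to 0.95
```

The fraction printed is the share of intervals that captured the true mean, which is what the "95 percent" in a 95 percent confidence interval refers to.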
Now suppose we want to know if men’s reaction times are different from women’s reaction times. We can study 50 men, compute the 95 percent confidence interval, and compare the two means and their respective confidence intervals, perhaps in a graph that looks very similar to Figure 1 above. If Group 1 is women and Group 2 is men, then, loosely speaking, the graph is saying that there’s a 95 percent chance that the true mean for all women falls within the confidence interval for Group 1, and a 95 percent chance that the true mean for all men falls within the confidence interval for Group 2. The question is, how close can the confidence intervals be to each other and still show a significant difference?
In psychology and neuroscience, a difference is conventionally called significant when p is less than .05, meaning that if there were truly no difference between the means, data showing a difference at least this large would turn up less than 5 percent of the time. I won’t go into the statistics behind this, but if the groups are roughly the same size and have roughly the same-size confidence intervals, this graph shows the answer to the problem Belia’s team proposed:
The confidence intervals can overlap by as much as 25 percent of their total length and still show a significant difference between the means for each group. Any more overlap and the results will not be significant. So how many of the researchers Belia’s team studied came up with the correct answer? Just 35 percent were even in the ballpark, that is, within 25 percent of the correct gap between the means. Over thirty percent of respondents said that the correct answer was when the confidence intervals just touched. That is much too strict a standard, for it corresponds to p<.006, meaning a difference that large would turn up less than 1 percent of the time if the true means were equal, compared to the accepted p<.05.
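The overlap figures above can be checked with a few lines of arithmetic. This is only a rough sketch under stated assumptions: two independent groups of equal size with equal standard errors, and a normal approximation in place of the t distribution (reasonable for groups of 50 or so). The function name `p_from_ci_overlap` is made up for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()           # standard normal distribution
Z95 = STD_NORMAL.inv_cdf(0.975)     # about 1.96

def p_from_ci_overlap(overlap_fraction):
    """Approximate two-tailed p for two independent groups with equal SE.

    overlap_fraction is the overlap of the two 95% CI bars, expressed as a
    fraction of one bar's total length (0.0 means the bars just touch).
    """
    bar_length = 2 * Z95                           # bar length in SE units
    mean_diff = bar_length * (1 - overlap_fraction)  # distance between means
    z = mean_diff / 2 ** 0.5                       # SE of a difference is sqrt(2) * SE
    return 2 * (1 - STD_NORMAL.cdf(z))

print(round(p_from_ci_overlap(0.0), 4))   # bars just touching
print(round(p_from_ci_overlap(0.25), 4))  # bars overlapping by a quarter
```

Under these assumptions, bars that just touch come out a little under .006, and a 25 percent overlap comes out somewhat below .05, which is why the quarter-overlap rule works as a slightly conservative eyeball test.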
But perhaps the study participants were simply confusing the concept of confidence interval with standard error. In many disciplines, standard error is much more commonly used. So Belia’s team randomly assigned one third of the group to look at a graph reporting standard error instead of a 95% confidence interval:
How did they do on this task? Once again, first a little explanation is necessary. Standard errors are typically smaller than confidence intervals. For reasonably large groups, they represent a 68 percent chance that the true mean falls within the range of standard error — most of the time they are roughly equivalent to a 68% confidence interval. In fact, a crude rule of thumb is that when standard errors overlap, assuming we’re talking about two different groups, then the difference between the means for the two groups is not significant.
Actually, for purposes of eyeballing a graph, the standard error ranges must be separated by a gap of about half the full length of one error bar (roughly one standard error) before the difference is significant. The following graph shows the answer to the problem:
Only 41 percent of respondents got it right. Overall, they were too generous, putting the means too close together. Nearly 30 percent made the error bars just touch, which corresponds to a significance level of only about p=.16, compared to the accepted p<.05.
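The same kind of back-of-the-envelope check works for standard error bars. As before, this is a sketch that assumes two independent, equal-sized groups with equal standard errors and uses a normal approximation; `p_from_se_gap` is a hypothetical helper name, not something from Belia’s paper.

```python
from statistics import NormalDist

def p_from_se_gap(gap_in_se):
    """Approximate two-tailed p given the gap between two SE bars.

    gap_in_se is the empty space between the bars, in standard errors
    (0.0 means the bars just touch; negative values mean they overlap).
    """
    # Each bar extends one SE from its mean, so touching bars put the
    # means two SEs apart.
    mean_diff = 2.0 + gap_in_se
    z = mean_diff / 2 ** 0.5  # SE of a difference is sqrt(2) * SE
    return 2 * (1 - NormalDist().cdf(z))

print(round(p_from_se_gap(0.0), 3))  # bars just touching
print(round(p_from_se_gap(1.0), 3))  # gap of one SE, half a bar's length
```

With touching bars the result is close to .16, and with a gap of one standard error it drops below .05, in line with the figures in the text.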
When error bars don’t apply
The final third of the group was given a “trick” question. They were shown a figure similar to those above, but told that the graph represented a pre-test and post-test of the same group of individuals. Because retests of the same individuals are very highly correlated, error bars drawn around the two separate means cannot be used to determine significance; what matters is the variability of each individual’s change from one test to the next. Only 11 percent of respondents indicated they noticed the problem by typing a comment in the allotted space. Incidentally, the CogDaily graphs which elicited the most recent plea for error bars do show a test-retest method, so error bars in that case would be inappropriate at best and misleading at worst.
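A small made-up example shows why the error bars mislead here. The numbers below are invented reaction times for ten hypothetical people tested twice; the critical value 2.262 is the standard two-sided 95 percent value of Student's t for 9 degrees of freedom. This is an illustration of the general point, not a reconstruction of the figure Belia’s team used.

```python
import statistics

T_CRIT_DF9 = 2.262  # two-sided 95% critical t value for 10 - 1 = 9 df

# Hypothetical reaction times (ms) for the same 10 people, tested twice.
pre = [310, 342, 298, 365, 330, 351, 287, 322, 340, 305]
post = [318, 354, 307, 376, 340, 364, 294, 331, 352, 314]

def ci95(values):
    """95% confidence interval for the mean of one set of scores."""
    m = statistics.mean(values)
    half = T_CRIT_DF9 * statistics.stdev(values) / len(values) ** 0.5
    return m - half, m + half

def paired_t(before, after):
    """t statistic computed from each person's change score."""
    diffs = [a - b for a, b in zip(after, before)]
    se = statistics.stdev(diffs) / len(diffs) ** 0.5
    return statistics.mean(diffs) / se

pre_ci, post_ci = ci95(pre), ci95(post)
bars_overlap = pre_ci[1] > post_ci[0]  # upper end of pre vs lower end of post
t = paired_t(pre, post)

print(pre_ci, post_ci, bars_overlap)
print(t, t > T_CRIT_DF9)
```

The two confidence intervals overlap heavily, which would suggest no difference if the groups were independent, yet every person slowed down by a similar amount, so the paired t statistic is far beyond the critical value.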
Belia’s team recommends that researchers make more use of error bars — specifically, confidence intervals — and educate themselves and their students on how to understand them.
You might argue that Cognitive Daily’s approach of avoiding error bars altogether is a bit of a copout. But we think we give enough explanatory information in the text of our posts to demonstrate the significance of researchers’ claims. Moreover, since many journal articles still don’t include error bars of any sort, it is often difficult or even impossible for us to include them. And those who do understand error bars can always look up the original journal articles if they need that information. Still, with the knowledge that most people, even most researchers, don’t understand error bars, I’d be interested to hear our readers make the case for whether or not we should include them in our posts.