Most researchers don't understand error bars

[This post was originally published in March 2007]

Earlier today I posted a poll [and I republished that poll yesterday] challenging Cognitive Daily readers to show me that they understand error bars -- those little I-shaped indicators of statistical uncertainty you sometimes see on graphs. I was quite confident that they wouldn't succeed. Why was I so sure? Because in 2005, a team led by Sarah Belia conducted a study of hundreds of researchers who had published articles in top psychology, neuroscience, and medical journals. Only a small portion of them could demonstrate accurate knowledge of how error bars relate to significance. If published researchers can't do it, should we expect casual blog readers to?

Confidence Intervals
First off, we need to know the correct answer to the problem, which requires a bit of explanation. The concept of a confidence interval comes from the fact that very few studies actually measure an entire population. We might measure reaction times of 50 women in order to make generalizations about reaction times of all the women in the world. The true mean reaction time for all women is unknowable, but when we report a 95 percent confidence interval around our mean for the 50 women we happened to test, we are saying that if we repeatedly drew new random samples of 50 women and computed an interval from each one, 95 percent of those intervals would contain the true mean for all women.
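
To make that procedural reading concrete, here is a minimal simulation sketch of my own -- none of it comes from Belia's paper, and the true mean of 450 ms, standard deviation of 80 ms, and sample size of 50 are invented for illustration. It draws many random samples, builds a 95 percent confidence interval from each, and counts how often those intervals capture the true mean.

    # Minimal sketch of the "95 percent of intervals" reading of a confidence
    # interval; the population parameters below are invented for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mean, sd, n, trials = 450.0, 80.0, 50, 10_000

    hits = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, sd, n)
        sem = sample.std(ddof=1) / np.sqrt(n)
        half_width = stats.t.ppf(0.975, df=n - 1) * sem  # 95% CI half-width
        if abs(sample.mean() - true_mean) <= half_width:
            hits += 1

    print(f"fraction of intervals containing the true mean: {hits / trials:.3f}")  # ~0.95

Roughly 95 percent of the simulated intervals contain the true mean -- the interval is a statement about the sampling procedure, not about any single sample.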

Now suppose we want to know if men's reaction times are different from women's reaction times. We can study 50 men, compute the 95 percent confidence interval, and compare the two means and their respective confidence intervals, perhaps in a graph that looks very similar to Figure 1 above. If Group 1 is women and Group 2 is men, then the graph is saying that there's a 95 percent chance that the true mean for all women falls within the confidence interval for Group 1, and a 95 percent chance that the true mean for all men falls within the confidence interval for Group 2. The question is, how close can the confidence intervals be to each other and still show a significant difference?

In psychology and neuroscience, this standard is met when p is less than .05, meaning that if there were really no difference between the groups, a difference at least this large would show up less than 5 percent of the time. I won't go into the statistics behind this, but if the groups are roughly the same size and have roughly the same-size confidence intervals, this graph shows the answer to the problem Belia's team posed:

[Figure: the two group means repositioned so that their 95% confidence intervals overlap by about a quarter of their total length -- the point at which the difference is just significant.]

The confidence intervals can overlap by as much as 25 percent of their total length and still show a significant difference between the means for each group. Any more overlap and the results will not be significant. So how many of the researchers Belia's team studied came up with the correct answer? Just 35 percent were even in the ballpark -- within 25 percent of the correct gap between the means. Over thirty percent of respondents said that the correct answer was when the confidence intervals just touched -- much too strict a standard, since bars that just touch correspond to roughly p = .006, far more stringent than the accepted p < .05.
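
If you want to check the overlap rule yourself, here is a back-of-the-envelope sketch of my own (not from the paper), assuming two equal-size groups of 50 with equal standard errors; the exact p-values shift slightly with sample size.

    # How much two 95% CIs overlap versus the two-sample p-value, assuming two
    # independent groups of equal size with equal standard errors.
    import numpy as np
    from scipy import stats

    n = 50                                   # per group (an assumption)
    se = 1.0                                 # per-group standard error, arbitrary units
    half_width = stats.t.ppf(0.975, df=n - 1) * se   # 95% CI half-width, ~2 * se
    se_diff = np.sqrt(2) * se                # SE of the difference between the means

    for overlap_frac in (0.0, 0.25, 0.5):    # overlap as a fraction of a CI's total length
        overlap = overlap_frac * 2 * half_width
        mean_gap = 2 * half_width - overlap  # distance between the two means
        t_stat = mean_gap / se_diff
        p = 2 * stats.t.sf(t_stat, df=2 * n - 2)
        print(f"CIs overlapping by {overlap_frac:.0%} of their length -> p = {p:.3f}")
    # 0% overlap (just touching) gives p of about .006; 25% overlap sits just
    # under .05; 50% overlap is nowhere near significant (about .16).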

Standard Errors
But perhaps the study participants were simply confusing the concept of confidence interval with standard error. In many disciplines, standard error is much more commonly used. So Belia's team randomly assigned one third of the group to look at a graph reporting standard error instead of a 95% confidence interval:

[Figure: the same two-group graph, but with error bars showing standard error rather than a 95% confidence interval.]

How did they do on this task? Once again, first a little explanation is necessary. Standard error bars are typically smaller than confidence intervals. For reasonably large samples, a bar of plus or minus one standard error is roughly equivalent to a 68 percent confidence interval. In fact, a crude rule of thumb is that when the standard error bars for two different groups overlap, the difference between the two group means is not significant.

Actually, for purposes of eyeballing a graph, the standard error bars must be separated by a gap of about half the total width of a bar -- roughly one standard error -- before the difference is significant. The following graph shows the answer to the problem:

[Figure: the two group means repositioned so that their standard error bars are separated by a gap of about one standard error -- the point at which the difference is just significant.]

Only 41 percent of respondents got it right -- overall, they were too generous, putting the means too close together. Nearly 30 percent made the error bars just touch, which corresponds to roughly p = .16, well short of the accepted p < .05.
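
The same back-of-the-envelope arithmetic shows where the touching-bars and half-bar-width figures come from; again this is my own sketch, assuming two equal-size groups of 50 with equal standard errors.

    # Standard error bars drawn as mean +/- 1 SE: "just touching" puts the means
    # 2 SE apart, and a gap of about one SE (half a bar's total width) puts them
    # 3 SE apart. Two independent, equal-size groups are assumed.
    import numpy as np
    from scipy import stats

    n = 50
    se = 1.0
    se_diff = np.sqrt(2) * se

    for label, gap_in_se in (("bars just touching", 2.0), ("gap of about one SE", 3.0)):
        t_stat = gap_in_se * se / se_diff
        p = 2 * stats.t.sf(t_stat, df=2 * n - 2)
        print(f"{label}: means {gap_in_se:.0f} SE apart -> p = {p:.2f}")
    # touching bars: p of about .16; the extra one-SE gap brings it down near .04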

When error bars don't apply
The final third of the group was given a "trick" question. They were shown a figure similar to those above, but told that the graph represented a pre-test and post-test of the same group of individuals. Because retests of the same individuals are very highly correlated, error bars of this kind -- which reflect how much individuals differ from one another -- cannot be used to determine significance. Only 11 percent of respondents indicated they noticed the problem by typing a comment in the allotted space. Incidentally, the CogDaily graphs which elicited the most recent plea for error bars do show a test-retest method, so error bars in that case would be inappropriate at best and misleading at worst.
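
To see why, here is a small invented example (mine, not from Belia's materials): the same 20 people are measured twice, individuals differ a great deal from one another, but everyone improves by roughly the same small amount. The two conditions' error bars overlap heavily, yet the paired test -- which looks only at each person's change -- is the analysis that actually matters.

    # Invented pre/post data: large between-subject spread, small consistent change.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 20
    baseline = rng.normal(500, 60, n)            # people differ a lot from one another
    pre = baseline + rng.normal(0, 10, n)
    post = baseline - 12 + rng.normal(0, 10, n)  # everyone speeds up by roughly 12 ms

    for name, x in (("pre ", pre), ("post", post)):
        sem = x.std(ddof=1) / np.sqrt(n)
        print(f"{name}: mean {x.mean():.1f}, +/- 1 SE bar of {sem:.1f}")  # bars overlap

    result = stats.ttest_rel(pre, post)          # paired test on each person's change
    print(f"paired t-test: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

The separate error bars are dominated by how much individuals differ from one another, which is exactly the variability a pre-test/post-test comparison sets aside.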

Belia's team recommends that researchers make more use of error bars -- specifically, confidence intervals -- and educate themselves and their students on how to understand them.

You might argue that Cognitive Daily's approach of avoiding error bars altogether is a bit of a copout. But we think we give enough explanatory information in the text of our posts to demonstrate the significance of researchers' claims. Moreover, since many journal articles still don't include error bars of any sort, it is often difficult or even impossible for us to do so. And those who do understand error bars can always look up the original journal articles if they need that information. Still, with the knowledge that most people -- even most researchers -- don't understand error bars, I'd be interested to hear our readers make the case for whether or not we should include them in our posts.

Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10(4), 389-396.


Thank you for stressing the definition of a confidence interval. I am repeatedly telling students that a C.I. is about the process. The true population mean is fixed and unknown. If we repeat our procedure many, many times, 95% of the time we will generate error bars that contain the true mean. It is not correct to say that there is a 5% chance the true mean is outside of the error bars we generated from this one sample. A subtle but really important difference.

I say that the only way people (including researchers) are going to finally get a grip on error bars is by being exposed to them. Keep doing what you're doing, but put the bars in too. After all, knowledge is power!

Hi there,
I agree with your initial approach: simplicity of graphs, combined with clear interpretation of results (based on information that we, readers, don't have).

For those of us who would like to go one step further and play with our Minitab, could I safely assume that the Cognitive Daily team is open to sharing their raw data on request?

Simple communication is often effective communication.

P-A
http://devrouze.blogspot.com/

Perhaps a poll asking CogDaily readers: (a) how many want error bars; (b) how many don't; and (c) how many don't care may be in order.

By Tony Jeremiah on 31 Jul 2008

And then there are all the articles that don't label what their error bars represent, nor under what test they were generated. Are they the points where the t-test drops to 0.025? Quantiles of a bootstrap? And someone in a talk recently showed 99% confidence error bars, which rather changed the interpretation of some of his data.

Personally I think standard error is a bad choice because it's only well defined for Gaussian statistics, but my labmates informed me that if they try to publish with 95% CI, a lot of reviewers will go after them for not being able to do experiments well. Standard error gives smaller bars, so the reviewers like them more.

And then there was the poor guy who tried to publish a box and whisker plot of a bunch of data with factors on the x-axis, and the reviewers went ape. They insisted the only right way to do this was to show individual dots for each data point.

Both cases are in molecular biology, unsurprisingly.

Frederick,
You state "Personally I think standard error is a bad choice because it's only well defined for Gaussian statistics..."
I am likely misunderstanding your comment or the Central Limit Theorem, but I thought, and have seen demonstrated in many Java applets, that even if the underlying population distribution is decidedly non-normal, repeated samples of n > about 30 will result in an approximately normal distribution of sample means, against which one can test the sample mean.

Yes, the point of standard error is that the sampling distribution of the mean is (asymptotically) Gaussian, per the Central Limit Theorem. This holds in almost any situation you would care about in the real world.
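
For anyone who wants to see that point directly, here is a quick sketch (my own) using a strongly skewed exponential population and samples of 30:

    # Sample means from a skewed population are already close to Gaussian at n = 30.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, trials = 30, 10_000
    means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

    print("skewness of the exponential population: 2.0")
    print(f"skewness of the sample means (n = {n}): {stats.skew(means):.2f}")  # near 0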

"the graph is saying that there's a 95 percent chance that the true mean for all women falls within the confidence interval for Group 1"

Sorry but this is wrong. It is true that if you repeated the experiment many many times, 95% of the intervals so generated would contain the correct value. This is NOT the same thing as saying that the specific interval plotted has a 95% chance of containing the true mean. The former is a statement of frequentist probability representing the results of repeated sampling, and the latter is a statement of Bayesian probability based on a degree of belief. The distinction may seem subtle but it is absolutely fundamental, and confusing the two concepts can lead to a number of fallacies and errors.

Thanks for such an interesting explanation. I still think some error bars here and there might be helpful, for those who want to research & stuff.

BTW, which graphing software are you using to make those graphs that I see in every CogDaily post?

The tradition of using SEM in psychology is unfortunate because you can't just look at the graph and determine significance, but you do get some idea of the error term just by looking at it. The SEM bars often do tell you when a difference is not significant (i.e., if they overlap). In any case, the text should tell you which significance test was actually used. But I agree that not putting any indication of variation or error on the graph renders the graph un-interpretable. If I don't see an error bar I lose a lot of confidence in the analysis.

For many purposes, the difference between SE and 95% is just noise. Error bars, even without any education whatsoever, at least give a feeling for the rough accuracy of the data. Do the bars overlap 25% or are they separated 50%? That's splitting hairs, and might be relevant if you actually need a precise answer.

Almost always, I'm not looking for that precise answer: I just want to know very roughly whether two classes are distinguishable. Often enough these bars overlap either enormously or obviously not at all - and error bars give you a quick & dirty idea of whether a result might mean something - and quick comprehension is a valuable thing.

There is an option for the third category of data above, 'when error bars don't apply.' You can create CIs based on within-subject variance, rather than between-subject variance (Loftus & Masson, 1994; Masson & Loftus, 2003). This is becoming pretty popular in the literature...

By appositive on 23 Aug 2008
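
As a rough illustration of the idea behind within-subject error bars -- this is only the intuition, not the full Loftus & Masson procedure, and the data are invented -- you can remove each participant's overall level before computing the bars, so that they reflect only within-subject variability:

    # Invented pre/post data: subtract each subject's own mean (adding back the
    # grand mean), then compute error bars on what remains.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 20
    level = rng.normal(500, 60, n)               # each person's overall level
    pre = level + rng.normal(0, 10, n)
    post = level - 12 + rng.normal(0, 10, n)
    data = np.column_stack([pre, post])          # rows = subjects, cols = conditions

    normalized = data - data.mean(axis=1, keepdims=True) + data.mean()
    sem_between = data.std(axis=0, ddof=1) / np.sqrt(n)
    sem_within = normalized.std(axis=0, ddof=1) / np.sqrt(n)

    print("ordinary (between-subject) SEs:", np.round(sem_between, 1))
    print("within-subject SEs:            ", np.round(sem_within, 1))  # much smaller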

I just read about confidence intervals and significance in my book Error Analysis. Now, I understand what you meant.

So standard "error" is just standard deviation, eh?

And I suppose the 95% confidence intervals are just approx. 2 times the standard deviation, right?

No, the standard error of the mean is different from the standard deviation. The mathematical difference is hard to explain quickly in a blog post, but this page has a pretty good basic definition of standard error, standard deviation, and confidence interval. Anyone have a better link for Freiddie?
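
A few lines of arithmetic make the distinction concrete -- the scores below are invented, with a mean around 100 and a standard deviation around 15:

    # Standard deviation describes the spread of individual scores; the standard
    # error of the mean is SD / sqrt(n); a 95% CI is roughly the mean +/- 2 SEM.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.normal(100, 15, 50)
    n = len(x)

    sd = x.std(ddof=1)
    sem = sd / np.sqrt(n)
    ci_half = stats.t.ppf(0.975, df=n - 1) * sem

    print(f"SD  = {sd:.1f}  (spread of individual scores)")
    print(f"SEM = {sem:.1f}   (uncertainty about the mean; shrinks as n grows)")
    print(f"95% CI = mean +/- {ci_half:.1f}")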

Um... It says "Standard Error of the Mean"? My textbook calls it the "Standard Deviation of the Mean". Are these two the same then?

Ah, statisticians are making life confusing for undergrads.

Question... Ok, so the true mean in the general population is unknown. If I were to take a bunch of samples to get the mean & CI from a sample population, 95% of the time the interval I specified will include the true mean. How do I go from that fact to specifying the likelihood that my sample mean is equal to the true mean? I was asked this sort of question on a stat test in college and remember breaking my brain over it. I just couldn't logically figure out how the information I was working with could possibly answer that question...

Thanks for rerunning a great article -- I missed it the first time. In case anyone is interested, one of our statistical instructors has used this post as a starting point in expounding on the use of error bars in a recent JMP blog post, What Good Are Error Bars?