Earlier today I posted a poll challenging Cognitive Daily readers to show me that they understand error bars -- those little I-shaped indicators of statistical uncertainty you sometimes see on graphs. I was quite confident that they wouldn't succeed. Why was I so sure? Because in 2005, a team led by Sarah Belia conducted a study of hundreds of researchers who had published articles in top psychology, neuroscience, and medical journals. Only a small portion of them could demonstrate accurate knowledge of how error bars relate to significance. If published researchers can't do it, should we expect casual blog readers to?
Belia's team emailed over 3,000 authors of articles that had appeared in the top 10 peer-reviewed journals in each discipline, inviting them to take part in a quick web-based study of knowledge about graphical representation of means and error. Over a thousand visited the site, and 473 completed the study (the others may not have participated due to computer difficulties). One third of the respondents saw the following figure:
They were instructed to move the mean for Group 2 up or down until it was just significantly different from the mean for Group 1 (p<.05). How did they do?
First off, we need to know the correct answer to the problem, which requires a bit of explanation. The concept of a confidence interval comes from the fact that very few studies actually measure an entire population. We might measure reaction times of 50 women in order to make generalizations about reaction times of all the women in the world. The true mean reaction time for all women is unknowable, but when we speak of a 95 percent confidence interval around our mean for the 50 women we happened to test, we are saying that if we repeatedly studied a different random sample of 50 women, 95 percent of the time, the true mean for all women will fall within the confidence interval.
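To see how such an interval is computed, here is a minimal sketch. The reaction times are simulated (the 450 ms mean and 60 ms spread are made up purely for illustration; nothing here comes from an actual study):

```python
import math
import random
import statistics

# Simulate a sample of 50 reaction times (ms); parameters are invented.
random.seed(1)
sample = [random.gauss(450, 60) for _ in range(50)]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# For n = 50, the two-sided t critical value at 95% is about 2.01
# (for large samples, the normal value 1.96 is a close approximation).
t_crit = 2.01
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem
print(f"mean = {mean:.1f} ms, 95% CI = ({ci_low:.1f}, {ci_high:.1f}) ms")
```

If we reran this with fresh random samples, about 95 percent of the intervals so produced would contain the true mean of 450 ms.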
Now suppose we want to know if men's reaction times are different from women's reaction times. We can study 50 men, compute the 95 percent confidence interval, and compare the two means and their respective confidence intervals, perhaps in a graph that looks very similar to Figure 1 above. If Group 1 is women and Group 2 is men, then the graph is saying that there's a 95 percent chance that the true mean for all women falls within the confidence interval for Group 1, and a 95 percent chance that the true mean for all men falls within the confidence interval for Group 2. The question is, how close can the confidence intervals be to each other and still show a significant difference?
In psychology and neuroscience, this standard is met when p is less than .05, meaning that there is less than a 5 percent chance that this data misrepresents the true difference (or lack thereof) between the means. I won't go into the statistics behind this, but if the groups are roughly the same size and have roughly the same-size confidence intervals, this graph shows the answer to the problem Belia's team proposed:
The confidence intervals can overlap by as much as 25 percent of their total length and still show a significant difference between the means for each group. Any more overlap and the results will not be significant. So how many of the researchers Belia's team studied came up with the correct answer? Just 35 percent were even in the ballpark -- within 25 percent of the correct gap between the means. Over thirty percent of respondents said that the correct answer was when the confidence intervals just touched -- much too strict a standard, for this corresponds to p<.006, far stricter than the accepted p<.05.
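Under simplifying assumptions (two independent groups of equal size with equal standard errors), the overlap rule can be checked with a few lines of arithmetic; the numbers below are illustrative, not taken from Belia's study:

```python
import math
from statistics import NormalDist

se = 1.0                          # standard error of each group's mean
half_width = 1.96 * se            # half-width of each 95% CI
ci_length = 2 * half_width        # full length of one CI

# The difference between means that is just significant at p = .05
# in a two-sided z test:
diff = 1.96 * math.sqrt(2) * se   # about 2.77 se

overlap = ci_length - diff        # how much the two CIs still overlap
print(f"overlap = {overlap / ci_length:.0%} of a CI's length")  # about 29%

# Sanity check: the p-value at exactly that separation.
z = diff / (se * math.sqrt(2))
p = 2 * (1 - NormalDist().cdf(z))
print(f"p = {p:.3f}")             # prints p = 0.050
```

The exact figure comes out near 29 percent, which is where the "about a quarter of the CI length" eyeball rule comes from.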
But perhaps the study participants were simply confusing the concept of confidence interval with standard error. In many disciplines, standard error is much more commonly used. So Belia's team randomly assigned one third of the group to look at a graph reporting standard error instead of a 95% confidence interval:
How did they do on this task? Once again, first a little explanation is necessary. Standard errors are typically smaller than confidence intervals. For reasonably large groups, they represent a 68 percent chance that the true mean falls within the range of standard error -- most of the time they are roughly equivalent to a 68% confidence interval. In fact, a crude rule of thumb is that when standard errors overlap, assuming we're talking about two different groups, then the difference between the means for the two groups is not significant.
Actually, for purposes of eyeballing a graph, the standard error ranges must be separated by about half the width of the error bars before the difference is significant. The following graph shows the answer to the problem:
Only 41 percent of respondents got it right -- overall, they were too generous, putting the means too close together. Nearly 30 percent made the error bars just touch, which corresponds to a significance level of just p<.16, compared to the accepted p<.05.
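Both rules of thumb -- the roughly half-bar-width separation, and the p of about .16 when SE bars just touch -- follow from the same arithmetic, again assuming two independent, equal-sized groups with equal standard errors (illustrative values only):

```python
import math
from statistics import NormalDist

se = 1.0                # standard error of each group's mean
bar_width = 2 * se      # an SE bar spans one se above and below the mean

# Difference between means that is just significant at p = .05:
diff = 1.96 * math.sqrt(2) * se   # about 2.77 se

gap = diff - 2 * se     # gap between the tips of the two SE bars
print(f"required gap = {gap:.2f} se, about {gap / bar_width:.0%} of a bar's width")

# If the SE bars just touch, the difference between means is 2 se, and:
p_touch = 2 * (1 - NormalDist().cdf(2 / math.sqrt(2)))
print(f"p when bars just touch = {p_touch:.2f}")   # about .16
```

The required gap works out to roughly 0.77 se, a bit under half of a bar's full width, and the just-touching case gives p of about .16, matching the figures in the text.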
When error bars don't apply
The final third of the group was given a "trick" question. They were shown a figure similar to those above, but told that the graph represented a pre-test and post-test of the same group of individuals. Because retests of the same individuals are very highly correlated, error bars cannot be used to determine significance. Only 11 percent of respondents indicated they noticed the problem by typing a comment in the allotted space. Incidentally, the CogDaily graphs which elicited the most recent plea for error bars do show a test-retest method, so error bars in that case would be inappropriate at best and misleading at worst.
Belia's team recommends that researchers make more use of error bars -- specifically, confidence intervals -- and educate themselves and their students on how to understand them.
You might argue that Cognitive Daily's approach of avoiding error bars altogether is a bit of a cop-out. But we think we give enough explanatory information in the text of our posts to demonstrate the significance of researchers' claims. Moreover, since many journal articles still don't include error bars of any sort, it is often difficult or even impossible for us to do so. And those who do understand error bars can always look up the original journal articles if they need that information. Still, with the knowledge that most people -- even most researchers -- don't understand error bars, I'd be interested to hear our readers make the case for whether or not we should include them in our posts.
Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10(4), 389-396.
Nice piece of work - it certainly helped me!
Ha ha! We're all stupid and you're smart! Well done!
Excellent couple of posts! Well done. I have posted about them on Nautilus, our author blog at Nature Publishing Group (http://www.nature.com/nature/authors/gta/index.html). I expect you can believe only too well how often this issue comes up. We now have a special paragraph in our standard "acceptance in principle for publication" letter about defining error bars, plus a stats checklist for authors on our author information website.
These two posts are very useful indeed.
"In psychology and neuroscience, this standard is p <05, meaning that there is less than a 5 percent chance that the true means are not different from each other."
Please, stop saying that. That's not what a test means (you're confusing the probability of the null hypothesis given the data, p(H0|d), with the probability of the data given the null hypothesis, p(d|H0)).
If you're going to lecture people about their understanding of statistics, you really should get that right.
In any case in plots such as yours what most researchers would try to assess is a more empirical idea of significance. They're used to looking at plots, and they have a notion of what well-behaved and interesting data should look like. Error bars help them do that.
However it's a problem when people can't agree on what type of error bars to use, or use the wrong ones.
Very interesting, but just because people don't understand error bars doesn't mean they shouldn't be shown. The phrase "don't understand" is misleading here; even those researchers who missed those questions surely still realize that large error bars represent less certainty, whether you are talking about 95% confidence intervals or standard errors. I think the extra information is better than none at all.
Wow, it really helps me a lot. I'm a PhD student in environmental studies, and I'm learning statistics. Significant differences always bothered me until I finished reading this article. Thank you so much!
And I feel Simon (#4) has a point.
I've always been suspicious about the use of standard error, since it seems to encourage people to underestimate uncertainty.
Weirdly, when I've tried 95 and 99 percent confidence intervals, people got upset, thinking I was somehow introducing extra uncertainty.
Thanks! I was recently puzzling over a graph at a colloquium talk where the error bars overlapped a little bit and wondering whether it was statistically significant, but didn't get off my lazy butt to go find out. Now I know!
And the 25% overlap thing makes intuitive sense, too, since that implies that the distance between the two means is the same as the length of the error interval to one side.
Whoops! May I take that last paragraph back? It's 1.5 times, of course. And it doesn't make intuitive sense!
"Weirdly, when I've tried 95 and 99 percent confidence intervals, people got upset, thinking I was somehow introducing extra uncertainty."
Well, with a 99 percent confidence interval you are introducing more type II error (keeping the null when it ought to be rejected). This ultimately affects the power of your test.
... never mind, I see what you're saying now. oops.
Excellent defense of your argument. I completely agree that error bars don't belong at Cog Daily since the point is to make the science readable for everyone, whether you have a strong science background or not. The average reader just doesn't fully understand error bars, and they make things look more complicated than they need to be for the sake of practicality. Thanks for putting some good reasoning behind this!
I thought I knew statistics, but now I know I don't!
FYI, there are accepted ways to show confidence intervals in within-subject designs. For example, Loftus & Masson, 1994, "Using Confidence Intervals in Within-Subject Designs".
Hi Dave and Greta,
It's my first comment on your blog, though I read it every morning over breakfast. I would like to thank you for maintaining such a nice source of knowledge.
One request, as it seems you are interested in statistics.
Could you please explain Bayesian probability theory in some post, with the kind of nice examples you always use to explain things?
I have a feel for what the theory wants to say -- probably something like combining past experience with general probability theory -- but I can't explain it well.
I went through some books and webpages, but they seem to make it even harder to comprehend.
Personally, if I'm looking at a graph of data, it's typically frustrating to look at just an average, it doesn't give me any idea of the spread of the data. I take the point about certain sorts of error bars not being clearly understood on average, but I refuse to believe there isn't some way to both provide the facility that people want (with all the people clamouring for error bars, there's clearly some facility missing from the graphs you usually use) and provide it in a clear way.
Maybe provide the error bars with a clear link to an explanation? Or use some non-standard way of displaying the standard deviation (like a dotted line or a fuzzy overlay area) to give people an idea of the spread of the data without leading people to assume mistaken things about confidence intervals.
Mukit: in a very general sense, Bayesian probability theory gives you a way to quantify how much confidence you can have about your knowledge of the world. Some people view it as an extension of logic (e.g. E.T. Jaynes), others just as an elegant way to do inferential statistics. If you're looking for something non-technical, Ian Hacking's "An Introduction to Probability and Inductive Logic" is pretty good. Jaynes' "Probability Theory: The Logic of Science" is great, but the maths can get a bit obscure.
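For a concrete taste of what Bayes' rule does, here is a minimal worked example with entirely made-up numbers (a diagnostic-test scenario, not anything from this thread):

```python
# Bayes' rule: update a prior belief in light of new evidence.
# All numbers below are invented for illustration.
prior = 0.01        # base rate: 1% of people have the condition
sens = 0.95         # P(positive test | condition)
spec = 0.90         # P(negative test | no condition)

# Total probability of a positive test (law of total probability):
p_pos = sens * prior + (1 - spec) * (1 - prior)

# Posterior: P(condition | positive test)
posterior = sens * prior / p_pos
print(f"P(condition | positive) = {posterior:.2f}")  # under 9%
```

Despite a "95 percent accurate" test, the posterior probability stays below 9 percent because the condition is rare -- exactly the kind of prior-plus-evidence reasoning Bayesian theory formalizes.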
Simon, re: comment 4--
You're right, of course, and I've corrected my explanation. It still simplifies things, but I think I'm back on the right side of the balance I always need to make between oversimplification and obfuscation.
A great discussion, but error bars are still much abused -- and as you've shown, often misunderstood. I'll rarely, if ever, let an author get away with using standard errors. In most cases they're just trying to display the variation in their sample, and the standard deviation is the correct parameter for that. But folks don't like how large it is, so they think "Aha! I'll use the standard error and my data will look better." Sure, divide by the square root of n and it'll be tighter, but it's wrong. If one's making a statistical inference, you can get away with standard errors, but since the null hypothesis is that all the samples come from the same population, all the bars on your graph must use the same error estimate -- that's what you're testing. But editors won't be happy. So let's see those SDs! And 95% CIs are OK, too -- but not on pre- and post-tests, though a 95% CI on the difference tells me something. And all this hangs on having normally distributed data, which Mother Nature is loath to provide. And yes, I was a fan of Alvan Feinstein back in the 70's.
I want to suggest that you expand the scope of your analysis from what your readers are likely to understand to the broader issue of what your readers are hoping to accomplish by visiting this site.
I visit this site as part of my ongoing self-education program. Self-education is a process, not just a snapshot in time. If I don't understand something important, that's a temporary problem. By visiting this site, I hope to make my ignorance even more temporary. Instead of stooping to my level of ignorance, I encourage you to show me a professional standard that encourages people like me to live up to it.
In making your argument against using error bars on your graphs, you have simply confirmed for me the value of error bars (which I already believed in), the value of me learning about error bars (which I didn't know enough about), AND you've also shown how you can teach your readers about it in a fun way (because I enjoyed this lesson).
Here's my suggestion: Use error bars, and every other professional idiom of data reporting, but at the bottom of each chart, put a link titled "I bet you don't understand this chart." That link would point to a page of other links and text that point to various articles and postings such as your very informative error bar posting. You can do a whole series of postings on the sorts of things covered in such books as How To Lie With Statistics and How To Lie With Charts.
I suspect (>99% probability) that nearly everyone (>99%) who reads this blog does so because they are curious and intelligent about ways we humans think and how we might think better. Maybe "researchers" are content with a loose notion of statistical thinking, but I suspect your readers have a higher standard. Anyway, I do.
"In psychology and neuroscience, this standard is met when p is less than .05, meaning that there is less than a 5 percent chance that if we obtained new data it wouldn't fit with our hypothesis (in this case, our hypothesis is that the two true means are actually different -- that men have different reaction times from women)." Isn't this statement just a reformulation of your initial claim that the significance level corresponds to the probability of the null hypothesis given the results, while Simon says that the significance level corresponds to the probability of finding the results (i.e. the difference between the two means or a larger difference) given that the null hypothesis is true (i.e. given that the two means do not really differ from each other at the population level)?
I agree with Sam (#5) that the error bars at least give some intuitive impression about the variance in the data. Without the error bars, graphs can be manipulated by altering the axes of the graphs and they might give a false impression. I think it would be a good thing if more people were aware of this problem.
Oops, I meant the p-value, not the significance level, sorry.
What I'm really trying to do is to come up with a "close enough" way to explain the concept without invoking the term "null hypothesis." Just "hypothesis" is difficult for most people, and null hypothesis is even more so. Any ideas?
I think you just have to reverse the statement. The standard is met when p is less than .05, meaning that there is less than a 5 percent chance that we would find the difference between the two conditions we have found in our experiment, or an even larger difference, given that the two true means are actually the same. I think it's hard to understand because we actually want to know something about the probability of the null hypothesis given our data, so the other way around would be more logical.
I think the meaning of the phrase "some effect is statistically significant" might be at least as hard to understand as the meaning of error bars in graphs. So this might be a good reason to include error bars.
Peter has it (almost) right, except it's even more complicated. If there were no difference, and if we were to do that experiment a zillion times, then a result as extreme as our actual measured one would turn up only 5 percent of the time. Of course, that's not what we really want to know. What we really want to know is, as Peter points out, the probability that our results were "due to chance". That's unfortunately not what we get from a test.
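A small simulation makes the distinction in this thread concrete. Both groups below are drawn from the same distribution, so the null hypothesis is true by construction; the parameters (450 ms mean, 60 ms spread, n = 50) are made up for illustration:

```python
import math
import random
from statistics import mean

random.seed(0)
n, sigma, trials = 50, 60.0, 2000

# The difference between group means that a two-sided z test calls
# "just significant" at p = .05:
se_diff = sigma * math.sqrt(2 / n)
crit = 1.96 * se_diff

extreme = 0
for _ in range(trials):
    a = [random.gauss(450, sigma) for _ in range(n)]  # same true mean...
    b = [random.gauss(450, sigma) for _ in range(n)]  # ...for both groups
    if abs(mean(a) - mean(b)) >= crit:
        extreme += 1

# About 5% of these null experiments produce a difference at least that
# large -- that frequency is what "p < .05" is a statement about, not
# the probability that the null hypothesis is true.
print(f"fraction of null experiments beyond the cutoff: {extreme / trials:.3f}")
```

The simulation reports p(d|H0), the long-run frequency of extreme data when there is truly no effect; it says nothing directly about p(H0|d).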
I don't know about other people, but no matter how hard I try, I fail to carry this idea across to most students who took statistics classes in a psych department. It's depressing.
Peter / Simon:
I think I've finally come up with a correction that gets it right. Sorry for all the confusion so far, and thanks for your persistence in setting me straight.
I'm trying to make sense of confidence intervals, and I'm not sure you are consistent within your explanation.
In the first portion of the article you explain confidence interval as "if we repeatedly studied a different random sample of 50 women, 95 percent of the time, the true mean for all women will fall within the confidence interval." This agrees with what I read elsewhere.
However, in the very next paragraph you seemingly change tunes. "If Group 1 is women and Group 2 is men, then the graph is saying that there's a 95 percent chance that the true mean for all women falls within the confidence interval for Group 1, and a 95 percent chance that the true mean for all men falls within the confidence interval for Group 2." This however does not agree with how I understand things.
Doesn't it actually mean something more along the lines of: "The graph presents two ranges; for each, 95% of intervals calculated this way would contain the true mean of women or men." Any particular interval either contains or does not contain the true mean, whereas the original statement suggests there is a 95 percent chance that this specific interval contains it.
Goodness knows that I didn't come up with this on my own. Instead, as is frequently the case, the internet suggested that the confidence interval is frequently misstated that way. The clearest version I found was that "While 99% of 99% confidence intervals contain the parameter, we don't know if this particular interval contains it."
In my own search I found Cumming and Finch (2005) (perhaps the inspiration for Belia's work) to be informative.
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60(2), 170-180.
It's funny that Simon makes the comment about psych students and stats. I always understood p value to be the likelihood that we got the results by pure chance and I was taught that in a psych stats class.
My vote is to include. As another poster also said, I visit the site (and scienceblogs in general) to learn.
Wonderful post! As a follow-up to the discussion of repeated-measures/within-subjects error-bars (EBs): omitting EBs or CIs just because the data is repeated DOES seem like a cop-out, if only because it's pretty easy to make them correctly.
Here's a link to a few relevant articles, including both Belia et al '05 and the definitive article by Masson & Loftus on how to correctly draw repeated-measures EBs (http://psyphz.psych.wisc.edu/%7Eshackman/mediation_moderation_resources…).
Here's a link to an article where we applied this technique to real data (http://psyphz.psych.wisc.edu/web/pubs/2006/Shackman.AnxietySelectively…).
Hope this helps.
Wonderful explanation of why confidence intervals replaced standard error graphs.
I am one of the said researchers you are talking about. I don't understand statistics much, and I have a problem: I do not know whether or not to add error bars to the graph that I am presently making. I am studying the mRNA stability of actin. I repeated the experiment three times, taking readings at different time points and noting the mRNA level. Since the sample size is small, should the error bar be plotted on the graph? I have seen a few who add them and still others who don't. Can you help?
Brave of you to take this on.
BTW. Pat is right--the correct way to interpret intervals is, for example, "in many many samples, 95% of the estimated intervals will contain the true mean", not "there is a 95% probability that the true mean lies within this particular estimate".
My own preference for showing data is to show it. Plot ALL of your points and overlay a box-plot. You'll find out just how confident you are about your results if you do that.