I'll bet you still don't understand error bars

By dmunger on July 30, 2008.

Cognitive Daily gets a lot of complaints about graphs, mostly from readers who say the graphs are useless without error bars. My response is that error bars are confusing to most readers. But perhaps I'm wrong about that. Last year I posted about this issue, and backed it up with a short quiz about error bars, which most of our readers failed. After another 16 months of Cognitive Daily, maybe they've improved. So here's the test again.

Take a look at this graph. It represents a fictional experiment where two different groups of 50 people took a memory test. The mean scores of each group are shown, along with error bars showing standard error:

Based on this graph, can you tell if there is a significant difference (p < .05) between the scores of the two groups? Let's make this a poll (for the sake of accuracy, please respond as best you can even if you don't know what the error bars represent).

Below I've included a similar graph, again testing two different groups of 50 people but using a different type of error bar:

Again, based on this graph, can you tell if there is a significant difference (p < .05) between the scores of the two groups?

I'll bet that we will still get a large number of incorrect responses to each poll, even though many of our readers are active researchers, and even though I already posted the same quiz last year.

The Bet
Last year, I offered this wager. I said that fewer than 50 percent of our readers could accurately answer the poll questions without cheating or looking up the information elsewhere first. If we got more than 300 responses to each poll, and accuracy was better than 50 percent for each, then I would add error bars to every graph I produce for Cognitive Daily from there on out (as long as the researchers publish enough information for me to generate them -- and as long as the error bars are statistically relevant [more on that later]). If not, then I would get to link to this post every time a commenter complained about Cognitive Daily not putting error bars in its graphs.

What's a reasonable wager the second time around? Should I still keep it at 50 percent, or should I up the ante, since readers should have learned last time?

If you'd like to know the answers to the quiz, you'll have to check last year's post.

The question isn't whether we readers collectively know 100% of the information conveyed by the error bars, it's whether overall the error bars add to our understanding of the graphs.

I may, in the future, forget the exact definition of what the error bars mean, but I will still be capable of saying "Whoo, small error bar, that figure is probably pretty accurate" and "Whoa, look at that huge error bar, I'll use a bigger grain of salt to look at that figure".

Why are you deliberately withholding information from those readers who do know what they're for? Surely those of us that don't know what they're for will just ignore them anyway for the most part?

I agree with Sharon. Why not include them, as well as a note that says, "Not sure how to read error bars? Read this [LINK]." That way your ignorant but motivated readers stand a chance of becoming more enlightened. If, as I suspect, it's just too much of a pain in the butt to include them, that should be a sufficient excuse without blaming us ignoramuses.

is it possible that this isn't the best way to convey the information? maybe a different type of graph would work better. i've not seen error bars on bar graphs before.

My thoughts on error bars:
1. It is absolutely necessary to have error bars when graphically representing data, as without it there is no context with which to interpret the numbers. Are they different? One cannot know this without error bars.
2. I have found (and have used in all of my publications) that the best way to use error bars is to plot 2 standard errors on either side of a mean. This tends to give the naive reader the impression that there is a lot of variation, but it is the easiest to interpret for the educated reader. For a basic t-test situation, there is no significant difference if the 2 standard error bar of one sample overlaps the mean of the other sample. Many folks don't like this though, as they want to give the impression of little variation in their data and thus prefer to use only one standard deviation.

But again. For context with which to interpret data, error bars of some sort must be included.

I'm one of those agitators for error bars, and I also strongly agree with Sharon's post. The test here is not very meaningful, since it just shows that people cannot do complicated maths in their heads. If you were really worried about people making poor mental conversions from graphical error bars to significance testing at a particular percentage, you could easily make a "key" graphic showing the degree of separation that is exactly equal to .05 or whatever arbitrary significance you like.

But the bottom line is that more information is better than less! Without error bars, you can't even do a poor conversion in your head; you can't do any comparison whatsoever.

PS: I do kind of worry that you might take our criticism personally. I really like your blog and read every one, even if I mostly post comments to gripe about things. With or without error bars, I'll still be a loyal reader.

I also agree with Sharon. In the sample graphs even the most ignorant reader can see that the two bars are equally accurate, and this is useful information in itself. Confidence limits are even better, as long as the confidence level is the same in every graph.

I also agree with Sharon and hilllady. Challenge our minds!

The calculation needed to test if the difference is significant is indeed too complicated for most people to do in their heads. Be reminded that SEs don't add up, but variances do. So the standard error of the difference is
SE(m1-m2) = square root of (SE1^2 + SE2^2)

that means, you should not see if the confidence interval of group 1 contains mean of group 2, since the latter is considered random! You should measure if the difference is more than 2 times the SE(m1-m2) above. Geometrically it's the diagonal of a rectangle whose sides are the group SE's.

Alright it's impossible to tell by just looking at the graph.

BTW nice blog.
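
The standard-error-of-the-difference rule in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two groups are independent, the "2 times SE" cutoff is the commenter's rule of thumb rather than an exact test, and the function names and example numbers are hypothetical, not taken from the quiz graphs.

```python
import math


def se_of_difference(se1, se2):
    """Standard error of the difference between two independent means.

    Variances add, so the SEs combine like the sides of a right triangle.
    """
    return math.hypot(se1, se2)


def exceeds_two_se(mean1, mean2, se1, se2, multiplier=2.0):
    """Rule of thumb: is the gap between means more than ~2 SEs of the difference?"""
    return abs(mean1 - mean2) > multiplier * se_of_difference(se1, se2)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
print(se_of_difference(3.0, 4.0))     # 5.0
print(exceeds_two_se(70, 50, 5, 5))   # True: 20 > 2 * 7.07
print(exceeds_two_se(60, 50, 5, 5))   # False: 10 < 2 * 7.07
```

In the last case the two 1-SE bars (60 ± 5 and 50 ± 5) just touch, yet the difference falls short of the cutoff, which is why SE bars that merely fail to overlap do not by themselves indicate significance.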

hilllady makes a great suggestion.

If you do this, many of us will surely learn how to read error bars properly. I know I would after several posts. And wouldn't that be splendid?

I got the first one but missed the second. I call shenanigans on the question, however. The graphs are the same color, size, everything. However, the definition of what the error bars were measuring was changed from standard error to 95% confidence, and I totally failed to notice the small-print change. My fault, I know, but I think people would do a lot better on this test if the change in the definition were noted more clearly.

I have posted a comment on the appropriate use of error bars on my blog : http://skeetersays.blogspot.com/2008/07/on-use-of-error-bars.html

Sharon's comment (#1) scared me a bit and I thought I would elaborate more on my above comment (#4) with a few examples.

This is a big deal that I agree needs to be fixed. Understanding error (bars) is essential to understanding the statistical analysis of any data. We need to correct this problem in any way we can.

Matt, I just made the same mistake! I'm sure many people that know how to read error bars perfectly well will similarly miss the change.

Matt and Wilbur: I think maybe that was the whole point of the exercise: to show that if you don't look at the small print, which defines what kind of error bar you're looking at, then you can't be sure to interpret it correctly. And that's probably the major problem with error bars: too many people just don't make the effort of checking what kind they are, so they think they know what it means, but in fact there's a big chance that they don't. Reading error bars "perfectly well" requires reading the small print.

Why not a bar-and-whisker plot rather than a plot that makes the possibly misleading assumption that the data is evenly clustered about the mean?

Hey, I'm still trying to figure out why you'd plot a mean score as a bar instead of as a point.

BAllanJ - (#15)

I have been having that discussion with my PhD advisor for years.

To me bars represent counts of things - frequencies of events, whereas points are better suited for mean values of groups. My adviser prefers bars because they 'look better'.

A flaw in your logic is that it is not just your readers that misunderstand error bars:

Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10(4), 389-396.

So the problem is widespread among researchers, let alone your more varied audience. Yet one doesn't find guidelines like the APA Publication Manual throwing up its hands and saying "don't bother with error bars, most people don't get them".

In science communication, whether formal or not, we should focus on excellence in presenting the data and its interpretation, rather than focussing on the perceived mediocrity of the audience. Error bars don't act as a barrier to anyone who can't interpret them, but their absence detracts from those who can.

And you are focussing on just one issue, which is whether people can interpret significant differences in between-group comparisons. Generally it is a given that the differences are significant as that is why they are being presented. In presentation rather than analysis, error bars are more often useful as a guide to the variability of the data.

The question for the producer of a graph then is not so much "should I use error bars" as "what should my error bars represent"?

Standard deviations are useful to show the spread of the data as they are independent of sample size. Range/ interquartile range, etc. may also be useful. Confidence intervals are appropriate when we are concerned to show how precisely we have estimated the mean.

Standard errors in themselves are not very informative. I think we all know the only reason they are used is because they are by definition smaller than their associated standard deviation or confidence interval, and hence give a deceptively favourable impression of the variability of the data.

A very useful guide to those seeking to improve their reading of error bars by eye is:

Cumming, G., & Finch, S. (2005). Inference by eye: confidence intervals and how to read pictures of data. American Psychologist, 60(2), 170-180.

It may be easier to conceptualize standard error (SE) and confidence interval (CI) via the concept of signal-to-noise, whereby signal and noise are defined slightly differently in each circumstance.

For SE:

1.Signal = difference in the mean of the indicated groups
2.Noise = the largest indicated SE range
3.Significance if Signal > Noise

For the first graph:

1.Signal = 74-50 = 24

2a.Noise(Group 1) = ~83 - ~65 = 18
2b.Noise(Group 2) = ~59 - ~41 = 18

3.Signal > Noise, so it should be significant but the difference is fairly close if one is just eyeballing it.

For CI:

1.Signal = larger mean (and its CI)
2.Noise = smaller mean (and its CI)
3.Significance if Signal > Noise

For the 2nd graph:

1. Signal = [59]-(70)-[81]
2. Noise = 39-(50)-61

3. 39-(50)-[59]:61-(70)-[81]

If one imagines this as the normal distribution of two populations (with their means and CI representing the tails), there's only a slight bit of interacting noise (59 & 61) at the tails, with the primary signal (70) lying well above the primary noise. Signal > Noise so this should be pretty significant.

Hi
How about error bars for various curves such as the parabolic ones?

Huh, isn't there a strong normality assumption in your previous question answers?

I thought an error bar is when you're supposed to meet with some friends, but end up in the wrong pub. Then, of course, the confidence level (that you're in the right pub and they're in the wrong pub) is proportional to how much you drink whilst waiting for them....

I'd like to have a go, as a non-practising scientist (IYKWIM).

I reckon the first one is not significant because the overlap point is only about one-and-a-bit standard deviations from each of the data points. This doesn't look good (but I don't know how to evaluate it absolutely).

The second one is *probably* significant because the overlap is short, so the chance of the overlap being the real result is, say 92% probability (on the first datum) multiplied by, say 88% on the other. This is less than 95%, and so is significant even though the error bars overlap. I describe it this way as I tend to think of 0.95 SD bars as Gauss curves drawn to the 95th percentile.

Someone who is currently using stats care to describe any errors in the above?

can we have a post on how to use error bars please?

@19: Assuming this is a reference to mean comparisons with more than 2 groups, the same logic would apply, but using F-test rather than t-test logic. F-tests allow for trend analyses (e.g., whether the means exhibit linear, quadratic, or cubic patterns).

@20: SE and CI are based on t and z population distributions, which assume normality.

@21: That's funny stuff. :)

@22: SE = SD / sqrt(n)
     95% CI = Xbar +/- 1.96(SE)

Those are the formulas for determining significance in the context of SE and CI, so determining significance by eyeballing is virtually impossible.
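
For readers who would rather see the two formulas above as code, here is a minimal Python sketch. It assumes the normal-approximation multiplier of 1.96 used in the comment (with groups of 50, a t-based multiplier would be slightly larger), and the function names and sample numbers are hypothetical.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


def ci95(mean, sd, n, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    half_width = z * standard_error(sd, n)
    return (mean - half_width, mean + half_width)


# Hypothetical group: mean 50, SD 20, 100 people.
print(standard_error(20, 100))                 # 2.0
low, high = ci95(50, 20, 100)
print(round(low, 2), round(high, 2))           # roughly 46.08 and 53.92
```

The 95% CI bar is about twice as long as the SE bar for the same data, which is why the two quiz graphs cannot be read the same way.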

Well, standard error is standard error. It's just a measurement. CI can be calculated using any statistical model. The problem is that using standard errors for hypothesis testing requires model assumptions, and so for example the answer Dave gave last time probably uses assumptions like normality. Confidence intervals are a bit better in that regard since the modelling considerations have already been taken into account.

I wouldn't say that eyeballing is impossible. In the marginal cases, yes, it's difficult to judge. But there are certainly rules of thumb. Non-overlapping CIs are pretty obvious to anyone.

In addendum, the problem is that CI gives a region of data that the true value is 95% likely to be in. But it doesn't give you any other info.

To interpret from 95% CI where the 90% CI could be, or indeed the whole curve of probabilities for the data is not possible, unless you put in strong assumptions about how the CI was calculated and the actual model the data obeys. For the second graph, for example, the CI might disguise the truth that the real data is two point masses on the ends of the CI, in which case there is a 25% chance that A is actually lower than B.

Quite possibly I'm being overly pedantic, though. Test scores are generally pretty normal.
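
The 25% figure in the comment above can be checked by brute-force enumeration. The sketch below assumes each group's true value sits at one end of its interval with probability 1/2, independently of the other group, and it borrows the interval endpoints (59 to 81 and 39 to 61) that another commenter read off the second graph by eye. Those endpoints and the function name are illustrative assumptions, not published values.

```python
from fractions import Fraction
from itertools import product


def prob_a_below_b(a_outcomes, b_outcomes):
    """Probability that A < B for two independent discrete distributions.

    Each argument maps a possible value to its probability.
    """
    total = Fraction(0)
    for (a, pa), (b, pb) in product(a_outcomes.items(), b_outcomes.items()):
        if a < b:
            total += pa * pb
    return total


half = Fraction(1, 2)
group_a = {59: half, 81: half}   # assumed point masses at the ends of A's interval
group_b = {39: half, 61: half}   # assumed point masses at the ends of B's interval

print(prob_a_below_b(group_a, group_b))   # 1/4
```

Only one of the four equally likely combinations (A at 59, B at 61) puts A below B, which gives the 25%.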

Right. My understanding of SE and CI (I could be mistaken; it's been a while since stats class) is that they are based on normalizing skewed distributions via conversion of raw scores into standard z-scores? So I guess if the scores weren't standardized, you'd get the problem you're describing?

i have to echo a lot of other comments here: you appear to have completely missed the point of error bars! your quiz questions are designed to test technical knowledge irrelevant to what the error bars are designed for: quick visualization of the distribution of the data.

it's kind of disappointing to see that this point was made in comments to your post last year too, and you haven't responded to it either time. i also have to say that my favorite comment from last year's, posted by "Sober", was: "I wonder how many people would have got the correct results if you hadn't shown any error bars." heh. that's not a throw-away joke, by the way. it's a very clever and succinct way of explaining how informationally impoverished bar graphs are without some indication of distribution.

also, a note about paired tests: it's true that the way the chart is arranged, the group error bars are meaningless. however, it is not at all true to say that error bars are not important for visualizing the underlying data properly. the way to plot these results (a test-retest difference, for example), is to plot the mean of difference scores (i.e., what's the average change?), preferably with a confidence interval, which shows you quite easily whether zero falls within the range--or in comparing groups, allows you to see at a glance, "wow, that's a huge effect!" or "wow, those distributions are almost completely overlapping!"

i think you mentioned that some journals don't provide error bars in their figures.... which journals would those be? i'm a neuroscience grad student, and no journal that i read would publish figures like these without error bars of some kind. and probably because of that ubiquity in the literature, honestly, when i see bar graphs without error bars, my very first thought, before anything else, is: either the author's too scared to show them because they totally undermine the impact, or the author doesn't know what they are, in which case why should i care what they claim to have found? like someone else said, if it's just two bars on a chart, you're not giving any more information than a table with two numbers in it. it just looks kinda naked and ridiculous, or worse, like an attempt to mislead.

sorry, this has become a long rant. but if the above comments have not made it clear to you what we readers use error bars for (hint: it's not to confirm the p-value already calculated by the researchers), then please speak up so we can clarify! the beauty of reading a blog like this is being able to see cool results culled together and summarized *without* having to dig up the original source to see the non-crippled figure.

"but if the above comments have not made it clear to you what we readers use error bars for (hint: it's not to confirm the p-value already calculated by the researchers)"

You might want to take this quarrel up with the authors of the study assessing understanding of SE and CI among professionals, on which Dave's comments appear to be based. It appears their underlying assumption is that professionals are using it to determine significance rather than data distribution.

well you know, i might, except that...

1) they aren't saying error bars are useless or advocating their removal (as Dave noted).
2) they aren't making an editorial policy decision that will affect all future studies they report to me.
3) they don't have a blog with a comment section. :-)

but otherwise, yes, i agree that the paper's authors assume that making statistical inferences is *one* of the ways people use error bars. however, the authors themselves state that this is not the best way to think of error bars, and only chose to investigate it "because of the current dominance of null hypothesis statistical testing and p-value." thus, the analysis was not motivated by any *evidence* that this is actually what people try to do in reading error bars.

the convention is to plot data with error bars, and show significance testing either in the figure with asterisks or other marks, or in the figure caption. this convention exists because these two forms convey different information about the data, and both bits of information are useful.

"Error bars" (technically shown as lines) are a poor and ambiguous graphic fix to a larger data representation problem. Graphs show numeric results without statistical evaluation.

Error bars drawn as black lines expose the "grey area" in a much more subtle manner than the fat, bold, colored bars, leaving impressive-looking differences in statistically questionable results.

A better graphic solution is to drop the bar graph reporting the data points. Only show error bars in a bold, wide, colorful style and standardize them as showing 95% confidence. Place a thin line at the actual results.

The style you are using is better reserved for other things such as showing market volatility on a line graph of stock closing prices.

If the purpose of error bars is to give some indication of the spread of the data, why not throw out the bar chart and use a box and whisker chart? This provides a much better sense of the distribution's spread and skewness than error bars.

If your problem is with people who falsely interpret the *standard error* as a quantity, then by all means adopt some other measure which will be correctly interpreted. You'll know what that measure is when people look at the lines and make correct inferences from whether or not they cross, or read the numbers in a table and make correct inferences from the numbers.

But if your problem is with, not the quantity, but the visual devices alone, then completely omitting them from a column graph is just a counsel of despair, and omitting them *while keeping the far more egregious columns themselves* is getting into flea-with-camel-chaser territory. After all, if columns with error bars can't be interpreted as significantly different, then columns without error bars sure can't be.

Hence, if you think you've proven that people can't read graphs, have the courage of your convictions and stop drawing graphs. Let them read tables instead.
