Profile

Mark Chu-Carroll (aka MarkCC) is a PhD Computer Scientist, who works for Google as a Software Engineer. My professional interests center on programming languages and tools, and how to improve the languages and tools that are used for building complex software systems.


Basics: Mean, Median, and Mode

Category: Basics, statistics
Posted on: January 15, 2007 8:00 AM, by Mark C. Chu-Carroll

Statistics is something that surrounds us every day - we're constantly bombarded with statistics, in the form of polls, tests, ratings, etc. Understanding those statistics can be an important thing, but unfortunately, most people have never been taught just what statistics really mean, how they're computed, or how to distinguish the difference between statistics used properly, and statistics misused to deceive.

The most basic concept in statistics is the idea of an average. An average is a single number which represents the idea of a typical value. There are three different numbers which can represent the idea of an average value, and it's important to know which one is being used, and whether or not that is appropriate. The three values are the mean, the median, and the mode.

The mean is what most people are taught as the average in middle school math. Given a set of values, the mean is what you get by adding up all of the values, and dividing that sum by the number of values. Written in math notation, you have a set {x_1, ..., x_n} of values called the population. The mean is:

(Σ_{i ∈ 1..n} x_i)/n

The mean is a very useful number - it summarizes the properties of the group. It's important to understand that the mean does not represent an individual - in fact, there may be no individual whose value matches the mean; but the mean is a summary of the entire population.

The median is often a better representative of a typical member of a group. If you take all of the values in a list, and arrange them in increasing order, the number at the center will be the median. The median is an actual value belonging to some member of the group. Depending on the distribution of values, the mean may not be particularly close to the value of any member of the group; the mean is also subject to skew - as few as one value significantly different from the rest of the group can dramatically change the mean. The median gives you a central member of the group without the skew factor introduced by outliers. If you have a normal distribution, then the median value will be a typical member of the population. ("Normal" here is a technical term that I'll explain in a later post; for now, just take it to mean "not a totally weird distribution".)

The mode is the most common member of the group. It doesn't matter whether it's the biggest or smallest value in the group - whatever value is most common is the mode. The mode is the least commonly used of these three average measures, and that's because it's generally the least meaningful. But once in a while, it's useful. If your data is perfectly regular, then the mean, median, and mode will all be the same value. This almost never happens in real life. As a rule of thumb for moderately skewed data, you'll find that the median is in between the mode and the mean, often closer to the mean.

Let's look at a couple of examples. Suppose we wanted to know something about the income of a population of people. (Just to be clear, the following numbers are completely made up to give me good examples of how these three measures differ, not because they are in any way representative of real income distribution!) We'll imagine a group of 24 people. In order, they make $20,000, $20,000, $22,000, $25,000, $25,000, $25,000, $28,000, $30,000, $31,000, $32,000, $34,000, $35,000, $37,000, $39,000, $39,000, $40,000, $42,000, $42,000, $43,000, $80,000, $100,000, $300,000, $700,000, and $3,000,000.

The mean income of this group is about $200,000 (the sum of the incomes is $4,789,000; there are 24 members; so the mean is $4,789,000/24 ≈ $199,542). The median is $37,000 (when the number of values is even, there are two middle values, here $35,000 and $37,000. A common convention is to take the mean of the two, $36,000; depending on what you're doing, you can also pick the higher number, the lower number, or the number closer to the mean. Here I've picked the higher one, so that the median is still an actual member of the group). The mode is $25,000.

This case demonstrates the skew effect of the mean quite clearly. The value of the mean is larger than 21 of the 24 actual members of the group - more than 85% of them! (Even with our selecting the larger of the two possible values of the median, the mean is more than five times the median!) The median is a much better measure of a typical member of a group. In this case (as is pretty common) the mode is not particularly meaningful.
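The income figures can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are mine, and the list is simply the incomes exactly as written out above. Counting them gives 24 values, which is the count the script divides by.

```python
from statistics import median, median_high, median_low, mode

# Incomes exactly as listed in the post (made-up numbers).
incomes = [
    20_000, 20_000, 22_000, 25_000, 25_000, 25_000, 28_000, 30_000,
    31_000, 32_000, 34_000, 35_000, 37_000, 39_000, 39_000, 40_000,
    42_000, 42_000, 43_000, 80_000, 100_000, 300_000, 700_000, 3_000_000,
]

count = len(incomes)
total = sum(incomes)
mean_income = total / count

# With an even number of values there are two middle values.
lower_middle = median_low(incomes)
upper_middle = median_high(incomes)
averaged_median = median(incomes)  # mean of the two middle values

most_common = mode(incomes)

# How many members of the group earn less than the mean?
below_mean = sum(1 for income in incomes if income < mean_income)

print(f"count: {count}, total: {total:,}, mean: {mean_income:,.0f}")
print(f"middle values: {lower_middle:,} and {upper_middle:,}")
print(f"averaged median: {averaged_median:,.0f}, mode: {most_common:,}")
print(f"members below the mean: {below_mean} of {count}")
```

Whichever of the middle values you pick as the median, the mean lands far above it, which is the skew effect in a nutshell.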

A common trick that people use with statistics is using the median where the mean is more appropriate, or the mean where the median is more appropriate.

For example, it's a pretty common trick when talking about incomes to talk about how the mean income of a large group of people has increased - when in fact, the typical member of the group did not get any raise - instead one or two outliers got huge raises, and everyone else got nothing. Suppose that you had ten employees, and you gave them pay-changes of -2%, -2%, 0%, 0%, 0%, 0%, 1%, 1%, 3%, 20%. The mean salary change would be about +2% (+2.1%, to be exact). But six of the ten employees saw either no change or a decrease; and in fact, almost all of the increase went to just one person. Take that one person out, and the average raise drops by nearly a factor of 20 to 0.11%.
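Here's a quick sketch of that pay-change arithmetic in Python, in case you want to play with the numbers. The names are my own, and the values are percentage points straight from the example.

```python
from statistics import median

# Pay changes in percent for the ten employees in the example.
changes = [-2, -2, 0, 0, 0, 0, 1, 1, 3, 20]

mean_change = sum(changes) / len(changes)

# Drop the single largest raise and recompute the mean.
without_top = sorted(changes)[:-1]
mean_without_top = sum(without_top) / len(without_top)

# The median says what the employee in the middle of the pack got.
median_change = median(changes)

# Employees who saw no change or a decrease.
no_raise = sum(1 for change in changes if change <= 0)

print(f"mean change: {mean_change:.2f}%")
print(f"mean without the top raise: {mean_without_top:.2f}%")
print(f"median change: {median_change}%")
print(f"employees with no raise: {no_raise} of {len(changes)}")
```

The median change is zero, which matches the intuition that the typical employee got nothing.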

Another example of misuse, in the other direction: there's a creationist who published a book under the pen-name of John Woodmorappe explaining how Noah's ark could have actually held all of the animals. He did this by computing the median size of a species, and then multiplying that by the number of species. This is wrong, because the median is simply not an appropriate measure for talking about the population as a whole; it identifies a typical member of the population, but it doesn't extrapolate well.

To see why, let's look at a specific example. Imagine we had a group of animals, and their masses were: 10@0.02kg, 10@0.1kg, 20@0.2kg, 20@0.3kg, 5@1kg, 5@2kg, and 10@5kg (where "10@0.02kg" means ten animals of 0.02kg each). The total mass is 76.2kg. There are 80 individuals; their mean mass is about 0.95kg; their median mass is (leaning toward the high side) 0.3kg. If you use the mean to estimate the mass of the population, you'll get 76kg (slightly off because we rounded the mean). If you use the median to estimate the mass of the population, you'll get 24kg - less than 1/3 the correct value. Woodmorappe used this trick to try to make it look like you could fit more animals on the ark than you actually could - as you can see from the example, using the median to reason about the population as a whole can give you ridiculously wrong answers.
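The same comparison, as a small Python sketch. The `(count, mass)` pairs are the animal groups from the example; everything else (names, the use of `math.fsum` to keep floating-point noise down) is my own choice.

```python
from math import fsum
from statistics import median_high

# (number of animals, mass of each in kg), as in the example.
groups = [(10, 0.02), (10, 0.1), (20, 0.2), (20, 0.3), (5, 1.0), (5, 2.0), (10, 5.0)]

# Expand the groups into one mass per individual animal.
masses = [mass for count, mass in groups for _ in range(count)]

population = len(masses)
true_total = fsum(masses)
mean_mass = true_total / population
median_mass = median_high(masses)  # lean toward the high side, as in the text

# Scale each "average" back up to the whole population.
estimate_from_mean = round(mean_mass, 2) * population
estimate_from_median = median_mass * population

print(f"population: {population}, true total: {true_total:.1f} kg")
print(f"mean: {mean_mass:.4f} kg, median: {median_mass} kg")
print(f"total estimated from rounded mean: {estimate_from_mean:.1f} kg")
print(f"total estimated from median: {estimate_from_median:.1f} kg")
```

Multiplying the mean by the population size recovers the total by definition (up to rounding); multiplying the median by the population size has no such guarantee.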

Comments

One of the reasons why the mean is the preferred representative of a population is that statistical tests, such as Student's t-test, can be performed on it.

Posted by: SLC | January 15, 2007 08:17 AM

Thank you for the basic stats refresher. Interestingly statistics was the only math class I absolutely aced in college. I actually understood it. Everything else was just a blur.

Posted by: Tony P | January 15, 2007 08:23 AM

You might mention how median and mean are defined if you have a continuous probability distribution, instead of simply a number of samples.

Of course, you could assume that anyone able to use the term "continuous probability distribution" accurately already knows, but just in case:

Given a probability distribution p[x] such that:

Integrate[p[x], {x, -Infinity, Infinity}] == 1

Then the mean is:

Integrate[x*p[x], {x, -Infinity, Infinity}]

And the median is the solution L to the problem:

Integrate[p[x], {x, -Infinity, L}] == 0.5

(The notation used is Mathematica's)
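For readers without Mathematica, the three integrals can be approximated numerically in Python. This is only a sketch under assumptions that are not in the comment above: it uses an exponential density p(x) = e^(-x) for x >= 0 as the example distribution, a simple midpoint rule for the integrals, a finite upper limit standing in for infinity, and bisection to solve for the median. For this density the exact answers are known (mean 1, median ln 2), which makes the approximation easy to check.

```python
import math

RATE = 1.0    # assumed rate parameter of the example exponential density
UPPER = 50.0  # finite stand-in for +Infinity; the tail beyond it is negligible here


def density(x):
    """Example density p(x) = RATE * exp(-RATE * x) for x >= 0, else 0."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


def find_median(f, lower, upper, tolerance=1e-6):
    """Bisection for L such that the integral of f from `lower` to L is 0.5."""
    start = lower
    while upper - lower > tolerance:
        middle = (lower + upper) / 2
        if integrate(f, start, middle, steps=20_000) < 0.5:
            lower = middle
        else:
            upper = middle
    return (lower + upper) / 2


total_probability = integrate(density, 0.0, UPPER)
mean_value = integrate(lambda x: x * density(x), 0.0, UPPER)
median_value = find_median(density, 0.0, UPPER)

print(f"total probability: {total_probability:.4f}")
print(f"mean: {mean_value:.4f}")
print(f"median: {median_value:.4f} (ln 2 = {math.log(2):.4f})")
```

Note that the mean and the median differ even for this well-behaved distribution, because the long right tail pulls the mean upward.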

Posted by: Daniel Martin | January 15, 2007 08:49 AM

Another thing I just realized: what you have there is the arithmetic mean. Occasionally, it is useful to get into other definitions of mean (geometric mean, harmonic mean, or the generalized power mean). Presumably, anyone interested in those can go look them up in Wikipedia.

Related to that, one thing you might want to address is how the mean of derived quantities generally needs to be computed by aggregating the components first. An example helps to clarify this:

Suppose I drive a mile at 30mph and then another mile at 60mph. What was my average speed over the two miles?

It wasn't 45mph. It was 40mph.
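To see where the 40mph comes from, aggregate the components first: total distance divided by total time. A minimal Python sketch (the names are illustrative; `statistics.harmonic_mean` is the standard-library shortcut that applies when the legs have equal length):

```python
from statistics import harmonic_mean, mean

# (distance in miles, speed in mph) for each leg of the trip.
legs = [(1.0, 30.0), (1.0, 60.0)]

total_distance = sum(distance for distance, _ in legs)
total_time = sum(distance / speed for distance, speed in legs)  # hours

average_speed = total_distance / total_time

speeds = [speed for _, speed in legs]
naive_average = mean(speeds)           # arithmetic mean of the speeds
harmonic_average = harmonic_mean(speeds)  # matches only because the legs are equal

print(f"total time: {total_time * 60:.1f} minutes")
print(f"average speed: {average_speed:.1f} mph")
print(f"arithmetic mean of speeds: {naive_average:.1f} mph")
print(f"harmonic mean of speeds: {harmonic_average:.1f} mph")
```

The slower mile takes two minutes and the faster mile takes one, so more of the trip's time is spent at 30mph, which is why the true average sits below 45mph.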

Last night I saw a commercial for some anti-cholesterol medication (clearly poorly made, since I can't remember the brand name). In the commercial it claimed "[[drug name]] lowered cholesterol by an average of 30 points - that's 18%". Now, I want to know how that "18%" figure was arrived at - was it done by taking the average starting cholesterol of everyone on the medication and dividing 30 by that figure, or was it done by averaging the percentage drops everyone got? The two methods are not equivalent, and I don't trust the drug company to use the most appropriate definitions if that fails to paint their product in the absolute best light possible. (And I'm not certain either method is appropriate - I want to know the median effect of this drug, or would if I were a potential patient)

Actually, although creationist claims are easy prey around here, I'm certain that there are just tons of misleading/bad uses of statistics all over any product being promoted for money.

Posted by: Daniel Martin | January 15, 2007 09:03 AM

The newspapers can be particularly misleading when they talk about "the average employee", "the average American", etc. One can be lulled into the false sense that they're talking about somebody who actually exists.

For example, last year the financial press, who really should show some numeracy, reported that the average employee of Goldman Sachs earns half a million dollars per year, even when every secretary and janitor was taken into account. Of course this is skewed massively by the few people at the top who earn many hundreds of millions. I'd love to know what the median figure was.

Posted by: Neil | January 15, 2007 09:29 AM

If you're going for basics, I'd leave out the formula you provide. While it does add extra information, it may lead to more confusion amongst beginners than if it weren't there. Also, you use the term distribution without adequately defining it. I'd avoid the term altogether -- same goes for "normal distribution" -- in an introduction to mean/median/mode.

But one glaring error is that it's unclear (to someone who isn't familiar with the notation, a beginner) how to calculate the mean. Your example using incomes is good, but it could be presented more clearly. For example, you could format the data in a table to show how the median is the middle number when the data are sorted. It would also help to show skew graphically, so that it's clear how the mean, median, and mode are different when the data are not normal (although don't use the term normal distribution).

You introduce the topic quite well, and the closing examples at the end are very good. But the description in the middle is a bit muttledstart off good, examples at end are good, but middle is a bit muddled. The examples at the end would have much more impact if the following suggestions are implemented.

Posted by: RPM | January 15, 2007 09:30 AM

SLC,
There is a t-test using the median or percentile instead. It's called a "ranked t-test" and some people say that's what we should always use since it's more robust to outliers. I think that's what I would use if I believed in null hypothesis testing.

Posted by: BenE | January 15, 2007 09:30 AM

And my comment above would be a lot clearer if the final paragraph read as follows:

You introduce the topic quite well, and the closing examples at the end are very good. But the description in the middle is a bit muddled. The examples at the end would have much more impact if the suggestions above are implemented.

Posted by: RPM | January 15, 2007 09:32 AM

"If you're going for basics, I'd leave out the formula you provide."

Wow, a big part of the beauty of mathematics is the concision and expressiveness of its language and formulas. These make it easier to understand otherwise hard concepts, not the other way around.

Although, since we are going back to the basics, maybe a post on the basics of mathematical notation, what it means, how it's made and guidelines to make new useful notational constructs would be in order.

Posted by: BenE | January 15, 2007 09:41 AM

I'm curious about one point. I recall being taught that "mean" was a more specific and preferable term for "average." I have no recollection of median or mode being classed as different types of "average."

In the professional community, if I were to say "average," would folks ask, "Which one - mean, median, or mode?".

Posted by: Scott Belyea | January 15, 2007 09:48 AM

Nice article!

For me, it always helped to use the physical analogy--the mean is the "center of gravity" of your distribution. I.e. if you plunked your data points down on the x-axis, the mean is where they balance perfectly. (The fulcrum, if you think of it as a lever.) The median, on the other hand, is the position where you have equally many data points to the right or to the left.

On a side note, I feel the urge to point out (to anyone who might be wondering why we talk about the mode at all, if it's so useless) that the mode is much more meaningful if your data are continuous, rather than discrete. So there is a good reason for defining the thing--other than making things more confusing for students :-)

Posted by: Cat lover | January 15, 2007 09:48 AM

"Wow, a big part of the beauty of mathematics is the concision and expressiveness of its language and formulas."

Yep. To anyone with a CS or math background (most decent undergrad CS programs provide you with the equivalent of a minor in math) it is obvious and meaningful.
To most everyone else, not so much, and for a wider audience a primer in all the neat Greek letters and their meanings might be useful.

Posted by: Mondo | January 15, 2007 10:08 AM

The big problem with including that formula is that the math markup here is so bad that it doesn't look like an equation in a book, typeset in LaTeX or written on a chalkboard. Can we please browbeat the ScienceBlogs tech people into catching up with Jacques Distler? I mean, superscripts and subscripts don't work; HTML entities get replaced in source text upon preview, making dashes and Greek letters come out funny; and I can't cite more than one URL per post without dumping my comment in the spam queue. It's, well, discouraging.

Other than that, it would be a good idea to explain the notation at slightly greater length. What is that Σ and how does it relate to a sum?

Posted by: Blake Stacey | January 15, 2007 10:32 AM

Blake:

I fixed the problem with superscripts and subscripts in comments - you should be able to use them with no trouble now. Things with URLs do get out of the mod queue - it just takes a bit of time for me to get to them. (If things like the links causing delays really bugs people, I can turn typekey back on, and then not moderate anything that comes from an authenticated user.)

The HTML entities in preview issue, there's nothing I can do about; it's deep in the guts of the Movable Type software that we're using.

Posted by: Mark C. Chu-Carroll | January 15, 2007 11:03 AM

Many thanks -- or, should I say, (thanks)^n.

Posted by: Blake Stacey | January 15, 2007 11:15 AM

Oh, also:

In the paragraph beginning, "The mean income of this group", the phrase inside parentheses is unfinished. There's also a dangling "><\p>" at the end of the penultimate paragraph.

Posted by: Blake Stacey | January 15, 2007 11:19 AM

I would also like to request a refresher on basic notation, especially that used in logical propositions, such as upside-down A's and backwards E's. I had some of that in freshman calculus many years ago, but don't remember all of it.

Posted by: JimV | January 15, 2007 11:40 AM

Very nice! I hope you do standard deviation and standard error.

As to RPM's point, I think the equations are important, don't take them out!

But it would be helpful to explain some of the notation. Some people don't know or won't remember that this symbol Σ, for example, is used to represent a sum.

Posted by: Sandra Porter | January 15, 2007 12:08 PM

One area where the distinction between arithmetic and geometric mean is important is investment returns. In the geometric mean you take the nth root of the product of the n data points (see http://en.wikipedia.org/wiki/Geometric_mean for the formula). An investment that goes up 20% one year and down 20% the next has an annual arithmetic mean return of 0%, but you have actually lost 4% of your beginning dollars. The geometric mean of the growth factors for (20%, -20%) is (1.2 * 0.8)^.5 ≈ 0.9798, i.e. a compound annualized return of -2.02%. The geometric mean is also easily convertible into logarithms and therefore continuous, rather than periodic, rates of return. Most research in financial economics deals with continuous returns rather than periodic as the math is much simpler to deal with.
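The investment arithmetic can be reproduced in a few lines of Python. This is just a sketch with illustrative names; it works on growth factors (1 + return) and uses `math.prod`, which needs Python 3.8 or later.

```python
import math

# Yearly returns from the example: up 20%, then down 20%.
returns = [0.20, -0.20]

arithmetic_mean_return = sum(returns) / len(returns)

# Compounding multiplies growth factors, so average those geometrically.
growth_factors = [1 + r for r in returns]
final_value = math.prod(growth_factors)  # value of $1 after both years
geometric_mean_factor = final_value ** (1 / len(growth_factors))
annualized_return = geometric_mean_factor - 1

# Continuously compounded (log) returns add up instead of multiplying.
log_returns = [math.log(factor) for factor in growth_factors]
mean_log_return = sum(log_returns) / len(log_returns)

print(f"arithmetic mean return: {arithmetic_mean_return:.2%}")
print(f"value of $1 after two years: {final_value:.4f}")
print(f"annualized (geometric) return: {annualized_return:.2%}")
print(f"mean log return: {mean_log_return:.4f}")
```

The last two lines are the same quantity in different clothing: exponentiating the mean log return gives back the geometric mean growth factor.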

Posted by: BWV | January 15, 2007 12:27 PM

A more general ensemble average discussion would be fun too! (i.e. observables and higher order distribution moments)

Posted by: jarvisjd | January 15, 2007 12:45 PM

One group of cases where looking for the mode is useful is when you have heterogeneity in the population. The easiest way for beginners to visualize this is by making a stem-and-leaf plot (they teach those in middle / high school math) for test scores in a large school, half of whose students are a "magnet school" group. In the s-&-l plot, they'd see two bulges popping out, one representing the most frequent value for the regular students and another for the most frequent value of the eggheads who've been shipped in from around the state.

Given how popular meta-analyses are these days in all fields, beginners should also know that the same can happen in looking for the effect size of a treatment. Effect sizes found among industry-funded and non-industry-funded studies, effects among one racial group vs another (like BiDil), and so on -- the mean and median will obscure this heterogeneity, but looking for and finding two modes will let people know something's up. Industry-funded studies may be vastly more likely to find support for their drugs, Europeans and Africans might not respond the same way to heart failure drugs, and a decent "average test score" at a magnet school might mean that there are haves and have-nots forming separate sub-populations on either side of the average.

Posted by: Agnostic | January 15, 2007 01:00 PM

Just a note to say that the *only* college course that had direct application in every job I ever held (I'm now retired.) was statistics.

I'll also second the request for a refresher in logical notation.

Posted by: chezjake | January 15, 2007 01:39 PM

Mark - thanks, nice post. You've been reading Huff, haven't you?

"In the professional community, if I were to say "average," would folks ask, "Which one - mean, median, or mode?"."

You probably wouldn't, because "average" would either be taken to be a bit vaguer, or to refer to the (arithmetic) mean: you can usually tell from the context.

Incidentally, we have a whole bunch of means to use to confuse people: arithmetic, geometric, harmonic, weighted, trimmed, and even Winsorised.

Bob

Posted by: Bob O'H | January 15, 2007 01:49 PM

The mode is most useful in the case where the distribution of your observations is bi-modal or multi-modal. This sometimes comes from combining two disparate populations into one measure.

Posted by: ruidh | January 15, 2007 01:56 PM

I used to explain mode by reminding students of
pie a la mode (that's the pie as in American Apple and the rest is French to me)
or fashionable (designer fashion)
but the analogies are no longer in Vogue so the advantage as an explanation is no longer stylish.

Seriously, the "average" is an important concept for people to understand, especially in the -est battles between states, cities, and tribes, e.g., http://ykalaska.wordpress.com/2006/03/13/richest-cities-in-the-us-bethel/

Thanks for helping.

Posted by: mpb | January 15, 2007 07:38 PM
