Statistics surround us every day – we’re constantly bombarded with statistics, in the form of polls, tests, ratings, etc. Understanding those statistics can be important, but unfortunately, most people have never been taught just what statistics *really* mean, how they’re computed, or how to distinguish between statistics used properly and statistics misused to deceive.

The most basic concept in statistics is the idea of an *average*. An average is a single number which represents the idea of a *typical* value. There are three *different* numbers which can represent the idea of an average value, and it’s important to know *which one* is being used, and whether or not that is appropriate. The three values are the mean, the median, and the mode.

The mean is what most people are taught as the average in middle school math. Given a set of values, the mean is what you get by adding up all of the values, and dividing that sum by the number of values. Written in math notation, you have a set {x_{1}, …, x_{n}} of values called the *population*. The mean is:

$$\frac{\sum_{i=1}^{n} x_{i}}{n}$$

The mean is a very useful number – it summarizes the properties of the *group*. It’s important to understand that the mean does *not* represent an individual – in fact, there may be no individual whose value matches the mean; but the mean is a summary of *the entire population*.
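As a minimal sketch of that definition, here is the mean written out in Python. The three sample values are made up for illustration; they were chosen so that no member of the sample equals the mean.

```python
def mean(values):
    """Add up all of the values, then divide by how many there are."""
    return sum(values) / len(values)

# Hypothetical sample, chosen only for illustration.
sample = [2, 3, 10]
sample_mean = mean(sample)

print(sample_mean)            # 5.0
print(sample_mean in sample)  # False: no member of the sample has the mean value
```

The mean here describes the group as a whole, even though nobody in the group actually has the value 5.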

The median is often a better representative of a *typical* member of a group. If you take all of the values in a list and arrange them in increasing order, the number at the center will be the median. The median is an actual value belonging to some member of the group. The mean, by contrast, may not be particularly close to the value of any member of the group, depending on the distribution of values; and the mean is also subject to *skew* – as few as *one* value significantly different from the rest of the group can dramatically change the mean. The median gives you a central member of the group without the skew factor introduced by outliers. If you have a *normal* distribution, then the median value will be a *typical* member of the population. (“Normal” here is a technical term that I’ll explain in a later post; for now, just take it to mean “not a totally weird distribution”.)
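To make the effect of a single outlier concrete, here is a small sketch using Python’s standard `statistics` module. The five values are hypothetical; the only point is to compare the mean and median before and after one value is replaced by an extreme one.

```python
from statistics import mean, median

# Hypothetical values, chosen only for illustration.
group = [30, 31, 32, 33, 34]
with_outlier = group[:-1] + [3000]  # replace the largest value with an outlier

mean_before, median_before = mean(group), median(group)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)  # 32 32
print(mean_after, median_after)    # 625.2 32
```

One changed value moves the mean from 32 to 625.2, while the median stays at 32.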

The *mode* is the most common value in the group. It doesn’t matter whether it’s the biggest or smallest value in the group – whatever value is *most common* is the mode. The mode is the least commonly used of these three average measures, and that’s because it’s generally the least meaningful. But once in a while, it’s useful. If your data is perfectly regular, then the mean, median, and mode will all be the same value. This almost never happens in real life. As a rule of thumb, when the data is skewed you’ll usually find that the median is in between the mode and the mean; how close it sits to either one depends on how extreme the outliers are.

Let’s look at a couple of examples. Suppose we wanted to know something about the income of a population of people. (Just to be clear, the following numbers are completely made up to give me good examples of how these three measures differ, not because they are in any way representative of real income distribution!) We’ll imagine a group of 24 people. In order, they make

$20,000, $20,000, $22,000, $25,000,

$25,000, $25,000, $28,000, $30,000,

$31,000, $32,000, $34,000, $35,000,

$37,000, $39,000, $39,000, $40,000,

$42,000, $42,000, $43,000, $80,000,

$100,000, $300,000, $700,000, and $3,000,000.

The *mean* income of this group is about $200,000. (The sum of the incomes is $4,789,000; there are 24 members; so the mean is about $199,500.)

For the median, we arrange the values in order, with half the values on one side, and half on the other. To make it all fit, we’ll write the number of thousands, without the trailing “000”:

20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34,

35, 37,

39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000

The median is the value with the same number of things on either side of it. In this case, our population has an even number of members, so there are two middle values, $35,000 and $37,000, and we need to pick one of them. (Another common convention is to average the two.) The correct thing to do in a case like this depends on what you’re doing – in general, you try to do the *conservative* thing, which means choosing the one that is *least* likely to skew the data in favor of a particular conclusion. In this case, since we’re looking at uneven salary distributions, skewing it *upwards* (picking the larger of the two) is going to reduce the imbalance, but certainly not eliminate it; skewing it downwards will make the difference even larger, which will exaggerate the results. So we’ll pick the higher of the middle values: $37,000.

The mode is $25,000.
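For readers who want to check these numbers, here is a sketch that computes all three averages for the made-up incomes, using Python’s standard `statistics` module. Note that `median` averages the two middle values, while `median_low` and `median_high` pick one of them; `median_high` matches the choice made above.

```python
from statistics import mean, median, median_low, median_high, mode

# The 24 made-up incomes from the example, in dollars.
incomes = [
    20_000, 20_000, 22_000, 25_000, 25_000, 25_000, 28_000, 30_000,
    31_000, 32_000, 34_000, 35_000, 37_000, 39_000, 39_000, 40_000,
    42_000, 42_000, 43_000, 80_000, 100_000, 300_000, 700_000, 3_000_000,
]

income_total = sum(incomes)            # 4,789,000
income_mean = mean(incomes)            # about 199,542
income_median_low = median_low(incomes)    # 35,000: lower of the two middle values
income_median_high = median_high(incomes)  # 37,000: higher of the two middle values
income_median_avg = median(incomes)        # 36,000: average of the two middle values
income_mode = mode(incomes)            # 25,000: appears three times

print(f"mean   ${income_mean:,.0f}")
print(f"median ${income_median_high:,} (high), ${income_median_low:,} (low)")
print(f"mode   ${income_mode:,}")
```

Whichever median convention you pick, the result lands between $35,000 and $37,000, far below the mean.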

This case demonstrates the skew effect of the mean quite clearly. The value of the mean is larger than the income of more than 80% of the actual members of the group! (Even with our selecting the larger of the two possible values of the median, the mean is more than five times larger than the median!) The median is a much better measure of a *typical* member of a group. In this case (as is pretty common) the mode is not particularly meaningful.
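The two comparisons in that paragraph can be checked with a short sketch. It uses the same made-up incomes, written in thousands of dollars, and takes the higher of the two middle values as the median.

```python
from statistics import mean, median_high

# The made-up incomes from the example, in thousands of dollars.
incomes_k = [20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35,
             37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000]

mean_k = mean(incomes_k)

# How many members earn less than the mean?
below_mean = sum(1 for income in incomes_k if income < mean_k)
share_below = below_mean / len(incomes_k)

# How many times larger is the mean than the (high) median?
mean_to_median = mean_k / median_high(incomes_k)

print(below_mean, share_below)   # 21 0.875
print(round(mean_to_median, 2))  # 5.39
```

So 21 of the 24 members (87.5%) earn less than the mean, and the mean is about 5.4 times the median.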

A common trick that people use with statistics is using the median where the mean is more appropriate, or the mean where the median is more appropriate.

For example, it’s a pretty common trick when talking about incomes to talk about how the *mean* income of a large group of people has increased – when in fact, the typical member of the group did *not* get any raise – instead one or two outliers got huge raises, and everyone else got nothing. Suppose that you had ten employees, and you gave them pay changes of -2%, -2%, 0%, 0%, 0%, 0%, 1%, 1%, 3%, 20%. The *mean* pay change would be about +2%. But six of the ten employees saw either no change or a decrease; and in fact, almost all of the increase went to just one person. Take that one person out, and the average raise drops by nearly a factor of *20* to 0.11%.
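Here is a sketch of that arithmetic. It averages the percentage changes themselves, as the example does; that equals the change in mean salary only if everyone started at the same salary, which is an assumption of this sketch.

```python
from statistics import mean, median

# Pay changes for the ten employees, in percent.
changes = [-2, -2, 0, 0, 0, 0, 1, 1, 3, 20]

mean_all = mean(changes)      # 2.1
median_all = median(changes)  # 0.0: the typical employee got no raise
no_raise = sum(1 for change in changes if change <= 0)  # 6 of 10

# Drop the single 20% raise and recompute the mean.
mean_without_top = mean(changes[:-1])  # 1/9, about 0.11

print(mean_all, median_all, no_raise)
print(round(mean_without_top, 2))             # 0.11
print(round(mean_all / mean_without_top, 1))  # 18.9
```

The median change of 0% describes the typical employee far better than the mean of about 2%.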

Another example of misuse the other way: there’s a creationist who published a book under the pen-name of John Woodmorappe explaining how Noah’s ark could have actually held all of the animals. He did this by computing the *median* size of a species, and then multiplying that by the number of species. This is wrong, because the median is simply not an appropriate measure for talking about the population as a whole; it identifies a *typical* member of the population, but it doesn’t extrapolate well.

To see why, let’s look at a specific example. Imagine we had a group of animals, and their masses (written as count@mass) were: 10@0.02kg, 10@0.1kg, 20@0.2kg, 20@0.3kg, 5@1kg, 5@2kg, and 10@5kg. The *total* mass is 76.2kg. There are 80 individuals; their *mean* mass is about 0.95kg; their *median* mass is (leaning toward the high side) 0.3kg. If you use the mean to estimate the mass of the population, you’ll get 76kg (slightly off because the mean was rounded). If you use the median to estimate the mass of the population, you’ll get 24kg – less than 1/3 the correct value. Woodmorappe used this trick to try to make it look like you could fit more animals on the ark than you actually could – as you can see from the example, using the median to reason about the population as a whole can give you ridiculously wrong answers.
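The animal example can be reproduced with a short sketch. The `(count, mass)` pairs are the made-up numbers from the paragraph above, and `median_high` is used to lean toward the high side as the text does.

```python
from statistics import mean, median_high

# (count, mass in kg) pairs from the example.
groups = [(10, 0.02), (10, 0.1), (20, 0.2), (20, 0.3), (5, 1.0), (5, 2.0), (10, 5.0)]

# Expand the pairs into one mass per individual animal.
masses = [mass for count, mass in groups for _ in range(count)]

total_mass = sum(masses)           # about 76.2 kg
mean_mass = mean(masses)           # about 0.95 kg
median_mass = median_high(masses)  # 0.3 kg

# Estimate the total by multiplying each "average" by the head count.
estimate_from_mean = mean_mass * len(masses)      # recovers about 76.2 kg
estimate_from_median = median_mass * len(masses)  # about 24 kg

print(f"true total:           {total_mass:.1f} kg")
print(f"estimate from mean:   {estimate_from_mean:.1f} kg")
print(f"estimate from median: {estimate_from_median:.1f} kg")
```

Multiplying the unrounded mean by the count recovers the true total, while the median-based estimate comes out at less than a third of it.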