When we look at the data for a population, often the first thing we do
is look at the mean. But even if we know that the distribution
is perfectly normal, the mean alone isn’t enough to tell us what we need to know about the population. We also need
to know something about how the data is spread out around the mean: that is, how wide the bell curve is around the mean.
There’s a basic measure that tells us that: it’s called the standard deviation. The standard deviation describes the spread of the data,
and is the basis for how we compute things like the degree of certainty,
the margin of error, etc.
Suppose we have a population of data points, P = {p₁, …, pₙ}. We know that the mean is
the sum of the points (the pᵢ) divided by the number of
points, |P|. The way to describe the spread is based roughly on the concept
of the average distance of the points from the mean.
So what happens if we naively compute the average of the differences between the data points and the mean? That is, if the mean is M and the average difference is d, can we use the following?

$$d = \frac{1}{|P|} \sum_{p \in P} (p - M)$$

Unfortunately, that won’t work. If we work it through, we find that, by the definition of the mean, the average difference d will always
be 0. The mean is the balance point of the
distribution: in a simple sum of the differences, the positive differences (from values larger than the mean) exactly cancel the negative differences (from values smaller than the mean), and so the sum, and therefore the average, must be 0.
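A quick numerical check makes the cancellation concrete. The following is a minimal Python sketch; the function name `mean_signed_difference` and the four data values are made up for illustration.

```python
def mean_signed_difference(points):
    """Return the average of (p - M) over all points, where M is the mean."""
    m = sum(points) / len(points)
    return sum(p - m for p in points) / len(points)

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
points = [2, 4, 4, 10]          # the mean is 5
d = mean_signed_difference(points)

# The differences are -3, -1, -1 and +5, which cancel exactly.
print(d)  # 0.0
```

With less tidy data, floating-point rounding may leave a tiny nonzero residue, but the result is still zero for all practical purposes.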
How do we get around that? By making all of the differences positive. And how do we do that? Square them. The standard deviation, which is usually written σ, is a root-mean-square measure, which means that it’s the square root of the mean (average) of
the squared differences between the points and the mean. The mean of the squares, before taking the square root, is also a useful figure, called the variance; the variance is just the square of the standard deviation, that is, σ². The standard deviation written in equational form, where M is the mean, and P is the set of points, is:

$$\sigma = \sqrt{\frac{1}{|P|} \sum_{p \in P} (p - M)^2}$$

Let’s run through an example. Take the list of salaries from the mean article: [ 20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35, 37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000 ]. The sum of these is 4789. There are 24 values. So the mean (rounding off to 2 significant figures) is 4789/24 = 200. So what’s the standard deviation?
- First, we’ll compute the sum of the squares of the differences:
(20-200)² + (20-200)² + (22-200)² + (25-200)² + … + (700-200)² + (3000-200)² =
32400+32400+31684+30625+30625+30625+29584+28900+28561+28224+27556+27225+26569+25921+25921+25600+24964+24964+24649+14400+10000+10000+250000+7840000 = 8661397.
- Then we’ll divide by the number of points: 8661397/24 ≈ 360891.5. So the variance is roughly 360,000.
- Then take the square root of the variance: the square root of 360,000 = 600.
So, for our salaries, the mean is $200,000 with a standard deviation of $600,000. That right there should be enough to give us a good sense that there’s something very strange about the distribution of numbers here – because salaries can’t be less than zero, but the standard deviation is three times the size of the mean!
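The three steps above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic in the example; the variable names are invented here, and it uses the rounded mean of 200 so that the intermediate figures match the text.

```python
import math
import statistics

# Salaries (in thousands of dollars) from the example above.
salaries = [20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35,
            37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000]

n = len(salaries)
total = sum(salaries)
rounded_mean = 200  # the text rounds 4789/24 to two significant figures

# Step 1: sum of the squared differences from the (rounded) mean.
sum_of_squares = sum((s - rounded_mean) ** 2 for s in salaries)

# Step 2: divide by the number of points to get the variance.
variance = sum_of_squares / n

# Step 3: take the square root to get the standard deviation.
std_dev = math.sqrt(variance)

print(total, n, sum_of_squares)  # 4789 24 8661397
print(std_dev)

# Cross-check with the standard library, which uses the unrounded mean.
print(statistics.pstdev(salaries))
```

Both results come out a little over 600, which agrees with the text once the variance is rounded to 360,000.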
But what does the standard deviation mean precisely? The best way to define it is in probabilistic terms. In a population P with roughly normal distribution, mean M, and standard deviation σ:
- About 2/3 (roughly 68%) of the values in P will be within the range M +/- σ.
- About 95% of the values will be within the range M +/- 2σ.
- About 99.7% of the values will be within the range M +/- 3σ.
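The fractions for a normal distribution can be computed rather than memorized. This sketch assumes Python 3.8 or later, which provides `statistics.NormalDist`; the helper name `normal_coverage` is made up for illustration.

```python
from statistics import NormalDist

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Because every normal distribution is a shifted and scaled copy of the standard one, the same fractions hold for any mean M and standard deviation σ.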
For any population P with mean M and standard deviation σ, regardless of whether the distribution is
normal (this is Chebyshev’s inequality: at least 1 - 1/k² of the values lie within kσ of the mean):
- At least 1/2 of the values in P will be within the range M +/- 1.4σ (more precisely, √2σ).
- At least 3/4 of the values in P will be within the range M +/- 2σ.
- At least 8/9 of the values in P will be within the range M +/- 3σ.
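Distribution-free guarantees of this kind come from Chebyshev’s inequality, which says that at least 1 - 1/k² of the values lie within k standard deviations of the mean. The sketch below computes that bound and compares it with the heavily skewed salary list from the earlier example; the function names are invented for illustration.

```python
import math

def chebyshev_lower_bound(k):
    """Guaranteed minimum fraction within k standard deviations (informative for k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(points, k):
    """Observed fraction of points within k population standard deviations of the mean."""
    n = len(points)
    m = sum(points) / n
    sigma = math.sqrt(sum((p - m) ** 2 for p in points) / n)
    return sum(1 for p in points if abs(p - m) <= k * sigma) / n

# Salaries (in thousands of dollars) from the earlier example.
salaries = [20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35,
            37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000]

for k in (math.sqrt(2), 2, 3):
    bound = chebyshev_lower_bound(k)
    observed = fraction_within(salaries, k)
    print(round(k, 2), round(bound, 3), round(observed, 3))
```

Even for data this far from normal, the observed fraction never falls below the bound; the guarantee is simply much looser than the figures for a normal distribution.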
If you have a population P which is very large, you often want to make
an estimate about the population using a sample, where a sample
is a subset P’ ⊂ P of the population. Since the standard deviation computed from a sample tends to slightly underestimate the standard deviation of the population as a whole, we add a correction factor for sampled populations. In the equation for the standard deviation, instead of dividing by the size of the sample, |P’|, we divide by the size of the sample minus one: |P’|-1. The ideal correction factor is a lot more complicated, but in practice, the “subtract one from the size of the sample” trick (known as Bessel’s correction) is an excellent approximation, and so it’s used nearly universally.
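As a small illustration of the “subtract one” correction, the sketch below computes a sample standard deviation by hand and compares it with Python’s `statistics` module, where `stdev` divides by n - 1 and `pstdev` divides by n. The function name `sample_std_dev` and the five sample values are hypothetical.

```python
import math
import statistics

def sample_std_dev(sample):
    """Standard deviation of a sample, dividing by n - 1 instead of n."""
    n = len(sample)
    m = sum(sample) / n
    return math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))

# Hypothetical sample of five values, invented for this illustration.
sample = [25, 31, 39, 42, 80]

corrected = sample_std_dev(sample)
uncorrected = statistics.pstdev(sample)  # divides by n

print(corrected > uncorrected)                            # True
print(math.isclose(corrected, statistics.stdev(sample)))  # True
```

The corrected value is always the larger of the two, and the gap shrinks as the sample grows, since dividing by n - 1 and dividing by n become nearly the same thing.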
Next topic in the basics will be something closely related: confidence intervals and margins of error.