The Basics of Statistics I: The Normal Distribution

So here's the first post on statistics. If you know the basics, and I suspect most of you do, then you can just ignore these posts (unless you want to check to make sure I'm getting it right). If you don't know the basics, then hopefully you will when I'm done. Even for those of you who've never taken a stats class, much of this will probably be familiar, but I'm going to start from the assumption that I'm writing for someone who has no knowledge of statistics whatsoever, so bear with me. Alright, let's begin.

The Normal Distribution

In cognitive psychology, two related types of statistics are used: descriptive and inferential. Descriptive statistics are just what the name says: descriptions of data. Inferential statistics are used to draw inferences about populations from samples. Since the two are related, I'm going to talk about them both pretty much at the same time. And to do that, we have to start with the normal distribution and the Central Limit Theorem (CLT).

In essence, the CLT says that if you have a bunch of independent random variables that (additively) determine the value of another variable, then as long as a few constraints are met (particularly finite variance, but that won't make sense until we get to variance), the distribution of that variable will be approximately normal. Take, for example, height. A person's height is determined by a bunch of independent random variables like genetics, nutrition, and the amount of solar radiation (maybe?), so at least within a population (say, the adult population in the United States), height will tend to be normally distributed. That is, if you calculate the number of people at each particular height, and then graph those frequencies (represented as probabilities) for one gender, you'll get a graph that looks something like this:

That's the classic "bell curve," or the normal distribution.
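
If you want to see the CLT in action for yourself, here's a minimal simulation sketch (in Python; my own illustration, not part of the original post). It adds up ten independent, made-up random "factors" for each simulated person and tallies how the resulting sums are distributed; the counts pile up around the middle and thin out toward the tails.

    import random
    from collections import Counter

    # Simulate "height" as a baseline plus ten independent random influences.
    # The baseline, the number of factors, and their ranges are made up purely for illustration.
    def simulated_height():
        base = 150  # baseline in cm (arbitrary)
        factors = [random.uniform(-5, 5) for _ in range(10)]
        return base + sum(factors)

    heights = [simulated_height() for _ in range(100_000)]

    # Crude text histogram: bin to the nearest centimeter and print a bar per bin.
    bins = Counter(round(h) for h in heights)
    for value in range(130, 171, 2):
        print(f"{value:3d} cm | {'#' * (bins[value] // 200)}")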

Now the reason the CLT is important is that it lets us (by us, I mean psychologists) assume the normal distribution in most cases, and that's important because the normal distribution has certain well-known properties that make it excellent for computing both descriptive and inferential statistics. We'll start with measures of central tendency. There are three basic measures of central tendency:

  • The mean, which is just the average. I'm sure you know how the mean is computed, but just in case, you compute it like this:

    μ = ΣX/N

    Where μ is the mean, ΣX is the sum of all of the instances of variable X, and N is the number of instances. Put more simply, the mean is just the sum of all the instances divided by the number of instances.

  • The median, or the middle value. That is, the value for which half of the instances are greater and half are lower. So, if we had the following values for X: 10, 13, 17, 6, and 15, then the median would be 13. If you have an even number of instances, then there is no middle instance, so you compute the median by adding the two middle instances and dividing by 2. For example, if you added 21 to the above instances of X, the median would now be (13 + 15)/2, or 14.
  • The mode is the most frequent value. So if you have these values for X: 12, 21, 17, 14, 7, 8, 23, 8, 14, 20, 8, 13, then the mode is 8.

One of the great features of the normal distribution is that within it, the mean, median, and mode are the same. That is, the average value is also the value with an equal number of instances above and below it, and the most frequent value.
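
To make those three measures concrete, here's a minimal Python sketch (my own illustration, not part of the original post) that computes all three for a small made-up set of values, using the standard library's statistics module:

    from statistics import mean, median, mode

    values = [12, 21, 17, 14, 7, 8, 23, 8, 14, 20, 8, 13]  # the made-up values from the mode example above

    print(mean(values))    # the sum of the values divided by how many there are
    print(median(values))  # the middle value (average of the two middle values for an even count)
    print(mode(values))    # the most frequent value: 8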

The next great feature of the normal distribution concerns variability. It's all well and good to know the central tendency of the distribution of a variable, but that doesn't tell you a whole lot unless you know the spread of that variable, or how much each instance of the variable tends to differ from the others and from the mean. The first measure of spread, or variability, is the variance. The variance is computed like this (for a population):

σ² = Σ(X - μ)² / N

In the equation, σ² is the variance, Σ(X - μ)² is the sum (Σ) of the squared differences between each instance of X and the mean, and N is the number of instances. So the variance is computed by subtracting the mean from each value of X, squaring that, adding the results, and dividing by the number of instances. Put simply, the variance is the average squared distance from the mean.

You may be wondering why each X - μ is squared in the equation. Well, it's quite simple. The deviations above the mean exactly balance the deviations below it, so if you just added them up unsquared, they'd cancel each other out and you'd get 0, and that doesn't help you very much. But a squared value is difficult to interpret, so in addition to the variance as a measure of spread, we also use the standard deviation (represented as σ, for populations), which is calculated simply by taking the square root of the variance. So now you have a number that basically gives you the average distance of the values of a variable from the mean of that variable.
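
Here's the same variance and standard deviation computation spelled out as a small Python sketch (again my own illustration, with made-up values), dividing by N as in the population formula above:

    from math import sqrt

    x = [10, 13, 17, 6, 15]  # made-up instances of X
    n = len(x)
    mu = sum(x) / n  # the mean

    # Average squared distance from the mean (population variance),
    # then its square root, the standard deviation.
    variance = sum((xi - mu) ** 2 for xi in x) / n
    std_dev = sqrt(variance)

    print(mu, variance, std_dev)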

Perhaps the most important feature of the normal distribution, for our purposes, is that it allows you to compute the probability of getting a value at or below any particular value. How, you ask? Well, consider this normal distribution:

[Figure: a normal distribution curve, with a vertical line marking the mean at 50.]

The area under the curve represents probability. The line down the middle of the distribution (at 50) is the mean. The space to the left of the mean represents 50% of the area, and thus the probability of getting a value less than the mean is 50%. The same is true of the area above the mean, and so the probability of getting a value above it is also 50%. But you knew that from the discussion of central tendency above. The wonderful thing about normal distributions is that you can also compute the probability associated with any value of a variable by computing the area under the curve to the left of that value. You do this with a nice little equation that I'm too lazy to write out, and that you'll never ever need to use anyway.
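
For the curious, here's a small Python sketch of how you could actually get that area without looking the equation up, using the normal distribution's cumulative probability written in terms of the error function (the mean of 50 matches the figure; the standard deviation of 10 is made up for illustration):

    from math import erf, sqrt

    def normal_cdf(x, mu, sigma):
        # Probability of getting a value at or below x for a normal distribution
        # with mean mu and standard deviation sigma.
        return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    print(normal_cdf(50, 50, 10))  # 0.5: half the area lies below the mean
    print(normal_cdf(60, 50, 10))  # about 0.84: the area below one standard deviation above the mean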

Why this is all important will begin to become clear in the next post, but I think this is enough for now.

Good idea for a post, but everything you've said that's good about the normal distribution is true of any symmetric, unimodal distribution. One of the distinctive things about the normal distribution within this class is that all of its cumulants beyond the second are zero.

But, really, the normal distribution is important because of the CLT, which implies we can use the same procedures for a wide variety of problems, without having to think about what we're doing. This is also what's really awful about the normal distribution.

You hear that... it's the standing ovation you got for the quality work here!

I thought my grad class on stats broke it down easily till I came across this post. Anyone with a basic understanding of math should understand this, which is very important.

The more people understand basic stats, the more likely they are to understand the world and its interactions. For instance, I think a better understanding of stats would lead to fewer people believing that 66,000 children are abducted every year in the US (not trying to get political here, just an example I heard of crappy stats), and the more intelligent our society could become.

Thanks for this post! I'll be reading the rest. I'm taking statistics next semester, but we all know how useless classes can be for learning sometimes ^^.

js, you're right, it's true of any symmetrical, unimodal distribution. But the rest of the posts are about starting from the normal distribution and going through its different properties to ultimately arrive at hypothesis testing for cases when we don't know the parameters of the population. Which is why I say that it will start to become clear in the next posts.

And you're right, it's both a blessing and a curse.

"the variance is the average squared distance from the mean"

Since this is a basics post I can mention my favorite heuristic for getting to, and remembering, that fact.

Model the data as springs connected to a point representing the mean, each spring stretched by a distance proportional to that data point's deviation from the mean, in either direction.

Spring force is linear with distance, so the work done stretching each spring goes as the square of that distance. We get that from work being force applied over a distance.

Work done is energy stored, so the variance given in the post represents the average energy in a spring. We get that from the formula, seeing that variance is the summed-up energy (total energy) divided by the number of springs (data points).

Now we go back again. A 'variance' spring with average energy represents average distance from the mean.

[A variant of the same model helps to understand linear regression, btw.]

By Torbjörn Larsson, OM on 29 Jun 2007

Human heights happen to be a normal-ish distribution, but they're not a good illustration of the Central Limit Theorem. First of all, there is no reason to suspect that the variables are additive, random, or independent. Secondly, the appropriate "large number" in the CLT would be the number of independent quantities per person. In practice, this might be just a few parameters---suppose that your height is proportional to the sum of your parents' heights, two or so nutritional degrees-of-freedom, and a random noise variable. It's just four or five parameters; in my mind that's usually not enough to invoke the CLT. In reality, we have to look at the height distribution and say "Yeah, that looks pretty normal if we plot US males only."---we can't predict it from any useful principle.

Why is the height distribution so Gaussian-looking? Well, probably because the main input variables happen to be Gaussian-looking, and convolving two Gaussians tends to give you another Gaussian. The CLT may, however, tell you why the inputs are so normal: you could consider the heights of your ensemble of great-great-grandparents to be sixteen independent quantities which sum in a CLT way. "No matter what the human height distribution was in 1880, if offspring heights are simple averages of parent heights AND if marriages don't sort by height, then today's height distribution will be normal"---that's an accurate illustration of the CLT. "Human height depends on some random variables, and is therefore normal" is not. Just a thought.

"Work done is energy stored," - Work done on a spring is energy stored in it,

By Torbjörn Larsson, OM on 29 Jun 2007

Ben, you're right, of course. And I could have said all of that. And I'd have lost half the people in the first paragraph... heh.

I use height as an example, as many stats teachers do, 'cause it's something people are familiar with. I'd use IQ, but since that's a standardized distribution, it would be getting ahead of ourselves.

Ben:

Interesting.

Though as a note, to come back to the blessing and the curse with the Gaussian, it must be pretty much the null hypothesis in most cases, especially for noise. The problem is when people don't check it thoroughly or model the causes properly. But you seem to take some care. ;-)

By Torbjörn Larsson, OM on 29 Jun 2007
