So here’s the first post on statistics. If you know the basics, and I suspect most of you do, then you can just ignore these posts (unless you want to check to make sure I’m getting it right). If you don’t know the basics, then hopefully you will when I’m done. Even for those of you who’ve never taken a stats class, much of this will probably be familiar, but I’m going to start from the assumption that I’m writing for someone who has no knowledge of statistics whatsoever, so bear with me. Alright, let’s begin.
The Normal Distribution
In cognitive psychology, two related types of statistics are used: descriptive and inferential. Descriptive statistics are just what the name says: descriptions of data. Inferential statistics are used to draw inferences about populations from samples. Since the two are related, I’m going to talk about them both pretty much at the same time. And to do that, we have to start with the normal distribution and the Central Limit Theorem (CLT).
In essence, the CLT says that if you have a bunch of independent random variables that (additively) determine the value of another variable, then as long as you meet a few constraints (particularly finite variance, but that won’t make sense until we get to variance), the distribution of that variable will be approximately normal. Take, for example, height. A person’s height is determined by a bunch of more or less independent random variables like genetics, nutrition, and the amount of solar radiation (maybe?), so at least within a population (say, the adult population in the United States), height will tend to be normally distributed. That is, if you count the number of people at each particular height, and then graph those frequencies (represented as probabilities) for one gender, you’ll get a graph that looks something like this (from here):
That’s the classic “bell curve,” or the normal distribution.
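If you’d like to see the CLT do its thing, here is a minimal sketch in Python. It is only an illustration under made-up assumptions: each simulated “person” is the sum of 30 independent random factors drawn uniformly between 0 and 1, and the names `simulate_sums`, `n_factors`, and `n_people` are mine, not anything standard. If the sums are roughly normal, about 68% of them should land within one standard deviation of the mean (standard deviation is covered further down).

```python
import random
import statistics


def simulate_sums(n_factors=30, n_people=10_000, seed=0):
    """Return n_people values, each the sum of n_factors independent uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() for _ in range(n_factors)) for _ in range(n_people)]


sums = simulate_sums()
mean = statistics.fmean(sums)
sd = statistics.pstdev(sums)  # population standard deviation

# Share of simulated values within one standard deviation of the mean.
within_one_sd = sum(abs(s - mean) <= sd for s in sums) / len(sums)

print(f"mean={mean:.2f}  sd={sd:.2f}  within one sd={within_one_sd:.3f}")
```

None of the individual factors is bell-shaped (each is flat between 0 and 1), yet their sum piles up around the middle, which is the point of the theorem.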
Now the reason the CLT is important is that it lets us (by us, I mean psychologists) assume the normal distribution in most cases, and that’s important because the normal distribution has certain well-known properties that make it excellent for computing both descriptive and inferential statistics. We’ll start with measures of central tendency. There are three basic measures of central tendency:
- The mean, which is just the average. I’m sure you know how the mean is computed, but just in case, you compute it like this:
μ = ΣX/N
Where μ is the mean, ΣX is the sum of all of the instances of variable X, and N is the number of instances. Put more simply, the mean is just the sum of all the instances divided by the number of instances.
- The median, or the middle value. That is, the value for which half of the instances are greater and half are lower. To find it, put the values in order and take the one in the middle. So, if we had the following values for X: 10, 13, 17, 6, and 15, then the median would be 13 (in order: 6, 10, 13, 15, 17). If you have an even number of instances, then there is no middle instance, so you compute the median by adding the two middle instances and dividing the sum by 2. For example, if you added 21 to the above instances of X, the median would now be (13 + 15)/2, or 14.
- The mode is the most frequent value. So if you have these values for X: 12, 21, 17, 14, 7, 8, 23, 8, 14, 20, 8, 13, then the mode is 8.
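For anyone who would rather check these by computer than by hand, here is a small sketch using Python’s built-in `statistics` module on the same example values as above. The variable names are just mine for illustration.

```python
import statistics

values = [10, 13, 17, 6, 15]

# Mean: the sum of all the instances divided by the number of instances.
mean = sum(values) / len(values)

# Median with an odd number of instances: the middle value once sorted.
median_odd = statistics.median(values)

# Median with an even number of instances: the average of the two middle values.
median_even = statistics.median(values + [21])

# Mode: the most frequent value.
mode = statistics.mode([12, 21, 17, 14, 7, 8, 23, 8, 14, 20, 8, 13])

print(mean, median_odd, median_even, mode)
```

The medians and the mode match the worked examples in the list: 13, 14, and 8.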
One of the great features of the normal distribution is that within it, the mean, median, and mode all have the same value. That is, the average value is also the value with an equal number of instances above and below it, and the most frequent value.
The next great feature of the normal distribution concerns variability. It’s all well and good to know the central tendency of the distribution of a variable, but that doesn’t tell you a whole lot unless you know the spread of that variable, or how much each instance of the variable tends to differ from the others and from the mean. The first measure of spread, or variability, is the variance. The variance is computed like this (for a population):
σ² = Σ(X – μ)² / N
In the equation, σ² is the variance, Σ(X – μ)² is the sum (Σ) of the squared differences between each instance of X and the mean, and N is the number of instances. So the variance is computed by subtracting the mean from each value of X, squaring each result, adding the squares together, and dividing by the number of instances. Put simply, the variance is the average squared distance from the mean.
You may be wondering why each X – μ is squared in the equation. Well, it’s quite simple. If you just added up all the differences above and below the mean, they’d cancel each other out: the mean is the balance point of the distribution, so the positive and negative differences from it always sum to exactly 0, and that doesn’t help you very much. Squaring makes every difference positive. But a squared value is difficult to work with (it’s in squared units), so in addition to the variance as a measure of spread, we also use the standard deviation (represented as σ, for populations), which is calculated simply by taking the square root of the variance. So now you have a number, in the same units as the variable itself, that basically gives you the typical distance of the values of a variable from the mean of that variable.
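Here is a quick sketch of those calculations in Python, reusing the five values from the median example and treating them as a whole population (that choice of data is mine, purely for illustration). It also shows the cancelling-out problem: the unsquared differences from the mean add up to zero, give or take floating-point rounding.

```python
import math
import statistics

values = [10, 13, 17, 6, 15]
n = len(values)
mu = sum(values) / n  # population mean

# Raw differences from the mean cancel out.
sum_of_differences = sum(x - mu for x in values)

# Population variance: the average squared difference from the mean.
variance = sum((x - mu) ** 2 for x in values) / n

# Population standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(f"sum of differences={sum_of_differences:.10f}")
print(f"variance={variance:.2f}  standard deviation={std_dev:.2f}")

# The standard library's population versions give the same answers.
print(statistics.pvariance(values), statistics.pstdev(values))
```

Note the `p` in `pvariance` and `pstdev`: those are the population formulas, the ones that divide by N as in the equation above.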
Perhaps the most important feature of the normal distribution, for our purposes, is the fact that it allows you to compute the probability of getting a value up to a particular value. How, you ask? Well, consider this normal distribution:
The area under the curve represents probability. The line down the middle of the distribution (at 50) is the mean. The space to the left of the mean represents 50% of the area, and thus the probability of getting a value less than the mean is 50%. The same is true of the area above the mean, and so the probability of getting a value above it is also 50%. But you knew that from the discussion of central tendency above. The wonderful thing about normal distributions is that you can also compute the probability associated with any value of a variable by computing the area under the curve to the left of that value. You do this with a nice little equation that I’m too lazy to write out, and that you’ll never ever need to use anyway.
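In practice you let software compute that area for you. Here is a minimal sketch using `statistics.NormalDist` from Python’s standard library (Python 3.8 or later). The mean of 50 comes from the example above, but the standard deviation of 10 is purely my assumption for illustration. For the curious, the second calculation uses one standard way of writing the area to the left of x, namely ½[1 + erf((x – μ)/(σ√2))], where erf is the “error function.”

```python
import math
from statistics import NormalDist

# Mean of 50 as in the example; the standard deviation of 10 is an assumed value.
dist = NormalDist(mu=50, sigma=10)

# Area under the curve to the left of the mean: half the distribution.
p_below_mean = dist.cdf(50)

# Probability of getting a value up to 60 (one assumed standard deviation above the mean).
p_below_60 = dist.cdf(60)


def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x, written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


p_manual = normal_cdf(60, 50, 10)

print(f"{p_below_mean:.2f} {p_below_60:.4f} {p_manual:.4f}")
```

With these assumed numbers, the probability of a value up to 60 comes out at roughly 84%, and the two ways of computing it agree.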
Why this is all important will begin to become clear in the next post, but I think this is enough for now.
