Before we start in on new stuff, let’s recap what we’ve covered so far. We started with the Central Limit Theorem, which tells us that if a bunch of independent random variables add up to determine the values of yet another variable, then the values of that variable will approximate a normal distribution. The normal distribution is great because the measures of central tendency (the mean, median, and mode) coincide, and because the measures of spread (variance and standard deviation) can be associated with specific probabilities (derived from the area under the curve in the distribution).
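If you’d like to see the Central Limit Theorem recap in action, here is a rough sketch using only Python’s standard library. The setup is my own assumption for illustration: each value is the sum of 30 independent uniform random draws, and we check that about 68% of the values land within one standard deviation of the mean, as a normal distribution would predict.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Each value is the sum of 30 independent uniform(0, 1) draws
# (an arbitrary choice, just to have "a bunch of random variables").
totals = [sum(random.random() for _ in range(30)) for _ in range(10_000)]

m, sd = mean(totals), pstdev(totals)

# Share of values within one standard deviation of the mean.
# For a normal distribution this is roughly 0.68.
within_1sd = sum(abs(t - m) <= sd for t in totals) / len(totals)

print(round(m, 2), round(sd, 2), round(within_1sd, 3))
```

None of the individual draws is normally distributed, but their sums come out looking very close to normal.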
Then we saw that with a simple equation, z = (x – μ)/σ, we can convert any normal distribution into a standard normal distribution, with a mean of 0 and a standard deviation of 1. This is great because it allows us to use probabilities that we already know, instead of having to use a long equation to get the probabilities associated with particular values of a variable. It also allows us to compare variables that are measured with different units (e.g., pounds and inches). I don’t think I mentioned that before, but it’s a definite advantage of using a standardized normal distribution.
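To make the units point concrete, here is a small sketch of standardizing with z = (x – μ)/σ. The population means and standard deviations below are made up for illustration, and `z_score` is just a helper name I picked.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Convert a raw value to standard-deviation units."""
    return (value - mean) / sd

# Hypothetical populations, invented for this example.
weight_z = z_score(180, mean=160, sd=20)  # pounds
height_z = z_score(70, mean=66, sd=4)     # inches

# Both values sit one standard deviation above their means,
# so they are directly comparable despite the different units.
print(weight_z, height_z)  # 1.0 1.0

# Probability of a value at or below z = 1 in a standard normal distribution.
print(round(NormalDist().cdf(weight_z), 4))  # 0.8413
```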
Simple stuff, really. The next step is where some people get confused, so I’m going to spend a little bit more time on this post’s topic than the last. Before we start, a note about notation. In the previous two posts, all of the equations have used Greek symbols, specifically μ, σ², and σ to refer to the mean, variance, and standard deviation respectively. The Greek symbols are used when we’re talking about populations. When we’re talking about samples, which is what we’ll be doing most of the time from now on, we drop the Greek pretensions and go with Roman letters. So μ becomes x̄ (“x-bar”), σ² becomes s², and σ becomes s. They still refer to the mean, variance, and standard deviation, but of samples from a population instead of the population itself.
What’s a sample? Well, it’s just a set of instances drawn from the population. Why would you draw a sample from a population? Well, to learn about the population, of course. If you knew the mean and standard deviation of a population, you’d have no need to take a sample from it. In most cases, though, we don’t know the parameters (fancy word for things like the mean and standard deviation) of the population, and measuring the value of a variable for every member of a population is rarely feasible. So the best thing to do is to take a sample to estimate the parameters of its parent population.
Of course, you want that sample to be representative of the population (a representative sample), but it’s usually difficult to tell whether a sample is in fact representative, since you don’t know the parameters (mean, variance, etc.) of the population. So the next best thing is a random sample. Truly random samples should give you a good approximation of the population’s parameters. In fact, a common misconception is that small samples can’t tell you anything about a population. For a really small sample (what a small sample is depends on what you’re measuring, really), there can be problems, but once you reach a certain size, as long as you’ve taken steps to ensure that your sample is random, you can make reliable inferences about the population from it.
The equations for the parameters of a sample are the same as those for the population, but with the Roman letters in place of the Greek ones. Just for review, I’ll give you the equations again, this time for samples:
- mean: x̄ = Σx/n, which just means the sum of all the instances of variable X in the sample, divided by the number of instances in the sample (n; we’ll use little n’s for samples).
- variance: s² = Σ(x – x̄)²/n, or the sum of the squares of the mean subtracted from each instance of x, divided by the number of instances.
- standard deviation: s = √s²
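Here is how those three equations might look in code. This is a minimal sketch with made-up data; the function names are mine, and the variance divides by n exactly as in the equation above.

```python
from math import sqrt

def sample_mean(xs):
    """Sum of the instances divided by the number of instances."""
    return sum(xs) / len(xs)

def sample_variance_n(xs):
    """Sum of squared deviations from the mean, divided by n."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

m = sample_mean(sample)          # 40 / 8 = 5.0
var = sample_variance_n(sample)  # 32 / 8 = 4.0
sd = sqrt(var)                   # 2.0

print(m, var, sd)  # 5.0 4.0 2.0
```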
Got it? OK, here’s where things get tricky. If we really wanted to get the best estimate of the population’s parameters, then instead of taking one random sample, we’d take a bunch of random samples. That’s because the parameters of random samples will tend to be off a little from those of the population (why this is the case should be fairly obvious, but if it’s not, think back to the talk about probabilities in the last two posts). If instead of taking one sample, we take 30 samples, and combine the means of each sample into a distribution, we will have a sampling distribution of the mean, or just a sampling distribution.
Before we talk more about the sampling distribution, head on over to this page and click “Begin.” You’ll see a normal distribution for a population at the top, and three empty graphs below it. The second graph from the top is where your sample data will go. Click “Animate,” and it will give you a sample from the population. It will then take that sample’s mean, and stick it in the third graph from the top. This is the graph for the sampling distribution. After you’ve clicked “Animate” a couple times, click “5.” This will give you five samples, and then put all five means from those samples into the sampling distribution graph. Doesn’t look like much yet, does it? Next, click “1,000.”
Back? OK, what did you notice about the sampling distribution? With 1,000 sample means in it, it looked almost normal, didn’t it? And it lined up nicely with the population distribution in the top graph. Even the means looked the same (with enough sample means, they should be very nearly the same). But there’s something else that you probably noticed (everyone does). Compared to the population distribution, the sampling distribution is really thin. Adding more samples doesn’t make it any thinner; it just fills in the shape. What does make it thinner is increasing the size of each sample, as we’ll see in a moment. Why is it thin in the first place? Well, because every value that goes into the sampling distribution is a sample mean, and because every sample is drawn from the population, so its mean is an approximation of the population mean, the values that go into the sampling distribution will tend to cluster around the population’s mean, making for a distribution with much less spread (or variance) than the population distribution.
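If you’d rather poke at this yourself than take the applet’s word for it, here is a rough simulation of the same idea. Everything in it is an assumption for illustration: the population is 100,000 made-up values with mean 100 and standard deviation 15, and `sampling_distribution` is a helper name I chose. It builds 1,000 sample means for two different sample sizes and compares their spread with the population’s.

```python
import random
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population (made-up mean and standard deviation).
population = [random.gauss(100, 15) for _ in range(100_000)]

def sampling_distribution(pop, sample_size, n_samples):
    """Return the means of n_samples random samples of sample_size each."""
    return [mean(random.sample(pop, sample_size)) for _ in range(n_samples)]

means_n5 = sampling_distribution(population, sample_size=5, n_samples=1000)
means_n25 = sampling_distribution(population, sample_size=25, n_samples=1000)

print("population mean:    ", round(mean(population), 2))
print("mean of means, n=25:", round(mean(means_n25), 2))
print("population sd:      ", round(pstdev(population), 2))
print("sd of means, n=5:   ", round(pstdev(means_n5), 2))
print("sd of means, n=25:  ", round(pstdev(means_n25), 2))
```

The mean of the sample means should sit close to the population mean, and the spread of the sample means should be much smaller than the population’s, smaller still for the larger sample size.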
We can now compute the same parameters for the sampling distribution that we compute for populations and samples. The mean of the sampling distribution for a population is the same as the mean for the population, μ. The most important parameter, for our purposes, will be the standard deviation. For a population with a standard deviation σ, the standard deviation of the sampling distribution will be:
σx̄ = σ/√n
That is, the standard deviation for a sampling distribution from a population will be equal to the standard deviation of the population divided by the square root of the sample size (n, the number of instances in each sample, not the number of samples). But instead of calling this the standard deviation, we usually call this the standard error of the mean, or just the standard error (represented as σx̄). The great thing about this is that we can now use the area under the sampling distribution curve to estimate the probabilities associated with getting a particular value for the mean of a sample, using the standard error like we used the standard deviation to compute probabilities for population distributions. That will come in handy when we’re hypothesis testing. Oh, and if you’re wondering, yes, you can compute a sampling distribution of any statistic, not just means. But we’ll mostly be concerned with means for now.
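As a quick worked example of using the standard error to get a probability, here is a sketch that assumes we somehow know the population’s parameters. The numbers (mean 100, standard deviation 15, samples of 25) are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical population parameters (made up for this example).
mu, sigma = 100, 15
n = 25  # number of instances in each sample

standard_error = sigma / sqrt(n)  # 15 / 5 = 3.0

# The sampling distribution of the mean: centered on mu,
# with the standard error as its standard deviation.
sampling_dist = NormalDist(mu, standard_error)

# Probability that a sample of 25 has a mean of 106 or more
# (106 is two standard errors above mu).
p = 1 - sampling_dist.cdf(106)

print(standard_error, round(p, 4))  # 3.0 0.0228
```

So a sample mean that far above the population mean would turn up only about 2% of the time, which is the kind of statement hypothesis testing is built on.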
You might have noticed a problem in the discussion of the sampling distribution. Computing the standard error requires knowing the standard deviation of the population (σ), and the reason we were taking samples in the first place was because we didn’t know σ. That could be a big problem, but we’re in luck. We can just substitute the standard deviation we got from the sample as an estimate of the standard deviation of the population. The only issue with this is that when we do so, we’re dealing with a different distribution, the Student’s t-distribution (if you want to know how it got that name, look it up; it’s a cool story involving beer). The equation for t is this:
t = (x̄ – μ)/(s/√n)
Or the sample mean minus the population mean, divided by the standard deviation of the sample over the square root of the sample size (i.e., the standard error). The standard error in a t-distribution is represented as sx̄, or sometimes just s.e.
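Here is a small sketch of that equation in code, with made-up data and a made-up population mean to compare against. `t_statistic` is my own helper name. Note that Python’s `statistics.stdev` divides by n – 1 rather than n when computing the sample standard deviation.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n))."""
    n = len(sample)
    s = stdev(sample)             # sample standard deviation (n - 1 denominator)
    standard_error = s / sqrt(n)
    return (mean(sample) - mu) / standard_error

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, mean 5
t = t_statistic(sample, mu=4)      # mu = 4 is a hypothetical population mean

print(round(t, 3))  # 1.323
```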
But there’s a complication with the t-distribution. There isn’t, in fact, a single t-distribution. The actual shape of the t-distribution differs depending on the degrees of freedom. What’s that? Well, once you’ve computed the mean of a sample, all but one of the instances are free to vary. That is, if you know the mean and all but one of the instances, you can compute the value of the final instance. So the degrees of freedom (the number of instances free to vary) is df = n – 1, or the sample size minus one. Once you know which t-distribution you’re working with (by computing the degrees of freedom), you can then find the probability associated with your sample mean by looking up the t value from the equation above in that distribution.
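The “all but one is free to vary” idea is easy to check for yourself. In this sketch (made-up data again), we pretend the last instance is unknown and recover it from the mean and the other n – 1 instances.

```python
sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
n = len(sample)
sample_mean = sum(sample) / n      # 5.0

# Pretend the last value is unknown. The mean and the other
# n - 1 values pin it down exactly.
known = sample[:-1]
recovered = sample_mean * n - sum(known)

df = n - 1  # degrees of freedom

print(recovered, df)  # 9.0 7
```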
Which brings us to the final issue for this post. Because the value of one of the instances is not free to vary, when using a sample to estimate the population’s standard deviation, as we do when computing t using the above equation, we use degrees of freedom instead of the sample size in the equation. So in the equation for the variance of a sample above, just replace n with n – 1 (dividing by n tends to underestimate the population’s variance, and n – 1 corrects for that). The standard deviation is still the square root of the variance. So, anytime you’re dealing with a whole population, the variance is computed with n, and anytime you’re using a sample to estimate a population’s variance, use n – 1.
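To see the difference the denominator makes, here is a short sketch that computes the variance both ways on the same made-up data. `variance_of` is a helper I wrote for the comparison. Python’s standard library draws the same line: `pvariance` divides by n, `variance` by n – 1.

```python
from statistics import pvariance, variance

def variance_of(xs, denominator):
    """Sum of squared deviations from the mean, over the given denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / denominator

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
n = len(data)

pop_var = variance_of(data, n)         # treat the data as a whole population
sample_var = variance_of(data, n - 1)  # treat the data as a sample

print(pop_var, round(sample_var, 3))   # 4.0 4.571

# Cross-check against the standard library.
print(pvariance(data), round(variance(data), 3))
```

The n – 1 version is always a little larger, and the gap shrinks as the sample gets bigger.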
Next up, confidence intervals.