Before we start in on new stuff, let's recap what we've covered so far. We started with the Central Limit Theorem, which tells us that if a bunch of independent random variables add up to determine the values of yet another variable, then the values of that variable will approximate a normal distribution. The normal distribution is great because the measures of central tendency -- the mean, median, and mode -- coincide, and because the measures of spread (variance and standard deviation) can be associated with specific probabilities (derived from the area under the curve in the distribution).
Then we saw that with a simple equation, we can convert any normal distribution into a standard normal distribution, with a mean of 0 and a standard deviation of 1. This is great because it allows us to use probabilities that we already know, instead of having to use a long equation to get the probabilities associated with particular values of a variable. It also allows us to compare variables that are measured in different units (pounds and inches, say). I don't think I mentioned that before, but it's a definite advantage of using a standardized normal distribution.
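To make the units point concrete, here is a minimal sketch in Python. The population means and standard deviations are made-up numbers chosen only for illustration, and `z_score` is a hypothetical helper name, not anything from the earlier posts.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

# Hypothetical populations, invented for this example.
weight_z = z_score(190, mu=170, sigma=20)  # a weight in pounds
height_z = z_score(73, mu=69, sigma=4)     # a height in inches

# Once standardized, both values can be looked up on the same curve.
standard_normal = NormalDist(mu=0, sigma=1)
weight_p = standard_normal.cdf(weight_z)  # proportion at or below this weight
height_p = standard_normal.cdf(height_z)  # proportion at or below this height
```

Both values land exactly one standard deviation above their own means, so they share the same probability even though one is in pounds and the other in inches.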
Simple stuff, really. The next step is where some people get confused, so I'm going to spend a little bit more time on this post's topic than the last. Before we start, a note about notation. In the previous two posts, all of the equations have used Greek symbols, specifically μ, σ², and σ to refer to the mean, variance, and standard deviation respectively. The Greek symbols are used when we're talking about populations. When we're talking about samples, which is what we'll be doing most of the time from now on, we drop the Greek pretensions and go with Roman letters. So μ becomes x̄ (x with a bar over it), σ² becomes s², and σ becomes s. They still refer to the mean, variance, and standard deviation, but of samples from a population instead of the population itself.
What's a sample? Well, it's just instances drawn from the population. Why would you draw a sample from a population? Well, to learn about the population, of course. If you knew the mean and standard deviation of a population, you'd have no need to take a sample from it. In most cases, though, we don't know the parameters (fancy word for things like the mean and standard deviation) of the population, and measuring the value of a variable for every member of a population is rarely feasible, and is actually suboptimal in some cases [removed because it was wrong -- see comments]. So the best thing to do is to take a sample to estimate the parameters of its parent population.
Of course, you want that sample to be representative of the population (a representative sample), but it's usually difficult to tell whether a sample is in fact representative, since you don't know the parameters (mean, variance, etc.) of the population. So the next best thing is a random sample. Truly random samples should give you a good approximation of the population's parameters. In fact, a common misconception is that small samples can't tell you anything about a population. For a really small sample (what a small sample is depends on what you're measuring, really), there can be problems, but once you reach a certain size, as long as you've taken steps to ensure that your sample is random, you can make reliable inferences about the population from it.
The equations for the parameters of a sample are the same as those for the population, but with the Roman letters in place of the Greek ones. Just for review, I'll give you the equations again, this time for samples:
- mean: x̄ = Σx/n, which just means the sum of all the instances of variable X in the sample, divided by the number of instances in the sample (n; we'll use little n's for samples).
- variance: s² = Σ(x - x̄)²/n, or the sum of the squared differences between each instance of x and the mean, divided by the number of instances.
- standard deviation: s = √(s²)
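Here is a small worked example of the three equations exactly as written above (with n in the denominator of the variance). The sample values are invented for illustration.

```python
from math import sqrt

sample = [4, 8, 6, 5, 7]  # made-up measurements

n = len(sample)
mean = sum(sample) / n                               # Σx / n
variance = sum((x - mean) ** 2 for x in sample) / n  # Σ(x - mean)² / n
sd = sqrt(variance)                                  # square root of the variance

print(mean, variance, round(sd, 4))
```

The deviations from the mean are -2, 2, 0, -1, and 1, so the squared deviations sum to 10, giving a variance of 2 and a standard deviation of about 1.41.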
Got it? OK, here's where things get tricky. If we really wanted to get the best estimate of the population's parameters, then instead of taking one random sample, we'd take a bunch of random samples. That's because the parameters of random samples will tend to be off a little from those of the population (why this is the case should be fairly obvious, but if it's not, think back to the talk about probabilities in the last two posts). If instead of taking one sample, we take 30 samples, and combine the means of each sample into a distribution, we will have a sampling distribution of the mean, or just a sampling distribution.
Before we talk more about the sampling distribution, head on over to this page and click "Begin." You'll see a normal distribution for a population at the top, and three empty graphs below it. The second graph from the top is where your sample data will go. Click "Animate," and it will give you a sample from the population. It will then take that sample's mean, and stick it in the third graph from the top. This is the graph for the sampling distribution. After you've clicked "Animate" a couple times, click "5." This will give you five samples, and then put all five means from those samples into the sampling distribution graph. Doesn't look like much yet, does it? Next, click "1,000."
Back? OK, what did you notice about the sampling distribution? With 1,000 sample means in it, it looked almost normal, didn't it? And it lined up nicely with the population distribution in the top graph. Even the means looked the same (with enough sample means, they should be the same). But there's something else that you probably noticed (everyone does). Compared to the population distribution, the sampling distribution is really thin, and the larger each sample is, the thinner it gets (adding more samples just fills in the shape). Why is that? Well, because every value that goes into the sampling distribution is a sample mean, and because every sample is drawn from the population and therefore their means are approximations of the population mean, the values that go into the sampling distribution will tend to cluster around the population's mean, making for a distribution with much less spread (or variance) than the population distribution.
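If you'd rather see this in code than in an applet, here is a rough simulation sketch. The population mean and standard deviation (100 and 15) are assumed values picked for illustration, and `sampling_distribution` is a hypothetical helper, not part of any library.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 100, 15  # hypothetical population parameters

def sampling_distribution(sample_size, num_samples):
    """Draw num_samples random samples and return the list of their means."""
    return [
        mean(random.gauss(POP_MEAN, POP_SD) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

means_n5 = sampling_distribution(sample_size=5, num_samples=1000)
means_n25 = sampling_distribution(sample_size=25, num_samples=1000)

center = mean(means_n25)       # should sit close to POP_MEAN
spread_n5 = pstdev(means_n5)   # spread of the means of samples of 5
spread_n25 = pstdev(means_n25) # spread of the means of samples of 25
```

In a run like this, the sample means pile up near the population mean, and both spreads come out well below the population's standard deviation of 15, with the samples of 25 giving the narrower distribution.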
We can now compute the same parameters for the sampling distribution that we compute for populations and samples. The mean of the sampling distribution for a population is the same as the mean for the population, μ. The most important parameter, for our purposes, will be the standard deviation. For a population with a standard deviation σ, the standard deviation of the sampling distribution will be:
σ_x̄ = σ/√N
That is, the standard deviation for a sampling distribution from a population will be equal to the standard deviation of the population divided by the square root of the sample size (N here; it's the same quantity as the n in the sample equations above). But instead of calling this the standard deviation, we usually call this the standard error of the mean, or just the standard error (represented as σ_x̄). The great thing about this is that we can now use the area under the sampling distribution curve to estimate the probabilities associated with getting a particular value for the mean of a sample, using the standard error like we used the standard deviation to compute probabilities for population distributions. That will come in handy when we're hypothesis testing. Oh, and if you're wondering, yes, you can compute a sampling distribution of any statistic, not just means. But we'll mostly be concerned with means for now.
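As a quick sketch of how the standard error turns into a probability, assume a population with a known mean of 100 and standard deviation of 15 and samples of 25 (all three numbers are invented for the example).

```python
from math import sqrt
from statistics import NormalDist

mu = 100     # hypothetical population mean
sigma = 15   # hypothetical population standard deviation
n = 25       # sample size (the N in the equation above)

standard_error = sigma / sqrt(n)  # σ / √N

# The sampling distribution of the mean: centered on mu, spread = standard error.
sampling_dist = NormalDist(mu=mu, sigma=standard_error)

# Probability of drawing a sample whose mean is 106 or higher.
p_at_least_106 = 1 - sampling_dist.cdf(106)
```

Here the standard error is 3, so a sample mean of 106 sits two standard errors above the population mean, and the probability of a sample mean that high or higher is a little over 2%.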
You might have noticed a problem in the discussion of the sampling distribution. Computing the standard error requires knowing the standard deviation of the population (σ), and the reason we were taking samples in the first place was because we didn't know σ. That could be a big problem, but we're in luck. We can just substitute the standard deviation we got from the sample as an estimate of the standard deviation of the population. The only issue with this is that when we do so, we're dealing with a different distribution, the Student's t-distribution (if you want to know how it got that name, look it up; it's a cool story involving beer). The equation for a t value is this:
t = (x̄ - μ)/(s/√n)
Or the sample mean minus the population mean, divided by the standard deviation of the sample over the square root of the sample size (i.e., the standard error). The standard error in a t-distribution is represented as s_x̄, or sometimes just s.e.
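Here is a minimal sketch of that equation in Python. The sample and the population mean being compared against are made up, and `t_statistic` is a hypothetical helper name. One assumption to flag: `statistics.stdev` divides by n - 1 rather than n, which is the version of the sample standard deviation normally used with t.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu):
    """Return (sample mean - mu) divided by the estimated standard error."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)             # sample SD; note it uses n - 1 in the denominator
    standard_error = s / sqrt(n)  # s / √n
    return (x_bar - mu) / standard_error

sample = [4, 8, 6, 5, 7]          # made-up measurements
t = t_statistic(sample, mu=5)     # mu=5 is a hypothetical population mean
```

For this sample the mean is 6 and the estimated standard error is about 0.71, so t comes out to roughly 1.41.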
But there's a complication with the t-distribution. There isn't, in fact, a single t-distribution. The actual shape of the t-distribution differs depending on the degrees of freedom. What's that? Well, when you compute a mean using a sample, all but one of the instances are free to vary. That is, if you know the mean and all but one of the instances, you can compute the value of the final instance. So the degrees of freedom -- the number of instances free to vary -- is df = n - 1, or the sample size minus one. Once you know which t-distribution you're working with (by computing the degrees of freedom), you can then compute the probability associated with your sample mean using the equation above.
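The "all but one is free to vary" idea is easy to check with a few lines of code. This is only an illustration with invented numbers; `missing_value` is a hypothetical helper.

```python
def missing_value(known_values, mean, n):
    """Recover the one remaining value from the mean and the other n - 1 values."""
    # The mean times n is the total of all n values, so subtract what we know.
    return mean * n - sum(known_values)

n = 5
known = [4, 8, 6, 5]  # four of the five instances (made up)
last = missing_value(known, mean=6, n=n)

degrees_of_freedom = n - 1  # only n - 1 instances were free to vary
```

With a mean of 6 across five instances, the total has to be 30, and the four known values add up to 23, so the last one is forced to be 7.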
Which brings us to the final issue for this post. Because the value of one of the instances is not free to vary, when using a sample to estimate the population's standard deviation, as we do when computing a t-distribution using the above equation, we use degrees of freedom instead of the sample size in the equation. So in the equation for the variance of a sample above, just replace n with n - 1. The standard deviation is still the square root of the variance. So, anytime you're dealing with populations, the variance is computed with n, and anytime you're dealing with samples, use n-1.
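A simulation can show why the swap matters. The sketch below assumes a normal population with a standard deviation of 10 (so a variance of 100) and small samples of 5. Both numbers are arbitrary choices for illustration, and `variance_with_divisor` is a hypothetical helper.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

POP_SD = 10        # hypothetical population SD, so the true variance is 100
SAMPLE_SIZE = 5
NUM_SAMPLES = 20000

def variance_with_divisor(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by divisor."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor

samples = [
    [random.gauss(0, POP_SD) for _ in range(SAMPLE_SIZE)]
    for _ in range(NUM_SAMPLES)
]

# Average the variance estimates over many samples, once per divisor.
avg_var_n = sum(
    variance_with_divisor(s, SAMPLE_SIZE) for s in samples
) / NUM_SAMPLES
avg_var_n_minus_1 = sum(
    variance_with_divisor(s, SAMPLE_SIZE - 1) for s in samples
) / NUM_SAMPLES
```

Averaged over many samples, dividing by n tends to come in low (around 80 here), while dividing by n - 1 tends to land near the true population variance of 100.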
Next up, confidence intervals.
is actually suboptimal in most cases (the more measurements you take, the more errors you make)
That was a new one. The variance of the measurement error goes down with more measurements.
So I'm curious. Do you have an example in what situation more measurements would be suboptimal vs error? (I can believe that eventually they would be suboptimal vs gain (result for cost). Sampling a static, decent (for example, non-stratified), distribution very quickly becomes fairly accurate, as I see you note later.)
Nice introduction of degrees of freedom, btw.
Need some clarification..
You might have noticed a problem in the discussion of the sampling distribution. Computing the standard error requires knowing the standard deviation of the population (σ), and the reason we were taking samples in the first place was because we didn't know σ.
Yeah, but the sampling distribution is itself a distribution, so can't one just compute the variance of this distribution and take the sq. root?
We can just substitute the standard deviation we got from the sample as an estimate of the standard deviation of the population.
You mean the sample distribution, right?
Gyan, good question about standard error. Since the sampling distribution is derived from a bunch of samples from the same population, its standard deviation (which is the standard error) will be a function of the standard deviation of that population.
And yes, when I said sample, I meant its distribution.
Another thing, Chris.
First you state,
For a population with a standard deviation σ, the standard deviation of the sampling distribution will be: ..[snipped].. equal to the standard deviation of the population divided by the square root of the sample size (N).
So, in other words, SD(population) = SD(sampling distribution) x sq.root(N in the sampling distribution). Is that right?
But later on, you state,
We can just substitute the standard deviation we got from the sample as an estimate of the standard deviation of the population.
'Sample' here, again, refers to the sampling distribution, right? If so, isn't the estimate off by a factor of sq.root(N), as per the earlier equation?
Actually no, you're substituting the standard deviation from the sample, not from the sampling distribution. It will be off, but it's your best estimate.
you're substituting the standard deviation from the sample
But which? The sampling distribution, by definition, contains many.
Well, we're talking about a t-distribution when we're estimating the population's standard deviation with the sample's standard deviation. A t-distribution is like an approximation of a sampling distribution, but with only one sample.
measuring the value of a variable for every member of a population is rarely feasible, and is actually suboptimal in most cases (the more measurements you take, the more errors you make)
I agree with the first commenter - this doesn't make a lot of sense, unless you're assuming some sort of fatigue effect. Which would be a weird assumption.
I should have said trying to measure the whole (large) population.
A great example of how sampling is more accurate than trying to measure a large population is the U.S. Census. As you probably know, the census involved trying to count everyone. However, this led to the systematic underrepresentation of certain subpopulations (minorities, the poor, immigrants, and so on), because they were likely to be missed by census-takers, and were less likely to return their census forms. This is pretty much how it works anytime you try to measure an entire population: you will inevitably miss individuals, and you will be likely to miss certain sub-groups within the population. In essence, a random sample is easier to make representative than a survey that attempts to measure the entire population.
the standard error requires knowing the standard deviation of the population (σ), and the reason we were taking samples in the first place was because we didn't know σ. That could be a big problem, but we're in luck. We can just substitute the standard deviation we got from the sample as an estimate of the standard deviation of the population.
At this point in your post, you hadn't described the t-distribution yet and the 2nd sentence prior to the one offering the substitution solution refers to 'samples'. So, how is the mean of the 'sample' calculated?
Oh, ok. That's certainly true of the census, but that's a specific issue of not being able to measure the entire population, but assuming that you can. That's not the same as an increase in error with additional sampling units.
Also, the sample-based census is more accurate than a population-based census because of carefully weighted sampling techniques and a different measurement methodology for small groups than for large groups. A simple enumeration based on a simple random sample of the U.S. population would have the same bias as the simple enumeration of the entire population, and would have higher variance.
you will inevitably miss individuals, and you will be likely to miss certain sub-groups within the population.
Ah, thanks! Seems like a case of a physicist meeting a biologist - what we call noise factors you call interesting subgroups. :-) And then you will start to discuss the problem of discriminating between real effects and systematic errors with t-tests. :-P
No, actually I think you are discussing the indecent (phew!), stratified (ugh!), non-static (shiver...) distributions I excluded.
And what js said.
So, how is the mean of the 'sample' calculated?
If you are referring to the 'mean of a sample distribution' you would need the means of each distribution, and you multiply that by 1 over n (where n is the number of distributions you have). The formula is something like X̄ = (1/n)(X1 + X2 + ... + Xn)