Suppose you had a pair of dice and were wondering if they were fair. The average number you will get on a pair of fair dice is seven, so one way you could check your dice is to roll them a few times and look at the average of the results. Trouble is, you aren’t likely to get an average of exactly seven. Suppose you get an average of 9. Are the dice fair? Well, that depends on how many times you rolled them.
I rolled\* a pair of dice twice and averaged the results. I repeated the experiment 1000 times and plotted all the averages. 95% of the averages lie between the two horizontal lines, between 4.5 and 9.5. We would say that getting an average of 9 was not particularly unusual and we would not be able to conclude that the dice were unfair.
This graph shows what happens if you average 50 rolls instead of two. Notice that the averages lie much closer to 7. Now 95% of the averages are between the two horizontal lines, between 6.5 and 7.5. This time, getting an average of 9 is very unusual and we can conclude that the dice were almost certainly unfair.
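If you want to try the experiment yourself, here is a minimal sketch in Python. It is not the code behind the graphs; the function names, the fixed seed and the way the 95% range is picked out (dropping the lowest and highest 2.5% of the averages) are all assumptions made for illustration, so the exact bounds it prints need not match the lines in the graphs.

```python
import random

def average_of_rolls(n_rolls, rng):
    """Average total of a pair of fair dice over n_rolls rolls."""
    totals = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]
    return sum(totals) / n_rolls

def central_95_range(n_rolls, n_experiments=1000, seed=0):
    """Repeat the experiment and return the range holding roughly the middle 95% of averages."""
    rng = random.Random(seed)  # fixed seed (an assumption) so reruns give the same numbers
    averages = sorted(average_of_rolls(n_rolls, rng) for _ in range(n_experiments))
    # Approximate 2.5th and 97.5th percentiles of the sorted averages.
    lo = averages[int(0.025 * n_experiments)]
    hi = averages[int(0.975 * n_experiments) - 1]
    return lo, hi

for n in (2, 50):
    lo, hi = central_95_range(n)
    print(f"{n} rolls per average: 95% of averages fall between {lo:.2f} and {hi:.2f}")
```

The thing to look for is how much narrower the range gets when each average is taken over 50 rolls instead of two.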
The important difference between averaging two rolls and averaging 50 rolls is the distance between the two horizontal lines. For two rolls it is 9.5-4.5=5, while for 50 rolls it is 7.5-6.5=1. Rather than use this distance, statisticians use something called the standard error, which is about 1/4 of the distance (the lines sit roughly two standard errors either side of the expected value), that is, 1.25 and 0.25 in our two examples. The usual way things are expressed is that a result is statistically significant if it is more than two standard errors from the expected one. So, in the two rolls case 9 is not statistically significant since it is (9-7)/1.25=1.6 standard errors away from 7, while in the 50 rolls case 9 is statistically significant since it is (9-7)/0.25=8 standard errors away.
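The arithmetic above can be written out as a short Python sketch. The function names are made up for illustration, and the standard error is taken to be a quarter of the distance between the lines, which is the rule of thumb used here rather than an exact formula.

```python
def standard_errors_away(observed, expected, band_width):
    """How many standard errors the observed average is from the expected one.

    Assumes the rule of thumb from the text: the standard error is about
    1/4 of the distance between the two horizontal lines.
    """
    standard_error = band_width / 4
    return (observed - expected) / standard_error

def is_significant(observed, expected, band_width):
    """Statistically significant if more than two standard errors from expected."""
    return abs(standard_errors_away(observed, expected, band_width)) > 2

for label, band_width in (("two rolls", 9.5 - 4.5), ("50 rolls", 7.5 - 6.5)):
    z = standard_errors_away(9, 7, band_width)
    print(f"{label}: {z:.1f} standard errors away, significant: {is_significant(9, 7, band_width)}")
```

For two rolls this gives 1.6 standard errors (not significant), and for 50 rolls it gives 8 (significant), matching the figures worked out by hand.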
Notice that if you think you have more dice rolls than you really have, then you might think that the standard error is smaller than it really is and decide that a result is statistically significant when it really isn’t.
You might think that it is really easy to count the number of rolls, but it is really easy to go astray. This time I had a red die and a green die. While I rolled the green die 50 times, I only rolled the red die once. I added the score on the red die and the green die to get 50 different totals and plotted the averages. Even though I took the average of 50 different totals, the results look much more like the first example with only two rolls than like the second example with 50 rolls. The reason for this is that the red die is shared by all the rolls, so you aren’t really rolling as many dice as you think. The technical term for this is clustering, and adjusting the standard errors to allow for clustering is called the clustering correction. In this case, the clustering correction would increase the standard errors from 0.25 to 1.25.
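The red and green dice experiment is easy to simulate as well. The sketch below is again only an illustration: the function names and the seed are assumptions, and it measures the spread of the averages by their standard deviation instead of reading it off a graph, so its numbers will not exactly equal the 0.25 and 1.25 quoted above.

```python
import random
import statistics

def clustered_average(n_rolls, rng):
    """Green die rolled n_rolls times; red die rolled once and shared by every total."""
    red = rng.randint(1, 6)
    totals = [red + rng.randint(1, 6) for _ in range(n_rolls)]
    return sum(totals) / n_rolls

def independent_average(n_rolls, rng):
    """Both dice rolled afresh for every total."""
    totals = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]
    return sum(totals) / n_rolls

def spread_of_averages(average_fn, n_rolls=50, n_experiments=1000, seed=0):
    """Standard deviation of the averages over many repeated experiments."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable output
    averages = [average_fn(n_rolls, rng) for _ in range(n_experiments)]
    return statistics.stdev(averages)

print(f"independent dice:  spread {spread_of_averages(independent_average):.2f}")
print(f"shared red die:    spread {spread_of_averages(clustered_average):.2f}")
```

Both versions average 50 totals, but the version with the shared red die should show a spread several times larger, which is the gap the clustering correction is meant to account for.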
What has all this to do with the “More Guns, Less Crime” data? Well, when you think of random changes in the crime rates in a particular county, some of the factors causing crime to change just operate within that county (that corresponds to the green die above), while others operate statewide (that corresponds to a red die shared by all the counties within a state). So it is necessary to make a clustering correction to the standard errors in the “More Guns, Less Crime” data.
\* OK, I didn’t really roll dice, but simulated them on a computer.