I heard it again the other night. One of the TV chin strokers talking about this poll or that poll showing Obama (or McCain) ahead with a "statistically insignificant" lead, and I thought to myself, no one who knew much about statistics would use a phrase like that. Strictly speaking, while there may be something like statistical significance, there is no statistical insignificance. It is a nonsensical term that is becoming part of the language by use, so I know I can't stop it with a blog post. If I could, I would, because it invites serious misunderstanding in the speaker and listener alike. So let's recap what's wrong with saying "statistical insignificance."

What's intended by the statement? Probably something like, "Although a poll was taken showing Obama ahead by 1 percentage point, if an official election were held at the time the poll was taken we might find Obama would lose by 2 percentage points" -- or something like that. I think I'm being rather generous to the understanding of the pundits by spelling it out like this, but I'm in a generous mood. So let's first agree on the purpose of political polling. Presumably it's to estimate the relative sizes of Obama's and McCain's votes if the election were held on the day or days the poll was taken. You could determine this definitively if you found everybody who was going to vote in the election, asked them what their votes would be, and you could be assured they were telling you the truth. Let's grant the last assumption, since we wish to discuss statistics and not psychology or sociology.

Of course if we did poll everyone, it would be almost the same thing as holding the election instead of taking a poll, except that if we did it in the form of a poll we would immediately be faced with a problem: determining who was going to vote that day. We know that the possibilities on voting day are limited to those eligible to vote in that jurisdiction, i.e., registered voters. But not all registered voters vote in every election. Different pollsters use different methods to figure out who a "likely voter" is, and because these "likely voter models" (the characteristics of a person that predict whether they will vote or not) vary among pollsters, so will their polling results. After all, they are asking a different group of people if their models are different. To add to the problem, significant numbers of people have yet to register, so a likely voter model can't just confine itself to registered voters. Likely voters are not just a subset of registered voters (and vice versa). The only requirement is that you be *legally eligible* to vote. Pollsters don't ask 12 year olds for their opinions.

Whether registered voters (which in reality is one form of a likely voter model) or some more sophisticated model, even if we could determine who would vote if an election were held that day we wouldn't ask *everyone*. That would require finding and asking questions of around 100 million people. Instead we'd ask a representative sample of voters. Statistical theory tells us we can do almost as well by asking only a tiny fraction of these 100 million, say 1000 or so, if they are representative of the 100 million. "Representative" has a technical meaning here. It means that if you had a list of everyone who was going to vote in November (and remember you are only guessing about who would be on the list at this point), then any name on the list has as good a chance of being asked their preference as any other. So even if you knew the exact characteristics of a November voter and a unique identifier (e.g., a name and address or telephone number), you would still need a way to make contact with them and ask them your questions and get a response. Often the best you can do is find lists you think may be good enough, like a list of land line telephone numbers. Many telephones are businesses, so they aren't of interest, and many people whose views are of interest may not be reachable that way: they don't have a land line (e.g., no phone or only a cell phone) or they may not be home when you call or they may refuse to answer your questions. The practical questions in the previous paragraph all affect this representativeness and account for many of the differences in polls taken simultaneously by different pollsters. But since we want to concentrate on statistics, let's "assume away" these practical problems, too.

The statistical methods pollsters use are solidly based in probability theory but they assume we know the kind of probabilistic process that generated the data. We are going to assume a very simple probability model for how Obama/McCain votes are generated. We will model the propensity to vote for one of the two candidates by the outcome of a coin flip, where the probability that the coin comes up heads or tails is directly related to whether any person in the polled population will vote for Obama (heads) or McCain (tails). You might think this a very unrealistic model, but we are letting the abstract "randomness" of coin flipping stand for all the unknown factors that determine how an individual person votes. We are only interested in the summary behavior of all the voters, the probability they will vote for one or the other candidate.

Now if these probabilities aren't equal ("50-50") the coin isn't a fair coin. It's a biased coin, with a tendency to turn up heads more often than tails or the reverse, depending on how it is weighted. The job of the pollster is to accurately estimate the probabilities of heads and tails in the election coin. If the probabilities are, say, .51 Obama, .49 McCain, then on average 51% of the voters would vote Obama and 49% McCain. So the pollsters are trying to estimate the chance that heads or tails will come up by flipping the coin, which corresponds to asking people whom they will vote for. Asking 100 representative voters is like flipping the coin 100 times, etc. With that underlying probability model (or one more complicated), we can do some statistics, i.e., look at some actual data from a representative sample and interpret what we see about what the much larger population of voters will do. This process of using a representative smaller sample to tell us about a much bigger population is called statistical inference. We are inferring the "big" vote from the much smaller poll.

Let's step back a minute to the world of coin flipping instead of voting. Suppose you were flipping a coin to decide if you or a friend got the last piece of candy. You'd like to be confident it was a fair coin and not a crooked one. So your friend agrees to let you test it first. If you flip it once, that doesn't tell you anything. If it comes up heads it could have heads on both sides. If you flip a fair coin twice, one head and one tail (in either order) is twice as likely as two heads and twice as likely as two tails, so the chance of two the same equals the chance of two different. If it came up two heads or two tails you still wouldn't know if it had heads on both sides or tails on both sides. So you flip it ten times. If the probabilities of heads and tails were each 50% you expect it to come up heads five times and tails five times. But that's *on average*. If you just flip it ten times it could easily ("by chance") come up heads 7 times and tails 3 times or 4 heads and 6 tails, or even (improbably) all heads or all tails. From probability theory we can calculate that the chance of 10 heads is only about one in a thousand, so if it came up all heads you might suspect (but couldn't be 100% sure) it was a crooked coin. The same might be true for a division of 9 and 1 or 8 and 2, but if you got 6 or 7 heads (or tails) you couldn't be so sure. But what if the coin were just a little bit biased, say 51% chance of heads and 49% chance of tails? Then flipping it ten times wouldn't be very useful, either. How about 100 times? Again, you might be able to tell extreme departures from fairness (say a 60% probability of heads and a 40% probability of tails) but not 51% and 49%.
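The numbers in the paragraph above can be checked with the binomial formula. This is only a minimal sketch: the helper name `binom_pmf` is made up for illustration, and it assumes the flips are independent with the same chance of heads each time.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two flips of a fair coin: one of each is twice as likely as two heads.
p_two_heads = binom_pmf(2, 2, 0.5)
p_one_each = binom_pmf(1, 2, 0.5)
p_two_tails = binom_pmf(0, 2, 0.5)

# Ten flips of a fair coin: all heads is 1 chance in 1024.
p_ten_heads = binom_pmf(10, 10, 0.5)

# Seven heads in ten flips, fair coin versus a slightly biased (51%) coin.
p_seven_fair = binom_pmf(7, 10, 0.5)
p_seven_biased = binom_pmf(7, 10, 0.51)

print(p_two_heads, p_one_each, p_two_tails)
print(p_ten_heads)
print(p_seven_fair, p_seven_biased)
```

The last two values come out close to each other, which is the point: ten flips cannot separate a fair coin from a 51/49 coin.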

Each time you do the coin flipping experiment (whether each experiment is 10 flips or 100 flips or 1000 flips) you will get a number for the proportions of heads and tails. It won't be the same number each time. Once maybe it will be 53 heads and 47 tails. Another time it might be 49 heads and 51 tails, etc. Each set of 100 flips is called a sample, and the proportion of heads (or tails) in each trial of 100 flips is called a sample statistic. A statistic is a number calculated from your data. The sample statistic will jump around a little bit with each set of 100 (in the example 53, 49, . . .) and if you did this 100 coin flip experiment over and over again you would get a whole slew of these numbers. The way those numbers are spread out is called the distribution of the sample statistic. In the polling example, the sample statistic is the proportion of respondents who said "Obama" when asked who they prefer. To get the distribution of the sample statistic for the polling example you would have to take the poll over and over again from a representative sample, asking the same questions. Pollsters don't do this. They only do it once. So they want to be sure their single sample statistic is pretty close to the true propensity to vote for Obama in the voter population, which is the number they are really interested in but don't know.
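Since pollsters can't repeat a poll thousands of times, a simulation is one way to see what the distribution of the sample statistic looks like. The sketch below is illustrative only: the function name, the seed, and the choice of 2000 repetitions are all assumptions made for the example, and it uses a fair coin.

```python
import random

def sample_heads(n_flips, p_heads, rng):
    """Flip a coin n_flips times and return the number of heads."""
    return sum(rng.random() < p_heads for _ in range(n_flips))

rng = random.Random(2008)  # fixed seed so the run is repeatable

# Repeat the 100-flip experiment 2000 times with a fair coin.
counts = [sample_heads(100, 0.5, rng) for _ in range(2000)]

mean_count = sum(counts) / len(counts)
lowest, highest = min(counts), max(counts)

print("average heads per 100 flips:", mean_count)
print("smallest and largest counts seen:", lowest, highest)
```

Each entry in `counts` plays the role of one poll. The average sits near 50, but individual samples wander several flips to either side.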

The way they do this is to increase the number of coin flips. The underlying mathematics of this depends on a deep mathematical result called the Law of Large Numbers, but essentially this is what it says. The more often you flip the coin, the more likely it is that the sample statistic (the proportion of heads or the proportion of respondents who say Obama) will be close to the true underlying probability you are interested in. Using statistical theory we know that if you flip the polling coin around 1000 times, your sample statistic (the proportion of heads or the Obama share) rarely gets very far from the true underlying value. "Not very far" means rarely more than plus or minus three points from the true value in this case (that's the "margin of error" you read about). It is possible that sometimes the error is greater, but not too often (in this case just a probability of a few hundredths). Thus, using this method you can get information on the probability of an Obama/McCain vote by only asking 1000 people instead of 100 million.
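The "plus or minus three points" figure can be roughly reproduced with the usual normal approximation. The sketch below rests on assumptions the post does not spell out: a 95% confidence level (z = 1.96), a race near 50-50, and a simple random sample. The function name is invented for the example.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    return z * sqrt(p * (1 - p) / n)

# Margin of error for a few sample sizes, as a fraction (0.03 means three points).
moe_by_n = {n: margin_of_error(n) for n in (100, 1000, 10000)}

for n, moe in moe_by_n.items():
    print(n, round(100 * moe, 1), "points")
```

Going from 100 to 1000 respondents shrinks the margin a lot, but going from 1000 to 10000 buys much less, which is one reason polls of about 1000 are so common.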

So let's get back to "statistical insignificance." This term arose as a contrary to "statistical significance." Now statistical significance is a common technical term used in relation to a specific hypothesis, for example, that the coin is fair. If you flipped a coin and got ten heads and no tails, this result would be said to be statistically significant evidence against the proposition that the coin was fair. While ten heads in a row of a fair coin is possible, it is very unlikely, a "statistical fluke." But suppose the result were 6 heads and 4 tails. It is quite possible this is a crooked coin, say with probabilities of 58% heads and 42% tails. If a casino had a game of chance based on flipping this coin, in the long run they'd come out ahead. But your test of 6 heads and 4 tails wouldn't identify that possibility, i.e., you couldn't tell your coin from a fair coin. Is the difference between 6 heads and 4 tails "insignificant"? No. It doesn't say at all that the coin is fair, only that it *might* be fair or it might be crooked with a certain plausible level of crookedness. That may seem like splitting hairs, but the use of the word "insignificant" makes the difference between heads and tails sound as if it is not real and that the true probabilities of voting for Obama versus McCain were really the same or, at worst, almost the same. It doesn't have to say that, but that is often what people hear. Moreover how important the difference between 60% heads and 40% tails is depends on how many times you flipped the coin (i.e., how many people in your sample). A difference of 60% to 40% means something different in a sample of 10 than it does in a sample of 100 or a sample of 1000. So it's not just the difference that is significant or not, but the difference for that sample size.
Nor does the statistical question have anything at all to do with the possibility of sample bias, all the issues we raised at first about the list and representativeness of the sample and the underlying model.
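The point above about the same 60/40 split meaning different things at different sample sizes can be made concrete with an exact test against a fair coin. This is a sketch, not anything pollsters necessarily do: the function name is hypothetical and the two-sided definition used here (results at least as far from half as the one observed) is one common convention.

```python
from math import comb

def two_sided_p_value(heads, n):
    """Chance a fair coin lands at least as far from n/2 heads as observed."""
    distance = abs(heads - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1)
                  if abs(k - n / 2) >= distance)
    return extreme / 2**n

# The same 60% heads at three sample sizes.
p_10 = two_sided_p_value(6, 10)
p_100 = two_sided_p_value(60, 100)
p_1000 = two_sided_p_value(600, 1000)

print(p_10, p_100, p_1000)
```

With 10 flips a fair coin does this well or "worse" about three times in four; with 1000 flips it essentially never does. The split is identical each time, and only the sample size changed.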

The other problem is that the words "significant" and "insignificant" can be misread as "important" and "not important" (the common uses of the word). But statistically significant differences can be of no importance for many purposes while in politics a difference of one vote can be the difference between winning and losing. That's a problem with the whole "significant" terminology, which unfortunately is even harder to fix.

You can say my complaint about the use of the phrase "statistical insignificance" is a bit of a straw man argument, and I'll grant you it is. But it gives us a chance to talk about a lot of things related to it.

And anyway, it annoys me.

Interesting point, but one complaint. Your statement

"The underlying mathematics of this depends on a deep mathematical result called the Central Limit Theorem" should have referred to the Law of Large Numbers. The CLT explains why the behavior of the different sample proportions settles down to a normal law: the LLN explains why we can be sure the proportions eventually get close to the "true" percentage.

I think you're being a little fussy here, aren't you?

You may not like the actual phrase, but it's surely better to make it clear that there's really nothing in it between Obama and McCain if there is only one percentage point between them. It has to be better than the totally accurate - but misleading - tabloid headline of "Obama leads McCain in the polls" !

Dean: My face is red. Mental lapse on my part. You are completely correct. I have corrected it in the post. Many thanks.

Martin: That's why I included the paragraph at the end. It annoys me.

It's as annoying as saying it's a statistical tie or a dead heat.

45-45 are those things. Obama up by one or McCain up by one is neither. If it were so, then FL 2000 would have been declared a statistical tie and Bush and Gore would each have had a desk in the oval office, or some such nonsense.

"Too close to know" or "well within the margin of error" is really more appropriate.

If you are a journalist or an ordinary informed citizen, do check out the 20 questions from the National Council on Public Polling, especially # 19:

So, talking about "statistically insignificant" is a bit like talking about a hole in the International Space Station in terms of how much vacuum leaks into the station?

I only mean this as real light criticism but in the spirit of statistical "pet peeves" you actually got into one of mine. While you did a nice job explaining sampling variance and confidence intervals you relied on the well worn path of coin-tosses to do it. I don't have a huge problem with that in this context but it does raise my pet peeve about assuming independence of the sample (like coin tosses) in a context where there is no such independence (like voting).

This is actually most annoying in sports where some otherwise smart people make statements about related events using statistics that assume the events are independent when they are most definitely not.

floormaster: Isn't the question here whether the *events* are independent? Thus A's answer doesn't affect B's answer and vice versa, i.e., P(A and B) = P(A)*P(B).

"Many telephones are businesses, so they aren't of interest, and many people whose views are of interest may not be reachable that way: they don't have a land line (e.g., no phone or only a cell phone) or they may not be home when you call or they may refuse to answer your questions"

Are polls only using land-lines? I would think this would definitely skew results, as a large % of under-thirtyʻs only use cell phones. And does the use of answering machines and call-screening affect results? Any answers? Mahalo, Terry McNeely

terry: Federal regulations do not allow autodialing of cell phone numbers. They have to be manually entered. Accepting such a call potentially costs money, so there are restrictions. I don't know if there are even lists of all cell phone numbers the way there are of land line numbers. The cell phone under count is of serious concern to pollsters because it is known that cell phone only people are substantially different than those that can be reached by a landline (although they may also have a cell phone).

Revere,

thank you. i suspect a larger proportion of cell-phone only people prefer Obama over McCain.

Revere, that was a very interesting (albeit verbose) explanation of the issue. :)

Completely aside from politics and the ignorance of pollsters, this smacks of a common trap that many *scientists* fall into. I.e., if the p-level isn't .05 or less, there is no result.

First of all, the 95% confidence interval (p=.05) is a completely arbitrary number. Granted, it has since achieved practically sacrosanct status in the scientific community, but really, .05 is just a number. I understand the general inclination for some sort of "standard" per se, but people should understand that confidence intervals and p-levels are a continuum, not a cliff that drops off at 5%.

As you alluded to in your post, significance is really a function of two things: sample size and effect size. A truly thoughtful person needs to understand all three statistical components to fully understand what is going on in the world. For example, if a person had an effect size r (or a correlation) of .30 (generally signifying a moderately strong effect), but the p-level was .06 (gasp!) does that mean that the result is insignificant (i.e. not real), since p was not less than .05? No, of course not. It may have just been an issue of too small a sample size.

On the flip-side, effect size isn't the entire story either. My stats professor in grad school used to love telling the classic aspirin story... as he told it, when scientists were measuring the effect of aspirin on heart attacks, they found a miniscule pathetic effect size r of .04. That's so close to zero, it's practically nothing. But once the researchers found this effect size, they immediately stopped the study because they thought that it would be unethical to not give heart disease patients aspirin. When compared to the number of patients at risk for heart attacks, that itty bitty .04 effect size translated into thousands of lives saved.

Anywho, the moral of the story, for all those scientists, researchers, and statisticians out there... make sure you understand the entire picture, instead of relying on just one statistic.

On a different note, Terry, that's an excellent point. More and more people are using cell phones as their sole phone line, and generally these people are a little different from the general population. I hope the telephone survey-ers in the world are taking note.

the cell phone Q is an interesting one... gallup, CBS and others include some cell phones. if you're really into polling, check out www.pollster.com for more.

Tasha: Yes, all your points are correct. The egregious error of thinking that not rejecting the null is the same as accepting the null is extremely common. There was much more I could have said (verbose as I was) and the confusion between public health significance and statistical significance is one of my hobby horses. Things can be of public health significance and not statistical significance and vice versa. I purposely didn't comment on the 0.05 level because there is too much to say about it (I once tried to find its historical origins but couldn't pin it down completely), except that most biologists and public health folks aren't aware that physicists frequently use .10 as a convention. But we know that's a soft science!

Dem: Thanks for the link. The cell phone Q. is indeed interesting. I'll head over to read it now.

here:

Cell Phones and Political Surveys: Part I

Cell Phones and Political Surveys: Part II

New Pew Data on Cell Phones

More Cell Phone Survey News

Revere: You are right about noting I was talking about assuming the voting events are independent. It may be of little importance as most of the voting events are independent.

The issue of cell phones in polling really is not that new. That is, if you honestly want to control for it you can, without too much of an effect on the variance. It is getting worse, I suppose, though.

Random dialing modes were always biased--larger households and larger incomes had more phone lines and there has always been a percent of people who did not have a phone. This was a problem in most of the 90's for RDD health insurance surveys as not having a phone was shown to be correlated highly with not having health insurance in the Census/BLS (CPS) data.