Good Math, Bad Math

Finding the fun in good math; Shredding bad math and squashing the crackpots who espouse it.

Mark Chu-Carroll (aka MarkCC) is a PhD Computer Scientist, who works for Google as a Software Engineer. My professional interests center on programming languages and tools, and how to improve the languages and tools that are used for building complex software systems.


Margin of Error and Election Polls

Category: statistics
Posted on: October 28, 2008 9:26 PM, by Mark C. Chu-Carroll

Before I get to the meat of the post, I want to remind you that our DonorsChoose drive is ending in just a couple of days! A small number of readers have made extremely generous contributions, which is very gratifying. (One person has even taken me up on my offer of letting donors choose topics.) But the number of contributions has been very small. Please, follow the link in my sidebar, go to DonorsChoose, and make a donation. Even a few dollars can make a big difference. And remember - if you donate one hundred dollars or more, email me a math topic that you'd like me to write about, and I'll write you a blog article on that topic.

This post repeats a bunch of stuff that I mentioned in one of my basics posts last year on the margin of error. But given some of the awful rubbish I've heard in coverage of the coming election, I thought it was worth discussing a bit.

As the election nears, it seems like every other minute, we hear predictions of the outcome of the election, based on polling. The thing is, pretty much every one of those reports is utter rubbish.

What happens is that they look at polls, and they talk about the results and what they mean. But they, like almost everyone, use the margin of error as if it means something very different than what it really does.

What you hear is that, for example, Barack Obama is leading in Florida by 4 points, but the margin of error is +/- 4%, so it's really not a significant lead. What the journalists seem to think it means is that the margin of error is a total measure of the accuracy of the polls - that the poll result is within the margin of error of the "true" result that the poll measures. So, by that interpretation, the poll is predicting an outcome of 52/48, and the margin of error means that actual voter preferences range between 48/52 and 56/44.

The thing is, that's not what the margin of error means. The margin of error is a statistical measure of the probabilistic size of errors caused by unintentional sampling errors.

Polls - and much of statistics in general - are based on the idea of sampling. Given a large, relatively uniform population, you can get an amazingly accurate measure of that population by looking at a small subset of it, called a representative sample. A sample is a randomly selected group that is intended to be a microcosm of the entire population. In an ideal representative sample, the sample must have the same distribution of differences as the population as a whole.

There's a big problem there: how can you be sure that your sample is representative? The answer is, you can't! The only way to know for certain that a sample is representative is to measure the entire population, and compare the results of doing that to the sample. But once you've measured the entire population, what's the point of looking at a sample?

Fortunately, we can assess how likely it is that our sample is a good representation of the population. That's what the margin of error does - it measures the likelihood of the sample being representative of the population. It's computed by combining a bunch of factors, the primary ones most commonly being the size of the population and the size of the sample. Given those, we can assess how certain you can be of your measure being pretty close to accurate. Typically, we describe that certainty by stating how large an interval you need to define on either side of the measured statistic to be 95% certain that the "actual" value is within that interval. The size of that interval is the margin of error.
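To make that concrete, here's a rough sketch of the textbook calculation for a simple random sample. It is only an illustration, not the formula any particular pollster is confirmed to use: the function name `margin_of_error`, the worst-case default of p = 0.5, the 1.96 multiplier (95% confidence under a normal approximation), and the optional finite population correction are all assumptions made for this example.

```python
import math

def margin_of_error(sample_size, p=0.5, z=1.96, population_size=None):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumptions (not from any specific poll): normal approximation,
    z = 1.96 for roughly 95% confidence, worst case p = 0.5 by default.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    standard_error = math.sqrt(p * (1.0 - p) / sample_size)
    if population_size is not None:
        # Finite population correction: only matters when the sample is a
        # noticeable fraction of the whole population.
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error

if __name__ == "__main__":
    for n in (600, 1000, 2400):
        print(f"n = {n:5d}: +/- {100 * margin_of_error(n):.1f}%")
```

Under these assumptions, about 600 respondents gives roughly +/-4% and about 2,400 gives roughly +/-2%. Notice that once the population is much larger than the sample, the population size hardly changes the answer; the sample size does nearly all of the work.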

So when you hear a pollster talking about a "poll of likely voters showing that Obama is ahead by 8 points with a margin of error of +/-4%", the big thing you should do is realize what they're measuring. In that case, the population isn't "the set of people who are going to vote next Tuesday" (even though that's what the journalists try to make you think); the population is "the set of people who the poll believes are likely to vote next Tuesday". So the margin of error is a measure of how well their poll matches the population of people who they believe are likely to vote - which is quite a different thing from the population of people who actually do vote. In fact, it actually does slightly less, even, than that: it measures how much sampling error is contained in their poll due to unintentionally selecting a non-representative sample. That's not really saying very much in an election poll.

The population being sampled by polls is likely to be quite different from the actual population of voters for a number of reasons, and this difference produces measurement errors that almost certainly significantly outweigh the unintentional sampling errors measured by the margin of error. For example:

Intentional Sample Bias
Intentional sample bias covers a variety of techniques that pollsters use when they select people for the sample. For an extreme example, some polls (like, I think, Zogby) try to get an equal number of people who self-identify as Republicans and Democrats. But in most states, the numbers of party members in the two major parties are not equal. They are, in fact, often pretty dramatically uneven. A less dramatic but still significant one is that many polls do their polling through phone calls, and only call land-lines. Many younger people no longer have land-lines; the exclusion of cell-phone numbers therefore excludes some portion of the population from the sample. These kinds of sample bias produce a significant mismatch between the population of real voters, and the population being sampled.
Unknown Population
The biggest source of polling error leading up to an election is the fact that the real population is unknown. No one is sure who's going to vote - which means that no one is certain of what the correct population to sample is. Pollsters try to identify a sample of people who are likely to vote. But since the population is unknown, they don't know if they're including people in the sample who aren't in the actual population of voters, and they don't know if they're excluding people from their samples who are going to vote. In this election, this is likely to be a significant effect, because huge numbers of people registered to vote for the first time, but no one knows how many of those newly registered voters are likely to show up and vote. Once again, there's a problem related to the fact that the population that they're sampling isn't the same as the population that the poll is trying to measure - so that error factor is outside the margin of error.
Phrasing Bias
You can get significant differences in polls based on how the question is phrased. "Who are you going to vote for?" will likely generate different results from "Are you going to vote for Obama or McCain?", which will likely generate different results from "Do you plan to vote for McCain or Obama?", which will generate different results from "Do you plan to vote for a Democrat or a Republican in the presidential election?". This is a well-known problem, but it still has a significant effect.
Dishonest Answers
People aren't entirely trustworthy. They don't necessarily answer questions honestly. A frequently discussed version of this is called the Bradley effect. The Bradley effect is a phenomenon where people are reluctant to admit to being racist. So when a pollster asks them if they're going to vote for a black man, they'll say "yes", but when it actually comes to voting, they'll vote for the white guy. I've heard some people speculate on a reverse Bradley effect this year in some southern states, where people are reluctant to admit that they're going to vote for a black man, so they lie and say they're voting McCain. But the truth of the matter is, we don't know if the people answering the polls are answering honestly. If they're not, that skews the poll results, and once again, it's not covered by the margin of error.
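
To see how an error like the land-line problem can swamp the margin of error, here's a small back-of-the-envelope sketch. Every number in it is hypothetical and invented purely for illustration (the 30% cell-only share and both support levels are not measurements of anything), and the helper names `population_support` and `sampling_margin` are made up for this example.

```python
import math

def population_support(cell_only_share, support_cell_only, support_landline):
    """True support across everyone: a weighted average of the two groups."""
    return (cell_only_share * support_cell_only
            + (1.0 - cell_only_share) * support_landline)

def sampling_margin(sample_size, p=0.5, z=1.96):
    """Textbook 95% margin of error for a simple random sample (assumed formula)."""
    return z * math.sqrt(p * (1.0 - p) / sample_size)

# Hypothetical inputs, NOT real polling data.
cell_only_share = 0.30    # fraction of voters reachable only by cell phone
support_cell_only = 0.65  # candidate's support among cell-only voters
support_landline = 0.48   # candidate's support among land-line voters

actual = population_support(cell_only_share, support_cell_only, support_landline)
polled = support_landline  # a land-line-only poll can only ever see this group
coverage_bias = actual - polled
margin = sampling_margin(1000)

print(f"support in the whole population: {actual:.3f}")
print(f"support a land-line poll sees:   {polled:.3f}")
print(f"error from the sampling frame:   {coverage_bias:.3f}")
print(f"reported margin of error:        {margin:.3f}")
```

With these made-up inputs, the poll settles around 48% while the whole population sits at about 53%: a gap of about 5 points against a reported margin of about 3 points. Taking a bigger sample shrinks the margin of error, but it does nothing to the gap, because the gap comes from who gets sampled, not from how many.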

Comments

1
...how large an interval you need to define on either side of the measured statistic to be 95% certain that the "actual" value is within that interval.

"95% certain"? How Bayesian of you.

Here's a frequentist/sampling theory rephrasing: "...how large an interval you need to define on either side of the measured statistic such that 95% of intervals in hypothetical repetitions of the poll would contain the "actual" value."

It turns out Bayesian intervals more-or-less coincide with the confidence intervals for polls with large sample sizes, so both statements are correct.

Posted by: Canuckistani | October 28, 2008 10:15 PM

2

A great explanation, but I always thought Yes, Prime Minister did it the best: http://www.youtube.com/watch?v=2yhN1IDLQjo

Posted by: Funkopolis | October 28, 2008 10:38 PM

3

Can someone please comment on the "Probability of Leading" concept...that if Candidate A leads Candidate B in a poll by 52% to 48% with a margin of error of say, 2%...then what is the actual probability that Candidate A is ahead or leading in the polls?

Posted by: Jeff Williams | October 28, 2008 10:51 PM

4

That was very interesting.

However, for the severely math-deficient like myself, would you apply your comments to the example? Does an eight-point lead with a +/-4% margin of error mean anything much about what is going to happen? Does it mean Obama really is likely to win, or not, or nobody knows?

Posted by: JuliaL | October 28, 2008 11:40 PM

5

I'm curious as to your take on Nate Silver's methods - see http://www.fivethirtyeight.com

Posted by: llewelly | October 29, 2008 12:06 AM

6

In case you haven't seen it, Peter Norvig's US Election FAQ discusses polls in detail, among other pertinent topics: http://norvig.com/election-faq.html#polls

He includes links to a few sites that give averages of various polls.

Posted by: Yahoopster | October 29, 2008 1:46 AM

7

Re: #3 & #4

There's no mathematical way to know what the real election result is going to be. The best we can do is look at the polls, understanding what the limitations of the sampling method are, and from that make an educated guess about how well the polls reflect reality.

Another commenter pointed to fivethirtyeight.com, which is a site where a guy who is very clued to statistics tries to combine results from all of the available polling, including factoring in weights based on the sampling methods used by the various polls. I think his method looks very good, and that he's likely the most accurate predictor. But we won't know for certain until his method is put to the test, by seeing how well it really matches the final results next week.

Posted by: Mark C. Chu-Carroll | October 29, 2008 7:45 AM

8

Another source of error is that these polls are self-selecting: they do not include the responses of people who refuse to take the poll. I have always thought that they should include that number to judge the magnitude of the error. After all, if they call 2000 people but only get 1000 responses, that has to add to the margin of error.

Posted by: KeithB | October 29, 2008 12:20 PM

9

Dewey vs. Truman, 1948. Dewey's predicted victory arose from telephone polling that selectively excluded Truman's lower class support. A vast number of young adults with little hope of economic ascension are Obama's natural constituency by choice and by counter-reaction. His campaign makes a fetish of collecting cell phone contacts (be first to know his VP choice!). McCain's caterwauling is trench warfare compared to blitzkrieg. Your point about land line polling may be prescient.

Posted by: Uncle Al | October 29, 2008 12:25 PM

10

Minor comment: You're missing tags around "Unknown Population".

Posted by: Rupert | October 29, 2008 12:41 PM

11

Mark,

In addition to 538, you might be interested in www.electoral-vote.com. It's run by A. Tanenbaum, and in previous elections it has proved to be very accurate. He also posts a daily, very interesting news summary.

Posted by: MiguelB | October 30, 2008 10:17 AM

12

Uncle Al (haven't seen him in a while) mentioned Dewey vs. Truman. I seem to recall hearing that that problem was what made George Gallup set up shop to minimise this problem of unrepresentative sampling. I just thought that happened a lot earlier.

Posted by: Sili | October 30, 2008 3:36 PM

13

Quite a few pollsters are sampling cell phones. Nate at 538 has looked specifically at this effect.

By the way, one major phrasing bias is "McCain, Obama, Barr, Nader". There will be some 3rd party votes, and some might be conservatives rather than liberals. Then there is the phrasing bias when the ballot itself has a very poor user interface. One example is one where voting a straight ticket does not result in a vote for President, and another is where the candidates for one office are spread over two columns (leading to a multiple selection mis-vote).

Posted by: CCPhysicist | October 30, 2008 5:09 PM

14

Re 1948 polls

Although it is true that there was a sampling bias in Gallup's polling that year, the biggest factor in the poll's failure was that his organization stopped polling several weeks before the election under the impression that Truman was too far behind to catch up. He thus missed the late surge of Democratic voters. He did not make the same mistake in 1968, where polls up through the 3rd week in October showed Nixon with close to a double digit lead. However, in his final poll, taken the day before the election, he detected the surge toward Humphrey and accurately predicted that the race would be close. In fact, the race was decided in Nixon's favor by some 50,000 votes in his home state of California.

Posted by: SLC | November 3, 2008 8:35 AM

15

There's another error often made when talking about polls. When the results are "within the margin of error", then people assume that the actual number has no meaning. For instance, a poll result asking whether a particular ballot measure will pass: 51% in favor (2% MOE) is treated the same as 49% in favor (2% MOE); but actually, if there are no systematic errors in the poll (of the sort described in Mark's post), the former indicates a much greater chance that the measure will pass.

Posted by: Carl Witty | November 5, 2008 4:24 PM
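
A rough sketch of the arithmetic behind that claim, and behind the "probability of leading" question in comment 3. It leans on several assumptions that are not in the thread: the quoted margin of error is treated as a 95% interval (1.96 standard errors), the uncertainty is treated as normal, the result is read in the "Bayesian" sense discussed in comment 1, and the poll is assumed to have no systematic errors at all. The function names are made up for this example.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probability_above_half(observed_share, margin_of_error, z=1.96):
    """Chance that the true share exceeds 50%, given only sampling error.

    Assumes the quoted margin of error equals z standard errors and that
    the uncertainty is approximately normal.
    """
    standard_error = margin_of_error / z
    return normal_cdf((observed_share - 0.5) / standard_error)

if __name__ == "__main__":
    for share in (0.49, 0.51, 0.52):
        chance = probability_above_half(share, 0.02)
        print(f"{share:.0%} in the poll, 2% MOE -> {chance:.1%} chance of a majority")
```

Under those assumptions, 51% with a 2% margin works out to roughly an 84% chance of a true majority, 49% to roughly 16%, and the 52/48 example from comment 3 to roughly 97.5%. None of that accounts for the systematic errors described in the post, which is exactly why these figures should not be read as election odds.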

16

I believe the poll which was distorted due to reliance on the telephone was actually the Literary Digest poll of 1936, which predicted a victory by Alf Landon.

Posted by: Steve Morrison | November 8, 2008 11:49 PM

17

"...if there are no systematic errors in the poll (of the sort described in Mark's post), the former indicates a much greater chance that the measure will pass."

How do you KNOW???? If I flip an unbiased coin or use an unbiased random number generator set to 50%, and flip it enough times so that my MOE is 2%, then there will be MANY instances where I will get heads 51% of the time. (Let's say between 50.5% and 51.5%).

So if you get a result that shows a 51% result with a 2% margin of error, HOW DO YOU KNOW the true result?

Let's look at it this way. I create a program that generates TWO trials of 1000 coin flips each. I've set the random number generator to land on heads 50% of the time in the first trial, and 52% of the time in the second trial. I run the program, and BOTH trials spit out results that show that heads came up 51% of the time. If I do not label the results, then HOW are you going to identify which result was generated by which trial?

Posted by: Jeff Williams | November 21, 2008 1:34 PM
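
One way to put numbers on that thought experiment is a likelihood ratio: how much more probable is the observed result under one coin than under the other? The sketch below is only an illustration under the commenter's setup (1000 independent flips, candidate heads rates of 50% and 52%); the function names are hypothetical. The binomial coefficient is left out because it is the same for both coins and cancels in the ratio.

```python
import math

def log_likelihood(heads, flips, p_heads):
    """Log-probability of the observed flips, up to a constant that does not
    depend on p_heads (the binomial coefficient is omitted)."""
    return heads * math.log(p_heads) + (flips - heads) * math.log(1.0 - p_heads)

def log_likelihood_ratio(heads, flips, p_a, p_b):
    """Positive values favor p_a, negative values favor p_b."""
    return log_likelihood(heads, flips, p_a) - log_likelihood(heads, flips, p_b)

if __name__ == "__main__":
    for heads in (510, 530):
        ratio = math.exp(log_likelihood_ratio(heads, 1000, 0.52, 0.50))
        print(f"{heads} heads in 1000: likelihood ratio (52% coin vs 50% coin) = {ratio:.3f}")
```

For 510 heads the ratio comes out very close to 1, so a 51% result really does not distinguish a 50% coin from a 52% coin. That is a different question from the one in comment 15, though, which is about which side of 50% the true value falls on rather than about telling two nearby values apart.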

© 2006-2009 Seed Media Group LLC. ScienceBlogs is a registered trademark of Seed Media Group. All rights reserved.
