Car Dealers and Correcting for Multiple Tests

By mikethemadbiologist on June 18, 2009.

A few weeks ago, there was a minor ruckus about a post that claimed that the shutdown of Chrysler dealerships was biased to protect dealers who supported Clinton's campaign. At the time, I ignored it, figuring it was just another case of Conservative Clinton Derangement Syndrome (BILL CLINTON'S PENIS!!! BILLARY IS A SHE DEVIL!! AAAIIIEEE!!!). But I finally got around to reading it, and guess what?

It's stupid, and based on a sloppy understanding of probability theory.

The authors of the post performed multiple linear regressions to determine if dealerships owned by campaign supporters of various presidential candidates were more or less likely to be closed. What they found is that the regression for Clinton supporters, which indicated that Clinton supporters were more likely to survive the shutdown, had a p-value of 0.125.

A "p-value" is the probability that a given observation occurred by chance and is not a 'real' phenomenon. So, in this case, there is an 87.5% (1 - 0.125) probability that the bias towards Clinton supporters is not a sampling accident.

ZOMG!!! TEH CLINTONZ R EVULS!! Except for one thing: multiple tests were performed, and a correction was not applied. What this means in English is that if you perform enough statistical comparisons, eventually one will be significant by chance alone. This is why, when polls are discussed, people will often claim that one out of twenty answers is wrong*. (Polls typically use the scientifically accepted p-value of 0.05, meaning that, on average, one of twenty (five percent) of significant differences in a poll is unfounded and occurred by chance).

The authors, in an update, recognize this, and remark (italics mine):

A word about multiple experiments:

We found what I will call the "Clinton Effect" after running the data in separate regressions just with Clinton, Obama, McCain. The rest of the variables we added in later testing. One could make the argument that Zero Hedge was "data mining" or "fishing" with multiple experiments eventually bound to find something. Readers will have to judge the import of this observation for themselves.

Actually, we don't have to "judge the import" of anything; we can use math to figure this out. If we make N independent comparisons and there is no real effect, the probability that one or more of them will have a p-value of p or smaller is equal to:

1-(1-p)^N

If one performs three comparisons (and I think the 'minimum' set should have been four, and have included the 'None'--no donation--category), there is a 33% chance of observing one or more tests with a p-value of 0.125 or smaller. With four tests, it increases to 41%. With five tests you get 49%, and with six, 55%.
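For anyone who wants to check the arithmetic, here is a minimal Python sketch of the formula above. It assumes the tests are independent and that there is no real effect in any of them (regressions run on the same dealership data will only approximately meet that assumption). The function name is mine, not anything from the original analysis.

```python
def prob_at_least_one(p, n):
    """Chance that at least one of n independent tests gives a p-value
    of p or smaller when no real effect exists: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

# 0.125 is the p-value reported for the Clinton regression
for n in (3, 4, 5, 6):
    print(f"{n} tests: {prob_at_least_one(0.125, n):.0%}")
```

Run as written, this reproduces the 33%, 41%, 49% and 55% figures quoted above.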

So, judging the import of these findings, I say they're bullshit.

There is an interesting scientific problem here. When you conduct large-scale surveys, such as are done in human genomics, where you look for correlations with hundreds of thousands of genetic markers, finding something that is truly significant becomes very difficult.
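To put a rough number on how hard it gets: one common, and conservative, fix is the Bonferroni correction, which divides the significance threshold by the number of tests. The sketch below is illustrative only. The marker count of 500,000 is an assumed figure standing in for "hundreds of thousands", and the function name is made up for this example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the chance of any false
    positive across n_tests comparisons at or below alpha."""
    return alpha / n_tests

n_markers = 500_000  # assumed for illustration; not a figure from any study
alpha = 0.05

# If no marker has a real effect, this many still clear p < alpha by chance
print(f"expected chance hits at p < {alpha}: {alpha * n_markers:.0f}")
print(f"Bonferroni cutoff per marker: {bonferroni_threshold(alpha, n_markers):.0e}")
```

Under those assumptions, roughly 25,000 markers would look "significant" at 0.05 by chance alone, and a single marker would need a p-value on the order of one in ten million to survive the correction.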

It's also humbling to think that, over one's career, if you have conducted many experiments (non-reinforcing ones), that a small fraction of what you think is real is simply artifact.

Eeek.

*This is actually sloppy and incorrect, but it's too much to ask our pundit class to understand statistics. Actually, it's a miracle they can even dress themselves.


Excellent post. [Of course, I suppose if one in twenty posts are excellent .... :) ]

Regarding the "one in twenty is wrong" ... I actually do (also) like the phrase because it is strong pedagogy. It should probably be "one in twenty is wrong but you can't know which one" but that would lead to "if 1 in 20 is wrong and we can't know which one, then everything we know is wrong!!!!" or the opposite "since one in twenty is always wrong we can ignore uncertainty"

On the broader and more important issue you bring up: This is where a couple of other key features of scientific investigation come in.

1) Describe the plausible mechanism and test that independently. In this case, how does the process of favoring Clinton donors (or any donor class) work, and what is the evidence separate from that already adduced for this (and not just randomly pulled out of the butt evidence, but comparative and evaluative evidence); and

2) Replicate. Different data, same test, if possible. A random soup of p-values under replication will tend to provide randomly spurious results.

"When you conduct large-scale surveys, such as are done in human genomics, where you look for correlations with hundreds of thousands of genetic markers, finding something that is truly significant becomes very difficult."

Which is why I like the maxim my stats professor taught me:

"Replication is the most powerful statistic."

If they split the sample and found the same effect twice, then it would be more believable (test-retest reliability)

"It's also humbling to think that, over one's career, if you have conducted many experiments (non-reinforcing ones), that a small fraction of what you think is real is simply artifact."

Which reminds me of another maxim a fellow student taught me:

"Researchers never die; they just run out of degrees of freedom."

A "p-value" is the probability that a given observation occurred by chance and is not a 'real' phenomenon. So, in this case, there is an 87.5% (1 - 0.125) probability that the bias towards Clinton supporters is not a sampling accident.

Aaaagh! No it isn't! A p-value is the probability of getting that statistic, or something more extreme, given the model and estimated parameters are correct. If you want to decide it's a real phenomenon, you have to take into account the prior probabilities of the different possibilities. That way leads to Bayesianism, and thence to drunkenness in Spain.

I deal with this issue on an ongoing basis with brand new grad students when they come show me their first data analyses.

STUDENT: Look! I measured this physiological parameter under these eight different experimental conditions, and this one is different from this other one! YIPPEE!!!! DATA!!!

ME: Great! How did you compare the different conditions?

STUDENT: I performed t-tests on all the possible pairs of conditions to see which were different from which. LOOK! p < 0.05!!!! YIPPEEE!!!!!!!

ME (rolling eyes): Grasshopper, let me teach you the ways of the ANOVA and the control of experiment-wise p.

