Car Dealers and Correcting for Multiple Tests

A few weeks ago, there was a minor ruckus about a post that claimed that the shutdown of Chrysler dealerships was biased to protect dealers who supported Clinton's campaign. At the time, I ignored it, figuring it was just another case of Conservative Clinton Derangement Syndrome (BILL CLINTON'S PENIS!!! BILLARY IS A SHE DEVIL!! AAAIIIEEE!!!). But I finally got around to reading it, and guess what?

It's stupid, and based on a sloppy understanding of probability theory.

The authors of the post performed multiple linear regressions to determine if dealerships owned by campaign supporters of various presidential candidates were more or less likely to be closed. What they found is that the regression for Clinton supporters, which indicated that Clinton supporters were more likely to survive the shutdown, had a p-value of 0.125.

A "p-value" is the probability that a given observation occurred by chance and is not a 'real' phenomenon. So, in this case, there is an 87.5% (1 - 0.125) probability that the bias towards Clinton supporters is not a sampling accident.

ZOMG!!! TEH CLINTONZ R EVULS!! Except for one thing: multiple tests were performed, and no correction was applied. What this means in English is that if you perform enough statistical comparisons, eventually one will come out significant by chance alone. This is why, when polls are discussed, people will often claim that one out of twenty answers is wrong*. (Polls typically use the conventional significance threshold of 0.05, meaning that, on average, one in twenty (five percent) of the 'significant' differences in a poll is unfounded and occurred by chance.)
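
To see how fast "by chance alone" piles up, here's a quick simulation sketch (purely illustrative; nothing to do with the dealership data): run twenty comparisons where nothing real is going on and count how often at least one of them squeaks under p < 0.05 anyway.

```python
# Illustrative only: twenty comparisons where NOTHING is really different.
# How often does at least one of them come out "significant" at p < 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_tests, alpha = 5_000, 20, 0.05
false_alarms = 0

for _ in range(n_sims):
    # Each comparison is between two samples drawn from the same distribution,
    # so any "significant" difference is pure chance.
    pvals = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
             for _ in range(n_tests)]
    if min(pvals) < alpha:
        false_alarms += 1

print(f"At least one spurious p < {alpha} in {false_alarms / n_sims:.0%} of runs")
# Expect something near 1 - 0.95**20, i.e. roughly 64%.
```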

The authors, in an update, recognize this, and remark (italics mine):

A word about multiple experiments:

We found what I will call the "Clinton Effect" after running the data in separate regressions just with Clinton, Obama, McCain. The rest of the variables we added in later testing. One could make the argument that Zero Hedge was "data mining" or "fishing" with multiple experiments eventually bound to find something. Readers will have to judge the import of this observation for themselves.

Actually, we don't have to "judge the import" of anything; we can use math to figure this out. If we make N independent comparisons, the probability that at least one of them will have a p-value of p or smaller is equal to:

1 - (1 - p)^N

If one performs three comparisons (and I think the 'minimum' set should have been four, and should have included the 'None'--no donation--category), there is a 33% chance of observing one or more tests with a p-value of 0.125 or smaller. With four tests, it increases to 41%. Five tests gets you 49%, and six tests yields 55%.
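
If you want to check that arithmetic yourself, it's a one-liner (just the formula above, nothing fancier):

```python
# The post's numbers: chance that at least one of N independent comparisons
# yields a p-value as small as 0.125, i.e. 1 - (1 - p)**N.
p = 0.125
for n in range(3, 7):
    print(f"N = {n}: {1 - (1 - p)**n:.0%}")
# N = 3: 33%
# N = 4: 41%
# N = 5: 49%
# N = 6: 55%
```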

So, judging the import of these findings, I say they're bullshit.

There is an interesting scientific problem here. When you conduct large-scale surveys, such as are done in human genomics, where you look for correlations with hundreds of thousands of genetic markers, finding something that is truly significant becomes very difficult.
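
The usual blunt instrument here (not something the dealership post gets into) is the Bonferroni correction: divide your significance threshold by the number of tests you ran. A quick sketch of what that does at genomic scale:

```python
# Bonferroni correction (a standard, conservative fix): to keep the chance of
# ANY false positive at 0.05 across many tests, each individual test has to
# clear a threshold of 0.05 divided by the number of tests.
alpha = 0.05
for n_tests in (4, 20, 500_000):
    print(f"{n_tests:>7,} tests -> each p must be < {alpha / n_tests:.1e}")
# With half a million genetic markers, that's a per-test threshold of 1.0e-07,
# which is why genome-wide studies demand such tiny p-values.
```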

It's also humbling to think that, over one's career, if you have conducted many experiments (non-reinforcing ones), that a small fraction of what you think is real is simply artifact.

Eeek.

*This is actually sloppy and incorrect, but it's too much to ask our pundit class to understand statistics. Actually, it's a miracle they can even dress themselves.

Excellent post. [Of course, I suppose if one in twenty posts are excellent .... :) ]

Regarding the "one in twenty is wrong" ... I actually do (also) like the phrase because it is strong pedagogy. It should probably be "one in twenty is wrong, but you can't know which one," but that would lead to "if 1 in 20 is wrong and we can't know which one, then everything we know is wrong!!!!" or the opposite, "since one in twenty is always wrong, we can ignore uncertainty."

On the broader and more important issue you bring up: This is where a couple of other key features of scientific investigation come in.

1) Describe the plausible mechanism and test it independently. In this case, how would the process of favoring Clinton donors (or any donor class) actually work, and what is the evidence for it separate from that already adduced (and not just evidence randomly pulled out of the butt, but comparative and evaluative evidence); and

2) Replicate. Different data, same test, if possible. A random soup of p-values will tend to throw up spurious results, and spurious results tend not to survive replication.

"When you conduct large-scale surveys, such as are done in human genomics, where you look for correlations with hundreds of thousands of genetic markers, finding something that is truly significant becomes very difficult."

Which is why I like the maxim my stats professor taught me:

"Replication is the most powerful statistic."

If they had split the sample and found the same effect twice, then it would be more believable (test-retest reliability).

"It's also humbling to think that, over one's career, if you have conducted many experiments (non-reinforcing ones), that a small fraction of what you think is real is simply artifact."

Which reminds me of another maxim a fellow student taught me:

"Researchers never die; they just run out of degrees of freedom."

A "p-value" is the probability that a given observation occurred by chance and is not a 'real' phenomenon. So, in this case, there is an 87.5% (1 - 0.125) probability that the bias towards Clinton supporters is not a sampling accident.

Aaaagh! No it isn't! A p-value is the probability of getting that statistic, or something more extreme, given that the model and estimated parameters are correct. If you want to decide it's a real phenomenon, you have to take into account the prior probabilities of the different possibilities. That way leads to Bayesianism, and thence to drunkenness in Spain.
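
If you want to see that definition in action, here's a brute-force sketch (a made-up sample, and a null model in which the true mean is zero): the p-value is just the fraction of null-model datasets whose statistic is at least as extreme as the observed one.

```python
# Brute-force p-value: the fraction of datasets generated under the null model
# whose test statistic is at least as extreme as the observed one.
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(loc=0.4, size=25)    # a made-up sample
t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(len(observed)))

# Null model: same design, but the true mean really is zero.
t_null = []
for _ in range(100_000):
    s = rng.normal(loc=0.0, size=25)
    t_null.append(s.mean() / (s.std(ddof=1) / np.sqrt(25)))

p_value = np.mean(np.abs(np.array(t_null)) >= abs(t_obs))   # two-sided tail
print(f"p ≈ {p_value:.3f}")
```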

I deal with this issue on an ongoing basis with brand new grad students when they come show me their first data analyses.

STUDENT: Look! I measured this physiological parameter under these eight different experimental conditions, and this one is different from this other one! YIPPEE!!!! DATA!!!

ME: Great! How did you compare the different conditions?

STUDENT: I performed t-tests on all the possible pairs of conditions to see which were different from which. LOOK! p < 0.05!!!! YIPPEEE!!!!!!!

ME (rolling eyes): Grasshopper, let me teach you the ways of the ANOVA and the control of experiment-wise p.
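
The grasshopper's lesson is easy to put in numbers. A hypothetical simulation sketch (eight conditions with no real differences at all): uncorrected pairwise t-tests "find" something far more often than the nominal 5%, while a single ANOVA F-test keeps the experiment-wise false-alarm rate honest.

```python
# Eight conditions, NO real differences anywhere. Compare the experiment-wise
# false-alarm rate of (a) all uncorrected pairwise t-tests vs (b) one ANOVA.
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_conditions, n_per_group, alpha = 2_000, 8, 10, 0.05
pairwise_hits = anova_hits = 0

for _ in range(n_sims):
    groups = [rng.normal(size=n_per_group) for _ in range(n_conditions)]
    # (a) 28 uncorrected pairwise comparisons: a hit if ANY pair "differs"
    if any(stats.ttest_ind(a, b).pvalue < alpha for a, b in combinations(groups, 2)):
        pairwise_hits += 1
    # (b) a single F-test across all eight conditions at once
    if stats.f_oneway(*groups).pvalue < alpha:
        anova_hits += 1

print(f"Uncorrected pairwise t-tests: false alarms in {pairwise_hits / n_sims:.0%} of experiments")
print(f"Single ANOVA F-test:          false alarms in {anova_hits / n_sims:.0%} of experiments")
```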