I was going to give this a rest for a while, but this is too good not to post a brief note about.
Posted in the comments of my piece debunking the Geiers' pseudoscience, specifically their laughable "scientific" article claiming to show a decrease in the rate of new cases of autism since late 2002, when thimerosal was removed from all vaccines other than some flu vaccines, was this gem of a comment by one MarkCC, which captured the essence of what was wrong with the Geiers' so-called "statistical analysis" of the VAERS database:
Here's the key, fundamental issue: when you're doing statistical analysis, you don't get to look at the data and choose a split point. What the Geiers did is to look at the data, and find the best point for splitting the dataset to create the result they wanted. There is no justification for choosing that point except that it's the point that produces the result that they, a priori, decided they wanted to produce.

Time trend analysis is extremely tricky to do - but the most important thing in getting it right is doing it in a way that eliminates the ability of the analysis to be biased in the direction of a particular a priori conclusion. (In general, you do that not to screen out cheaters, but to ensure that whatever correlation you're demonstrating is real, not just an accidental correlation created by the human ability to notice patterns. It's very easy for a human being to see patterns, even where there aren't any.)
Redo the Geiers' analysis using any decent time-trend analysis technique - even a trivial one like doing multiple overlapping three-year regressions (i.e., plot the data from '92 to '95, '93 to '96, '94 to '97, etc) and you'll find that that nice clean break point in the data doesn't really exist - you'll get a series of trend lines with different slopes, without any clear break in slope or correlation.
So - to sum up the problem in one brief sentence: in statistical time analysis, you do not get to pick break points in the time sequence by looking at the data and choosing the break point that is most favorable to your desired conclusion.
Exactly! Unfortunately, that's exactly what the Geiers did.
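MarkCC's suggested sanity check, the overlapping three-year regressions, is simple enough to sketch. Here is a minimal Python version on synthetic monthly counts; the window size, the made-up trend, and the date range are purely illustrative, and nothing here reproduces the actual VAERS numbers:

# Rolling three-year regressions over a synthetic monthly series.
# The counts below are made up; they are NOT VAERS data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(1992, 2005, 1 / 12)                                  # monthly time axis
counts = 50 + 4 * (years - 1992) + rng.normal(0, 5, size=years.size)   # smooth, noisy upward trend

window = 36  # three years of monthly points
for start in range(0, years.size - window + 1, 12):                    # slide one year at a time
    t = years[start:start + window]
    y = counts[start:start + window]
    fit = stats.linregress(t, y)
    print(f"{t[0]:.0f}-{t[-1]:.0f}: slope = {fit.slope:5.2f}, r = {fit.rvalue:.2f}")

If the underlying trend is smooth, the windowed slopes drift gradually from one window to the next rather than snapping at a single year, which is exactly the pattern MarkCC predicts you would see if the Geiers' analysis were redone this way.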
A proper statistical analysis of such data, looking for time points at which a rate of change in a variable changes, is designed such that there is no bias in selecting a time point at which a significant change in slope is observed. As much as the Geiers might want to believe that there is a marked change in the slope of the curve beginning around late 2002 to early 2003, they can't assume that there is such a breakpoint before doing the analysis.
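To make that concrete, one standard way to avoid hand-picking the break point is to let the analysis search over every candidate split, take the best one, and then ask how large that "best split" statistic gets on data that truly follow a single trend. A rough Python sketch on synthetic data (again, not the Geiers' numbers) might look like this:

# Unbiased break-point search: scan every candidate split, keep the best
# improvement in fit, then calibrate that maximum against a no-break null
# via a residual bootstrap. Synthetic data only.
import numpy as np

def best_split_stat(t, y, margin=6):
    """Largest relative drop in squared error from splitting the series anywhere."""
    line = np.poly1d(np.polyfit(t, y, 1))
    sse_one_line = np.sum((y - line(t)) ** 2)
    best = 0.0
    for k in range(margin, len(t) - margin):
        sse_two_lines = 0.0
        for tt, yy in ((t[:k], y[:k]), (t[k:], y[k:])):
            seg = np.poly1d(np.polyfit(tt, yy, 1))
            sse_two_lines += np.sum((yy - seg(tt)) ** 2)
        best = max(best, (sse_one_line - sse_two_lines) / sse_two_lines)
    return best

rng = np.random.default_rng(1)
t = np.arange(120, dtype=float)                     # e.g. months
y = 10 + 0.3 * t + rng.normal(0, 3, size=t.size)    # one smooth trend, no true break

observed = best_split_stat(t, y)

# Null distribution: simulate series from the single-line fit plus resampled
# residuals, and run the SAME search on each one, so the search itself is
# accounted for.
line = np.poly1d(np.polyfit(t, y, 1))
resid = y - line(t)
null = [best_split_stat(t, line(t) + rng.choice(resid, size=resid.size, replace=True))
        for _ in range(200)]
p_value = np.mean(np.array(null) >= observed)
print(f"best-split statistic = {observed:.2f}, p = {p_value:.2f}")

Because the same search is run on every simulated no-break series, a split that the search happens to "find" in noise no longer counts as evidence. What you cannot do is what the Geiers did: eyeball the series, pick the split that best fits the story, and then treat that split as though it had been specified in advance.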
Once again, what pseudoscientists like the Geiers never seem to understand is that all those precautions we scientists take with control groups and statistical analyses designed to minimize investigator bias exist because we realize how easy it is for a scientist, particularly a medical scientist who is invested in finding a cure for a particular disease or condition, to be seduced into believing something that is not supported by the data. (If they did understand, they wouldn't use such simplistic and easily debunked "scientific" methodology.) It's a very human tendency, and the scientific method is designed to minimize it. That's why it takes so much training to learn to overcome it.
Some scientists never do overcome this tendency, and if they fall deeply enough into belief over evidence they become pseudoscientists.
Like the Geiers.
Thanks, MarkCC.
In fact, what the Geiers did was a textbook case of the "Texas sharpshooter fallacy," so named from a possibly apocryphal story about a Texan who brags about his target-shooting ability. He stands far back from the side of a barn, fires wildly, hitting it all over the place, and then draws a target around the bullet holes so that it looks as if he hit what he was aiming at.
The point is that in statistics, you can't use the same set of data to generate a hypothesis and then test that hypothesis; if you do so, you're reasoning in a circle. Introductory stats courses do a really bad job of explaining this, just saying "don't look at the data before testing it," which often leads students to something similar to the New Age "interpretation" of quantum mechanics.
Exactly, ebohlman - that's why in microarray experiments or any other sort of biomarker study we first have a 'training set' and then test the hypothesis in a completely separate, independent set of subjects.
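As a toy illustration of why that discipline matters (entirely made-up data and a hypothetical "find the best gene" exercise, not any real microarray study), consider what happens when pure noise is screened on a training set and then checked on held-out samples:

# Toy illustration of the training-set / validation-set discipline described above.
# Entirely synthetic data; no real biomarker study is being reproduced.
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_genes = 60, 500
expression = rng.normal(size=(n_samples, n_genes))   # fake expression matrix (pure noise)
labels = rng.integers(0, 2, size=n_samples)          # fake case/control labels

train, test = np.arange(0, 40), np.arange(40, 60)    # split specified in advance

# "Discover" the gene most correlated with the labels on the TRAINING set only...
scores = [abs(np.corrcoef(expression[train, g], labels[train])[0, 1]) for g in range(n_genes)]
best_gene = int(np.argmax(scores))

# ...then evaluate that same gene on the held-out TEST set.
test_corr = np.corrcoef(expression[test, best_gene], labels[test])[0, 1]
print(f"training |r| = {max(scores):.2f}, held-out r = {test_corr:.2f}")

The gene that looks most impressive on the training samples looks that way precisely because it was selected for it, and its apparent effect evaporates on the held-out samples. Generating and testing the hypothesis on the same data, as the Geiers effectively did with their break point, hides exactly that evaporation.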
Ever see a picture of Dr. Geier?
Does he look like Dracula? Always wanted to meet Dracula.
While I wholeheartedly agree that the abuse of statistics through reading the data before positing a hypothesis is dead wrong, there exists a whole field - exploratory data analysis - that superficially does just that. Under those circumstances, a respectable scientist has to draw attention to any pitfalls that he or she perceives when applying inferential procedures and reaching conclusions 'a posteriori'.
Glad you unearthed this gem from the comment pile: I missed it.
That's a key distinction, isn't it? Real scientists are very careful to qualify the limitations of their methodology, particularly when doing retrospective analyses, which by their very nature are much more prone to bias and incorrect conclusions than prospective studies--even when the data used isn't as questionable as what is contained in the VAERS database. Pseudoscientists don't bother to list the limitations of their analysis or only do so in a very perfunctory fashion, mainly because they don't want to weaken their conclusion, which was usually reached before they ever looked at the data.
The bottom line is that correlation does not equal causation, and the Geiers haven't even been able to demonstrate correlation convincingly.
While I agree with the sentiment of the above statement, I do think a few things ought to be clarified. First, you can choose a split point a priori, e.g., the stock market crashed on March 4, so let's collect data on stock prices and see whether the March 4 crash affected them. I didn't see where the Geiers spelled out whether they chose their point before seeing the data or not. Given that VAERS and CDDS data are public, I conservatively assume not. I would expect to see some formal hypothesis test that directly addresses the split point. The so-called interrupted time series methods are one good class of methods, although I looked at their CDDS data and decided it didn't need time series analysis after all (no autocorrelation or partial autocorrelation). You can also use some regression model-building techniques to address this hypothesis test. Of course, the Geiers tried to justify their change point by looking at the slopes of two lines and comparing them, and, well, that was rather bizarre.
The other thing to note is that the Geiers did not really perform a changepoint analysis. Take a look again. They overlap their two regression lines by a year. I certainly haven't seen that in the 15 years that I have studied and done statistics.
This paper was certainly not a red letter day for statistics.
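For what it's worth, the kind of formal test the statistician commenter above describes, with the break point fixed before anyone looks at the data, is straightforward to set up. Here is a rough Python sketch using statsmodels on synthetic data with an arbitrarily assumed break date (nothing below reproduces the Geiers' numbers or the actual VAERS/CDDS series): include a level-change term and a slope-change term at the pre-specified date, test them directly, and check the residuals for the autocorrelation the commenter mentions.

# Testing a PRE-SPECIFIED break point t0 with a level-change and slope-change term.
# Synthetic series; the break date is an arbitrary illustration, not a real result.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(3)
t = np.arange(120, dtype=float)                     # e.g. months
t0 = 84.0                                           # break point chosen in advance (hypothetical)
y = 10 + 0.5 * t + rng.normal(0, 4, size=t.size)    # one smooth trend, no true break

post = (t >= t0).astype(float)
X = sm.add_constant(np.column_stack([t, post, post * (t - t0)]))
fit = sm.OLS(y, X).fit()
print(fit.summary(xname=["const", "time", "level_change", "slope_change"]))

# Before trusting the OLS standard errors, check the residuals for the
# (partial) autocorrelation the commenter mentions:
print(acorr_ljungbox(fit.resid, lags=[12]))

The t-tests on the level_change and slope_change coefficients mean something only because t0 was not chosen by looking at the series; had it been, their nominal p-values would be close to worthless, which is the whole point of this post.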