This past weekend, my friend Orac sent me a link to an interesting piece

of bad math. One of Orac’s big interests is vaccination and

anti-vaccinationists. The piece is a newsletter by a group calling itself the “Sound Choice

Pharmaceutical Institute” (SCPI), which purports to show a link

between vaccinations and autism. But instead of the usual anti-vac rubbish about

thimerosal, they claim that “residual human DNA contaminants from aborted human fetal cells”

cause autism.

Among others, Orac already covered the nonsense

of that from a biological/medical

perspective. What he didn’t cover, and the reason he forwarded this newsletter to me, is the math:

the basis of their argument is that they discovered key change points in the

autism rate that correlate perfectly with the introduction of various vaccines.

In fact, they claim to have discovered three different inflection points:

- 1979, the year that the MMR 2 vaccine was approved in the US;
- 1988, the year that a 2nd dose of the MMR 2 was added to the recommended vaccination schedule; and
- 1995, the year that the chickenpox vaccine was approved in the US.

They claim to have discovered these inflection points using “iterative hockey stick analysis”.

First of all, “hockey stick analysis” isn’t exactly a standard

mathematical term. So we’re on shaky ground right away. They describe

hockey-stick analysis as a kind of “computational line fitting analysis”. But

they never identify what the actual method is, and there’s no literature on

exactly what “iterative hockey stick analysis” is. So I’m working from a best

guess. Typically, when you try to fit a line to a set of data points,

you use a technique called linear regression. The most common linear regression method is

called the “least squares” method, and their graphs look roughly like least-squares

fitting, so I’m going to assume that that’s what they use.

What least squares linear regression does is pretty simple – but it takes a

bit of explanation. Suppose you’ve got a set of data points where you’ve got

good reason to believe that you’ve got one independent variable, and one

dependent variable. Then you can plot those points on a standard graph, with

the independent variable on the x axis, and the dependent variable on the y

axis. That gives you a scattering of points. If there really is a linear

relationship between the dependent and independent variable, and your

measurements were all perfect, with no confounding factors, then the points would

fall on the line defined by that linear relationship.

But nothing in the real world is ever perfect. Our measurements always

have some amount of error, and there are always confounding factors. So

the points *never* fall perfectly along a line. So we want some way of

defining the *best fit* to a set of data. That is, understanding that there’s

noise in the data, what’s the line that comes closest to describing a linear relationship.

Least squares is one simple way of describing that. The idea is that the

best fit line is the line where, for each data point, you take the difference

between the predicted line and the actual measurement. You square that

difference, and then you add up all of those squared differences. The line

where that sum is *smallest* is the best fit. I’ll avoid going into detail about

why you square it – if you’re interested, say so in the comments, and maybe I’ll write a basics

post about linear regression.
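
To make the recipe concrete, here is a minimal sketch in plain Python. The function names (`fit_line`, `sum_squared_error`) and the five data points are made up for illustration; nothing here comes from the SCPI newsletter. It uses the standard closed-form solution for the least-squares line, then checks that nudging the slope away from that solution makes the sum of squared differences larger.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Square each (measured - predicted) difference and add them all up."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements: roughly linear, with a little noise.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

slope, intercept = fit_line(xs, ys)
best = sum_squared_error(xs, ys, slope, intercept)

# Any other line does worse; here the slope is nudged by 0.1.
nudged = sum_squared_error(xs, ys, slope + 0.1, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} best={best:.3f} nudged={nudged:.3f}")
```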

One big catch here is that least-squares linear regression produces a good result

*if* the data really has a linear relationship. If it doesn’t, then least squares

will produce a lousy fit. There are lots of other curve fitting techniques, which work in

different ways. If you want to treat your data as perfect, you can use different techniques to

progressively fit the data better and better until you have a polynomial curve which

precisely includes every datum in your data set. You can start with fitting a line to two points; for

every two points, there’s a line connecting them. Then for three points, you can fit them precisely

with a quadratic curve. For four points, you can fit them with a cubic curve. And so on.
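
As a rough illustration of the "fit every datum exactly" idea, the sketch below builds the unique cubic through four made-up points using Lagrange interpolation. That is one standard way to construct such a polynomial; the text above doesn't name a specific method, so treat the choice (and the name `lagrange_poly`) as an assumption. It requires all the x values to be distinct.

```python
def lagrange_poly(points):
    """Return a function evaluating the unique polynomial of degree
    len(points) - 1 that passes through every (x, y) pair in points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


# Four hypothetical data points that are nowhere near a straight line.
points = [(0, 2.0), (1, 3.5), (2, 1.0), (3, 4.2)]
cubic = lagrange_poly(points)

# The cubic reproduces every data point, noise and all.
max_miss = max(abs(cubic(x) - y) for x, y in points)
print(max_miss)
```

A perfect fit like this says nothing about whether the underlying relationship is actually cubic; the curve has simply been given enough freedom to hit every point.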

Similarly, unless your data is perfectly linear, you can *always* improve a fit by

partitioning the data. Just like we can fit a curve to two points from the set; then get closer

by fitting it to three; then closer by fitting it to four, we can fit two lines to a 2 way partition

of the data, and get a closer match; then we can get closer with three lines in a three way partition,

and four lines in a four way partition, and so on, until you have a partition for every pair of adjacent

points.

The key takeaway is that no matter *what* your data looks like, if

it’s not perfectly linear, then you can *always* improve the fit by

creating a partition.
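
Here is a small sketch of that effect, under stated assumptions: the data is synthetic (35 points on a single straight line plus random noise, so there is no real change point by construction), and the helper names and the minimum segment size of three points are my own choices. Splitting the data and fitting two lines still reduces the total squared error.

```python
import random


def line_sse(xs, ys):
    """Sum of squared errors of the least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def best_split(xs, ys, min_size=3):
    """Try every split point; return (index, combined SSE) of the best one."""
    candidates = []
    for k in range(min_size, len(xs) - min_size + 1):
        sse = line_sse(xs[:k], ys[:k]) + line_sse(xs[k:], ys[k:])
        candidates.append((sse, k))
    sse, k = min(candidates)
    return k, sse


rng = random.Random(0)
xs = list(range(35))
ys = [2.0 * x + rng.gauss(0, 5) for x in xs]  # ONE slope, plus noise

whole = line_sse(xs, ys)
k, split = best_split(xs, ys)
print(f"one line: {whole:.1f}   best two lines (split at x={xs[k]}): {split:.1f}")
```

The two-line fit wins even though the data was generated from one line, which is exactly why an improved fit on its own is not evidence of a change point.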

For “hockey stick analysis”, what they’re doing is looking for a good

place to put a partition. That’s a reasonable thing to try to do, but you need

to be really careful about it – because, as I described above, you can

*always* find a partition. You need to make sure that you’re actually

finding a genuine change in the basic relationship between the dependent and

independent variable, and not just noticing a random correlation.

Identifying change points like that is extremely tricky. To identify it,

you need to do a lot of work. In particular, you need to create a large number

of partitions of the data, in order to show that there is one specific

partition that produces a better result than any of the others. You can’t just

select one point that looks good, and see if you get a

better match by splitting there. That’s only a start: you need to show that the

inflection point that you chose is really the *best* inflection point.

But you also really need to go Bayesian, and figure out an estimate of the chance

of the inflection being an illusion, and show that the quality of the partition

that you found is better than what you would expect by chance.
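
One simple way to get a feel for "better than chance" is simulation. To be clear about the swap: the sketch below is *not* a Bayesian analysis. It is a plain Monte Carlo check (a parametric bootstrap) that assumes normally distributed noise around a single line, generates many fake datasets with no change point, and counts how often the best split on a fake dataset improves the fit as much as the best split on the real one. All names and the kinked example data are hypothetical.

```python
import random


def fit(xs, ys):
    """Least-squares line: return (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def split_gain(xs, ys, min_size=3):
    """Fraction of the one-line SSE removed by the best two-line split.
    Assumes the data is not perfectly linear (one-line SSE > 0)."""
    whole = fit(xs, ys)[2]
    best = min(fit(xs[:k], ys[:k])[2] + fit(xs[k:], ys[k:])[2]
               for k in range(min_size, len(xs) - min_size + 1))
    return 1.0 - best / whole


def chance_of_gain(xs, ys, trials=200, seed=1):
    """Estimate how often noise around ONE line yields a split gain at
    least as large as the gain observed in (xs, ys)."""
    rng = random.Random(seed)
    slope, intercept, sse = fit(xs, ys)
    sigma = (sse / (len(xs) - 2)) ** 0.5  # residual standard deviation
    observed = split_gain(xs, ys)
    hits = 0
    for _ in range(trials):
        fake = [slope * x + intercept + rng.gauss(0, sigma) for x in xs]
        if split_gain(xs, fake) >= observed:
            hits += 1
    return observed, hits / trials


# Made-up data with a deliberately dramatic kink at x = 17.
rng = random.Random(0)
xs = list(range(35))
ys = [(x if x < 17 else 17 + 10 * (x - 17)) + rng.gauss(0, 1) for x in xs]

observed, p = chance_of_gain(xs, ys)
print(f"gain from best split: {observed:.3f}, chance estimate: {p:.3f}")
```

A small chance estimate only rules out "one line plus this kind of noise"; it does not rule out a smooth curve, or noise that behaves differently from the assumption.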

Finding a partition point like that is, as you can see, not a simple

thing to do. You need a good supply of data: for small datasets, the

probability of finding a good-looking partition purely by chance is quite high. You need to do

a lot of careful analysis.

In general, trying to find multiple partition points is simply not

feasible unless you have a really huge quantity of data, and the slope change

is really dramatic. I’m not going to go into the details – but it’s basically

just using more Bayesian analysis. You know that there’s a high probability

that adding partitions to your data will increase the match quality. You need

to determine, given the expected improvement from partitioning based on the

distribution of your data, how much better of a fit you’d need to find after

partitioning for it to be reasonably certain that the change wasn’t an

artifact.

Just to show that there’s one genuine partition point, you need to show a

pretty significant change. (Exactly how much depends on how much data you

have, what kind of distribution it has, and how well it correlates to the line

match.) But you can’t do it for small changes. To show two genuine change points

requires an extremely robust change at both points, along with showing that

non-linear matches aren’t better than the multiple slope changes. To show

*three* inflection points is close to impossible; if the slope is

shifting that often, it’s almost certainly not a linear relationship.
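
The last point can be sketched with synthetic data. Below, 35 points are sampled from a smooth exponential curve, which has no change point anywhere. The exponential rate of 0.1 and the helper names are arbitrary choices for illustration. Two straight lines fit far better than one, but a single smooth curve beats both.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def line_sse(xs, ys):
    """Sum of squared errors of the least-squares line through the points."""
    slope, intercept = fit_line(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = list(range(35))
ys = [math.exp(0.1 * x) for x in xs]  # smooth curve, no kink anywhere

one_line = line_sse(xs, ys)

# Best two-line fit over every split that leaves at least 3 points per side.
two_lines = min(line_sse(xs[:k], ys[:k]) + line_sse(xs[k:], ys[k:])
                for k in range(3, len(xs) - 2))

# Exponential model: fit a line to log(y), then measure the error on y itself.
# (This is not a least-squares fit on y, but it is good enough for the sketch.)
slope, intercept = fit_line(xs, [math.log(y) for y in ys])
curve = sum((y - math.exp(slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

print(f"one line: {one_line:.2f}  two lines: {two_lines:.2f}  curve: {curve:.2e}")
```

A split-hunting procedure run on data like this would happily report a "slope change", which is why the comparison against non-linear fits matters.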

To get down to specifics, the data set purportedly analyzed by SCPI

consists of autism rates measured over 35 years. That’s just *thirty-five*

data points. The chances of being able to reliably identify

*one* slope change in a set of 35 data points are slim at best. Two?

Ridiculous. Three? Beyond ridiculous. There’s just nowhere *near*

enough data to be able to claim that you’ve got three different inflection

points measured from 35 data points.

To make matters worse: the earliest data in their analysis comes from a

*different* source than the latest data. They’ve got some data from the

US Department of Education (1970->1987), and some data from the California

Department of Developmental Services (1973->1997). And those two are measuring

*different* things; the US DOE statistic is based on a count of the number of 19

year olds who have a diagnosis of autism (so it was data collected in 1989 through 2006);

the California DDS statistic is based on the autism diagnosis rate for children living in

California.

So – guess where one of their slope changes occurs? Go on, guess.

1988.

The slope changed in the year when they switched from mixed data to

California DDS data exclusively. Gosh, you don’t think that that might be a

confounding factor, do you? And gosh, it’s by far the largest (and therefore

the most likely to be real) of the three slope changes they claim to

have identified.

For the third slope change, they don’t even show it on the same graph. In

fact, to get it, they needed to use an *entirely different* dataset from

either of the two others. Which is an interesting choice, given that the CA DDS

statistic that they used for the second slope change actually appears

to show a *decrease* occurring around 1995. But when they switch datasets,

ignoring the one that they were using before, they find a third slope change

in 1995 – right when their other data set shows a *decrease*.

So… Let’s summarize the problems here.

- They’re using an iterative line-matching technique which is, at best, questionable.
- They’re applying it to a dataset that is orders of magnitude too small to be able to generate a meaningful result for a *single* slope change, but they use it to identify *three* different slope changes.
- They use mixed datasets that measure different things in different ways, without any sort of meta-analysis to reconcile them.
- One of the supposed changes occurs at the point of changeover in the datasets.
- When one of their datasets shows a *decrease* in the slope, but another shows an increase, they arbitrarily choose the one that shows an increase.