My friend, fellow ScienceBlogger, and BlogFather Orac asked me to take a look at
<a href="http://www.jpands.org/vol12no3/carroll.pdf">a paper</a> that purportedly shows that abortion is a
causative risk factor for breast cancer, which he <a href="http://scienceblogs.com/insolence/2007/10/abortion_and_breast_cancer_the_chicago_t.php">posted about
this morning</a>. When the person who motivated me to start what’s turned out to be a shockingly
successful blog asks for something, how could I possibly say no? Especially when it’s such a great example
of the misuse of mathematics for political purposes?
The paper is “The Breast Cancer Epidemic: Modeling and Forecasts Based on Abortion and Other Risk
Factors”, by Patrick S. Carroll, published in the Journal of American Physicians and Surgeons (JPANDS).
Before getting to the meat of the paper, there are a couple of preliminary things to say about it.
In an ideal world, every time you read a paper, you’d study every bit of it in great, absorbing detail. But in the real world, you can’t do that. There are too many papers; if you tried to give every paper a full and carefully detailed reading, even if you never stopped to eat and sleep,
you’d be falling further behind every day. So a major skill that you acquire when you learn
to do research is how to triage, and decide how much attention to give to different kinds of papers.
One thing that you should always do when you set out to read a paper is look at
its conclusions. In general, there are a few basic kinds of papers. There are papers presenting
entirely new information; there are papers that are adding something new to an established
consensus; there are papers that are just piling more data onto an established consensus; and there
are papers that are refuting an established consensus. The way that you read a paper depends on
what kind of paper it is.
If a paper is just piling on more evidence, you look at the data that it presents – and don’t pay much attention to anything else, because they’re rehashing what’s already been said. The only really interesting thing in a paper like that is the data. So you focus your attention on the data, how it
was gathered and how it was analyzed, and what (if anything) it adds to what we already know.
If a paper adds something new to a consensus, then you give it more careful attention. You’re
still focused primarily on the data, but you also want to carefully look at how the data was
gathered and analyzed, to see if the new information that they’re adding is valid.
The first and last kinds of paper – the ones that present something totally new, and the ones that refute something for which there is a lot of strongly supported data – you read with
much greater care and attention to detail. These are the papers that make the strongest claims,
and which haven’t been carefully looked at by many different people yet, so they require the most
careful attention and analysis. This paper is a member of that last class: it’s claiming to find
a statistical link which many careful studies have not found. So it’s in line for a very careful reading.
So what’s the source? Well, it’s published in JPANDS. JPANDS is a terrible journal. In fact, the first post on Good Math/Bad Math was a critique of a JPANDS paper that used some of the worst statistics that I’ve ever seen published. That’s bad – the paper is appearing in a very low-credibility journal with a history of not carefully reviewing statistical analysis. That’s certainly not enough
to justify ignoring the paper – but the quality of the journal is a valid consideration. A paper about
this topic that appears in a prestigious cancer or epidemiology journal has more credibility than
a paper that appears in a journal known for publishing garbage. It, quite naturally, brings
to mind the question “Why publish this work in a non-MEDLINE indexed, low quality journal?” Like I said, it’s not enough to justify ignoring the paper, but it does raise red flags right away: this is a paper where you’re going to have to give the data and its analysis a very careful read.
So, on to the paper. What the paper does is select a set of potential risk factors for breast cancer, and then compare the incidence of those risk factors in a group of populations with the incidence of
breast cancer in those same populations. That’s a sort-of strange approach. At best, that approach can
show a statistical correlation, but it’s going to be a weak one – because it doesn’t maintain any link
between individuals with risk factors and the incidence of disease. In general, you use
a correlative study like that when you can’t associate risk factors and incidences with specific
individuals. The author does address this point: he says that it’s difficult for epidemiologists to
obtain information about whether a particular woman had an abortion. So that addresses that criticism, but
the fact remains that it’s going to be much harder to establish a causal link rather than a correlative link using this methodology.
To try to build a model, he selects a list of 7 risk factors: abortion, higher age at first live
birth, childlessness, number of children, breastfeeding, hormonal contraceptive use, and hormone
replacement therapy. This list raises some red flags. It omits a large number of well-known risk factors
which could easily outweigh the factors that are included in the list: smoking, alcohol, genetic risk,
race. (Orac has more to say about that.) But what’s also important to notice is that these factors are
not independent. The number of women who breastfeed is, obviously, strongly correlated with the
number who’ve had children. The women who have a large number of children are much more likely to have
their first child at a younger age than the women who had only one or two children. And it ignores
important correlative factors: higher income women tend to have fewer children, later age at first birth,
and higher rates of breastfeeding. This list looks fishy.
But what comes next is where things just totally go off the rails. He takes the 7 risk factors,
and using information from public health services, does a linear regression of risk factor versus cancer incidence over time. If the linear regression doesn’t produce a strong positive correlation, he throws it away. The fact that this means that he’s asserting that well-known and well-supported
correlations should be discarded as invalid isn’t even mentioned. But what’s worse is, it’s clearly quite deliberate.
On page two, he shows a graph of the data for “mean age of first live birth” plotted against breast
cancer risk. How does he assemble the graph for the linear regression? For each year, he takes the
complete set of women born that year. Then he computes the average age of first birth for all women born
that year, and tries to correlate it with the breast cancer incidence for women born in that year. That’s
ridiculous. It is a completely unacceptable and invalid use of statistics. Anyone who’s
even taken a college freshman course in stats should know that that is absolutely ridiculous. It very
deliberately ignores the influence of every other variable, in obviously foolish ways. I just don’t even
know how to mock this, because it’s so off-the-wall ridiculous.
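To see why correlating cohort-level averages like this is so dangerous, here’s a toy simulation (my own made-up numbers, not anything from the paper). Within each birth-year cohort, the “exposure” has no effect whatsoever on disease at the individual level; both the cohort-mean exposure and the cohort incidence simply drift upward over time. The aggregate correlation comes out strongly positive anyway:

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 birth-year cohorts of 5000 women each. Individual disease risk depends
# ONLY on the cohort's year (a secular trend), never on the exposure.
n_cohorts, n_women = 50, 5000
indiv_r, mean_exposure, incidence = [], [], []

for t in range(n_cohorts):
    exposure = rng.normal(20 + 0.1 * t, 2, n_women)   # cohort mean drifts with time
    p = 0.02 + 0.0005 * t                             # risk is a function of year alone
    disease = (rng.random(n_women) < p).astype(float)
    indiv_r.append(np.corrcoef(exposure, disease)[0, 1])
    mean_exposure.append(exposure.mean())
    incidence.append(disease.mean())

agg_r = np.corrcoef(mean_exposure, incidence)[0, 1]
print(f"average individual-level correlation: {np.mean(indiv_r):+.3f}")
print(f"cohort-level (aggregate) correlation: {agg_r:+.3f}")
```

The individual-level correlations hover around zero, while the cohort-level regression finds a strong positive “link” manufactured entirely by the shared time trend. That’s the classic ecological fallacy, and it’s exactly the trap this paper’s methodology walks into.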
There’s another obvious problem with the whole methodology, which pales in comparison to
the dreadful way that they selected data. But I’ll mention it anyway. Linear regression and the correlation
coefficient measure how well a linear relationship matches the data. They don’t test for
anything else. But there are numerous correlative and/or causal relationships that don’t show a
simple linear relationship. For example, if you look at alcohol consumption plotted against
various diseases, there’s often an initial decrease in risk, which bottoms out and is followed by a large increase in risk. There are often threshold effects, where something doesn’t start
to have an impact until beyond a minimum threshold. And so on. There’s a lot more to relationships than
just linear correlation. But all that the author considers is linear correlation. He gives no reason
for that, and makes no attempt to justify it. It’s just presented as if it’s beyond question.
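Here’s a small illustration of the point – entirely synthetic data, loosely modeled on the U-shaped dose/response curves mentioned above. The relationship is strong, but because it isn’t linear, the Pearson correlation comes out near zero, so a filter like the author’s would throw it away:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic U-shaped dose-response: risk falls, bottoms out, then rises.
dose = np.linspace(-3, 3, 601)                 # exposure, centered at the minimum
risk = dose ** 2 + rng.normal(0, 0.5, dose.size)

linear_r = np.corrcoef(dose, risk)[0, 1]       # linear correlation: near zero

# A quadratic fit, by contrast, captures almost all of the variation.
coeffs = np.polyfit(dose, risk, deg=2)
fitted = np.polyval(coeffs, dose)
r_squared = 1 - np.sum((risk - fitted) ** 2) / np.sum((risk - risk.mean()) ** 2)

print(f"Pearson correlation (linear): {linear_r:+.3f}")
print(f"R^2 of quadratic fit:         {r_squared:.3f}")
```

A near-zero linear correlation, in other words, tells you nothing about whether a factor matters; it only tells you that a straight line isn’t a good description of how it matters.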
Based on those linear regressions, he totally discards everything without a strong linear correlation
as being irrelevant factors that don’t need to be included in the model. That
leaves him with only two factors: fertility (number of live births) and abortion. So then, once
again building on the assumption that linear relationships are the only things that matter, he says
that they can model the breast cancer incidence via a simple linear equation:
Y<sub>i</sub> = a + b<sub>1</sub>x<sub>1i</sub> + b<sub>2</sub>x<sub>2i</sub> + e<sub>i</sub>
In this, Y<sub>i</sub> is the breast cancer incidence in the group of women of age i;
x<sub>1i</sub> is a measure of the number of abortions; x<sub>2i</sub> is a measure of fertility.
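For the record, here’s what fitting this kind of two-predictor linear model looks like. All of the numbers below are my own illustrative choices – the data, the ranges, and the “true” coefficients are synthetic stand-ins, not anything from the paper. Ordinary least squares recovers the coefficients, and the standard errors give the kind of 95% confidence intervals the paper reports:

```python
import numpy as np

rng = np.random.default_rng(2)

# Entirely synthetic stand-ins, NOT the paper's data: one row per age group.
n = 40
x1 = rng.uniform(0.0, 1.0, n)                  # hypothetical abortion measure
x2 = rng.uniform(1.5, 3.0, n)                  # hypothetical fertility measure
Y = 0.3 + 0.5 * x1 - 0.2 * x2 + rng.normal(0, 0.05, n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix [1, x1, x2]
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # ordinary least squares

# Standard errors and 95% confidence intervals (normal approximation).
resid = Y - X @ coef
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
for name, c, s in zip(["a", "b1", "b2"], coef, se):
    print(f"{name} = {c:+.4f}  95% CI [{c - 1.96 * s:+.4f}, {c + 1.96 * s:+.4f}]")
```

A coefficient whose interval straddles zero is statistically indistinguishable from no effect at all – which is exactly what the paper reports for fertility, and exactly why the result should have set off alarm bells.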
They then do another linear regression using this equation to come up with coefficients for
the two measured quantities. The coefficient for fertility is -0.0047, with a 95% confidence interval
ranging from -0.0135 to +0.0041. In other words, according to their measure, fertility – the rate
of live birth – is not a significant factor in breast cancer rates compared to abortion.
Right there, we can stop looking at the paper. When a mathematical model generates an
incredibly ridiculous result, something which is in direct and blatant contradiction with
the known data, you throw the model right out the window, because it’s worthless. The notion that
abortion as a risk factor for breast cancer completely dwarfs the reduction in risk after
childbirth – when we know that having children causes a dramatic decrease in the risk of
breast cancer – is unquestionably wrong. If it were true, what it would mean is that the
number of cases of breast cancer among women who had no children but had an abortion (which,
from what I can estimate from data from a variety of websites is somewhere around 15%) is
so high that it can completely dwarf the risk reduction among women who did
have children (>80%). If that were the case, it would be incredibly obvious in the statistics
of breast cancer rates – you’d have a small sub-population causing an inordinately huge
portion of the breast cancer rates. We know that things like that are easily visible: that’s how
we discovered the so-called “breast cancer genes” – a small group of women were
dramatically more likely to have breast cancer than the population at large.
So we’ve got a model which doesn’t fit reality. What a real scientist does when this happens
is to say “Damn, I was wrong. Back to the ol’ drawing board”, and try to find a new model that
does fit with reality.
But not this intrepid author. He tries to handwave his way past the fact that his model is
wrong, by saying “The coefficient of fertility is rather small, with the 95% confidence interval straddling zero. Some improvement in breastfeeding may be offsetting fertility decline.” No, sorry, you can’t say “My mathematical model has absolutely no relation with reality, but that’s probably because one of the factors that I excluded is probably important, and so now I’m going to go on pretending that the model works.”
The model is wrong. Invalid models do not produce valid results. Stop. Do not pass go. Do not collect $200. Do not get your paper published in a decent journal. Do get laughed at by people who aren’t clueless jackasses.
At this point, we can see just why this paper appeared in a journal like JPANDS. Because it’s
crap that’s just attempting to justify a political position using incredibly sloppy math; math so
bad that a college freshman should be able to see what’s wrong with it. But for the “reviewers” at JPANDS, apparently a college freshman level of knowledge of statistics isn’t necessary for reviewing
a paper on statistical epidemiology.