My friend, fellow ScienceBlogger, and BlogFather Orac asked me to take a look at a paper that purportedly shows that abortion is a causative risk factor for breast cancer, which he posted about this morning. When the person who motivated me to start what’s turned out to be a shockingly successful blog asks for something, how could I possibly say no? Especially when it’s such a great example of the misuse of mathematics for political purposes?

The paper is “The Breast Cancer Epidemic: Modeling and Forecasts Based on Abortion and Other Risk Factors”, by Patrick S. Carroll, published in the Journal of American Physicians and Surgeons (JPANDS).

Before getting to the meat of the paper, there are a couple of preliminary things to say about it.

In an ideal world, every time you read a paper, you’d study every bit of it in great, absorbing detail. But in the real world, you can’t do that. There are too many papers; if you tried to give every paper a full and carefully detailed reading, even if you never stopped to eat and sleep, you’d be falling further behind every day. So a major skill that you acquire when you learn to do research is how to triage, and decide how much attention to give to different kinds of papers.

One thing that you should always do when you set out to read a paper is to look at its conclusions. In general, there are a few basic kinds of papers. There are papers presenting entirely new information; there are papers that are adding something new to an established consensus; there are papers that are just piling more data onto an established consensus; and there are papers that are refuting an established consensus. The way that you read a paper depends on what kind of paper it is.

If a paper is just piling on more evidence, you look at the data that it presents – and don’t pay a lot of attention to much else, because they’re rehashing what’s already been said. The only really interesting thing in a paper like that is the data. So you focus your attention on the data, how it was gathered and how it was analyzed, and what (if anything) it adds to what we already know.

If a paper adds something new to a consensus, then you give it more careful attention. You’re still focused primarily on the data, but you also want to carefully look at how the data was gathered and analyzed, to see if the new information that they’re adding is valid.

The first and last kinds of paper, the ones that present something totally new and the ones that refute something for which there is a lot of strongly supported data, you read with much greater care and attention to detail. These are the papers that make the strongest claims, and which haven’t been carefully looked at by many different people yet, so they require the most careful attention and analysis. This paper is a member of that last class: it’s claiming to find a statistical link which many careful studies have *not* found. So it’s in line for a very careful reading.

So what’s the source? Well, it’s published in JPANDS. JPANDS is a *terrible* journal. In fact, the first post on Good Math/Bad Math was a critique of a JPANDS paper that used some of the worst statistics that I’ve ever seen published. That’s bad – the paper is appearing in a very low-credibility journal with a history of not carefully reviewing statistical analysis. That’s certainly not enough to justify ignoring the paper – but the quality of the journal is a valid consideration. A paper about this topic that appears in a prestigious cancer or epidemiology journal has more credibility than a paper that appears in a journal known for publishing garbage. It, quite naturally, brings to mind the question “Why publish this work in a non-MEDLINE indexed, low quality journal?” Like I said, it’s not enough to ignore the paper, but it does raise red flags right away: this is a paper where you’re going to have to give the data and its analysis a very careful read.

So, on to the paper. What the paper does is select a set of potential risk factors for breast cancer, and then compare the incidence of those risk factors in a group of populations with the incidence of breast cancer in those same populations. That’s a sort-of strange approach. At best, that approach can show a statistical correlation, but it’s going to be a weak one – because it doesn’t maintain any link between *individuals* with risk factors and the incidence of disease. In general, you use a correlative study like that when you *can’t* associate risk factors and incidences with specific individuals. The author does address this point: he says that it’s difficult for epidemiologists to obtain information about whether a particular woman had an abortion. So that addresses that criticism, but the fact remains that it’s going to be much harder to establish a causal link rather than a correlative link using this methodology.

To try to build a model, he selects a list of 7 risk factors: abortion, higher age at first live birth, childlessness, number of children, breastfeeding, hormonal contraceptive use, and hormone replacement therapy. This list raises some red flags. It omits a large number of well-known risk factors which could easily outweigh the factors that are included in the list: smoking, alcohol, genetic risk, race. (Orac has more to say about that.) But what’s also important to notice is that these factors are *not* independent. The number of women who breastfeed is, obviously, strongly correlated with the number who’ve had children. The women who have a large number of children are much more likely to have their first child at a younger age than the women who had only one or two children. And it ignores important correlative factors: higher income women tend to have fewer children, later age at first birth, and higher rates of breastfeeding. This list looks fishy.

But what comes next is where things just totally go off the rails. He takes the 7 risk factors, and using information from public health services, does a linear regression of risk factor versus cancer incidence over time. If the linear regression doesn’t produce a strong positive correlation, *he throws it away*. The fact that this means that he’s asserting that well-known and well-supported correlations should be discarded as invalid isn’t even mentioned. But what’s worse is, it’s clearly quite deliberate.

On page two, he shows a graph of the data for “mean age of first live birth” plotted against breast cancer risk. How does he assemble the graph for the linear regression? For each year, he takes the complete set of women born that year. Then he computes the average age of first birth for all women born that year, and tries to correlate it with the breast cancer incidence for women born in that year. That’s *ridiculous*. It is a *completely* unacceptable and invalid use of statistics. Anyone who’s even taken a college freshman course in stats should know that that is absolutely ridiculous. It’s very deliberately ignoring independence from other variables, in obviously foolish ways. I just don’t even know how to mock this, because it’s so off-the-wall ridiculous.
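
To make the problem with correlating per-birth-year averages concrete, here is a minimal sketch in Python. It is not the paper's data or method: both series are invented, and the names `series_a` and `series_b` are hypothetical. All it tries to show is that two quantities that each drift over time for unrelated reasons will produce a large correlation coefficient, and that the apparent link can largely vanish once the shared time trend is removed.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
years = list(range(1950, 1980))

# Two made-up yearly averages. Each one drifts upward with time, but the
# noise terms are independent: neither series influences the other.
series_a = [24.0 + 0.10 * (y - 1950) + rng.gauss(0, 0.2) for y in years]
series_b = [50.0 + 1.50 * (y - 1950) + rng.gauss(0, 3.0) for y in years]

r_raw = pearson_r(series_a, series_b)

# Strip out the shared time trend by correlating year-to-year changes.
diff_a = [later - earlier for earlier, later in zip(series_a, series_a[1:])]
diff_b = [later - earlier for earlier, later in zip(series_b, series_b[1:])]
r_diff = pearson_r(diff_a, diff_b)

print(f"r of the raw yearly averages: {r_raw:.2f}")
print(f"r of year-to-year changes:    {r_diff:.2f}")
```

The raw correlation is driven almost entirely by the calendar. Anything else that changed steadily over the same decades would correlate just as well.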

There’s another obvious problem with the whole methodology, which pales in comparison to the dreadful way that they selected data. But I’ll mention it anyway. Linear regression and the correlation coefficient measure how well a *linear relationship* matches the data. They don’t test for anything else. But there are numerous correlative and/or causal relationships that don’t show a simple linear relationship. For example, if you look at alcohol consumption plotted against various diseases, there’s often an initial *decrease* in risk, which bottoms out and is followed by a large *increase* in risk. There are often threshold effects, where something doesn’t start to have an impact until beyond a minimum threshold. And so on. There’s a lot more to things than just linear correlation. But all that the author considers is linear correlation. He gives no reason for that, and makes no attempt to justify it. It’s just presented as if it’s beyond question.
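
A small Python sketch shows how badly a linear correlation can miss a real relationship. The dose-response curve below is invented purely for illustration (it is not real alcohol data): risk falls, bottoms out, and rises again. The risk is completely determined by the dose, yet the correlation coefficient comes out as essentially zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical doses from 0.0 to 4.0 in steps of 0.1.
doses = [step / 10 for step in range(0, 41)]

# Hypothetical U-shaped risk: lowest at dose 2.0, higher on either side.
risk = [(dose - 2.0) ** 2 + 1.0 for dose in doses]

r = pearson_r(doses, risk)
print(f"linear correlation of dose and risk: {r:.6f}")
```

A screening rule of "discard any factor without a strong linear correlation" would throw this factor away, even though it is the only thing driving the outcome.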

Based on those linear regressions, he totally discards everything without a strong linear correlation as being irrelevant factors that don’t need to be included in the model. That leaves him with only two factors: fertility (number of live births) and abortion. So then, once again building on the assumption that linear relationships are the only things that matter, he says that they can model the breast cancer incidence via a simple linear equation:

Y_{i} = a + b_{1}x_{1i} + b_{2}x_{2i} + e_{i}

In this, Y_{i} is the breast cancer incidence in the group of women of age *i*; x_{1i} is a measure of the number of abortions; x_{2i} is a measure of fertility.
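
For readers who haven’t fit a model of this shape before, here is a rough Python sketch of ordinary least squares for an equation with an intercept `a` and two coefficients `b1` and `b2`. Everything in it is an assumption made for illustration: the data is synthetic, the "true" coefficients (2.0, 0.5, -1.5) are made up, and the helper names `solve_linear_system` and `fit_two_factor_model` are hypothetical. It is not the paper's code or data.

```python
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, so each row operation updates both sides at once.
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_two_factor_model(x1, x2, y):
    """Least-squares estimates (a, b1, b2) for y = a + b1*x1 + b2*x2 + e."""
    design = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * target for row, target in zip(design, y))
           for i in range(3)]
    return solve_linear_system(xtx, xty)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
x1 = [rng.uniform(0, 10) for _ in range(200)]
x2 = [rng.uniform(0, 5) for _ in range(200)]
# Synthetic outcome built from made-up coefficients plus small noise.
y = [2.0 + 0.5 * u - 1.5 * v + rng.gauss(0, 0.1) for u, v in zip(x1, x2)]

a, b1, b2 = fit_two_factor_model(x1, x2, y)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The fit recovers the made-up coefficients here only because the synthetic data really was generated by a linear model. The procedure happily returns coefficients for any data you hand it, whether or not a linear model makes sense.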

They then do another linear regression using this equation to come up with coefficients for the two measured quantities. The coefficient for fertility is -0.0047, with a 95% confidence interval ranging from -0.0135 to 0.0041. In other words, according to their measure, fertility – the rate of live birth – is *not* a significant factor in breast cancer rates compared to abortion.
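
A quick arithmetic check in Python shows what that interval is saying. Two assumptions are baked in and should be read as such: the upper bound is taken as +0.0041 (the author’s own description of the interval as straddling zero implies a positive upper bound), and the interval is assumed to be a symmetric normal-theory one, so that its half-width is about 1.96 standard errors.

```python
def interval_contains_zero(lower, upper):
    """True when a confidence interval includes zero."""
    return lower <= 0.0 <= upper


coefficient = -0.0047
ci_lower, ci_upper = -0.0135, 0.0041  # upper bound sign is an assumption

# For a symmetric interval the estimate sits at the midpoint.
midpoint = (ci_lower + ci_upper) / 2
half_width = (ci_upper - ci_lower) / 2

# Assumption: normal-theory 95% interval, half-width = 1.96 standard errors.
std_error = half_width / 1.96
z_score = coefficient / std_error

print(f"midpoint of interval: {midpoint:.4f}")
print(f"approx. standard error: {std_error:.4f}")
print(f"approx. z-score: {z_score:.2f}")
print(f"interval contains zero: {interval_contains_zero(ci_lower, ci_upper)}")
```

Under those assumptions the estimate is only about one standard error away from zero, so the model cannot distinguish the effect of fertility from no effect at all.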

Right there, we can stop looking at the paper. When a mathematical model generates an incredibly ridiculous result, something which is in direct and blatant contradiction with the known data, you throw the model right out the window, because it’s worthless. The notion that abortion as a risk factor for breast cancer completely dwarfs the reduction in risk after childbirth – when we know that having children causes a dramatic decrease in the risk of breast cancer – is unquestionably wrong. If it were true, what it would mean is that the number of cases of breast cancer among women who had no children but had an abortion (which, from what I can estimate from data from a variety of websites, is somewhere around 15%) is *so high* that it can completely dwarf the risk reduction among women who did have children (>80%). If that were the case, it would be incredibly obvious in the statistics of breast cancer rates – you’d have a small sub-population causing an inordinately huge portion of the breast cancer rates. We know that things like that are easily visible: that’s how we discovered the so-called “breast cancer genes” – a small group of women were dramatically more likely to have breast cancer than the population at large.

So we’ve got a model which doesn’t fit reality. What a real scientist does when this happens is to say “Damn, I was wrong. Back to the ol’ drawing board”, and try to find a new model that *does* fit with reality.

But not this intrepid author. He tries to handwave his way past the fact that his model is wrong, by saying “The coefficient of fertility is rather small, with the 95% confidence interval straddling zero. Some improvement in breastfeeding may be offsetting fertility decline.” No, sorry, you can’t say “My mathematical model has absolutely no relation with reality, but that’s probably because one of the factors that I excluded is probably important, and so now I’m going to go on pretending that the model works.”

The model is *wrong*. Invalid models do not produce valid results. Stop. Do not pass go. Do not collect $200. Do not get your paper published in a decent journal. *Do* get laughed at by people who aren’t clueless jackasses.

At this point, we can see just why this paper appeared in a journal like JPANDS. Because it’s crap that’s just attempting to justify a political position using incredibly sloppy math; math so bad that a college freshman should be able to see what’s wrong with it. But for the “reviewers” at JPANDS, apparently a college freshman level of knowledge of statistics isn’t necessary for reviewing a paper on statistical epidemiology.