My friend, fellow ScienceBlogger, and BlogFather Orac asked me to take a look at
a paper (http://www.jpands.org/vol12no3/carroll.pdf) that purportedly shows that abortion is a
causative risk factor for breast cancer, which he posted about (http://scienceblogs.com/insolence/2007/10/abortion_and_breast_cancer_the_chicago_t.php)
this morning. When the person who motivated me to start what's turned out to be a shockingly
successful blog asks for something, how could I possibly say no? Especially when it's such a great example
of the misuse of mathematics for political purposes?
The paper is "The Breast Cancer Epidemic: Modeling and Forecasts Based on Abortion and Other Risk
Factors", by Patrick S. Carroll, published in the Journal of American Physicians and Surgeons (JPANDS).
Before getting to the meat of the paper, there are a couple of preliminary things to say about it.
In an ideal world, every time you read a paper, you'd study every bit of it in great, absorbing detail. But in the real world, you can't do that. There are too many papers; if you tried to give every paper a full and carefully detailed reading, even if you never stopped to eat and sleep,
you'd be falling further behind every day. So a major skill that you acquire when you learn
to do research is how to triage, and decide how much attention to give to different kinds of papers.
One thing that you should always do when you set out to read a paper is look at
its conclusions. In general, there are a few basic kinds of papers. There are papers presenting
entirely new information; there are papers that are adding something new to an established
consensus; there are papers that are just piling more data onto an established consensus; and there
are papers that are refuting an established consensus. The way that you read a paper depends on
what kind of paper it is.
If a paper is just piling on more evidence, you look at the data that it presents - and don't pay a lot of attention to much else, because they're rehashing what's already been said. The only really interesting thing in a paper like that is the data. So you focus your attention on the data, how it
was gathered and how it was analyzed, and what (if anything) it adds to what we already know.
If a paper adds something new to a consensus, then you give it more careful attention. You're
still focused primarily on the data, but you also want to carefully look at how the data was
gathered and analyzed, to see if the new information that they're adding is valid.
The first and last kinds of paper, the ones that present something totally new and the ones that refute something for which there is a lot of strongly supported data, you read with
much greater care and attention to detail. These are the papers that make the strongest claims,
and which haven't been carefully looked at by many different people yet, so they require the most
careful attention and analysis. This paper is a member of that last class: it's claiming to find
a statistical link which many careful studies have not found. So it's in line for a very careful read.
So what's the source? Well, it's published in JPANDS. JPANDS in a terrible journal. In fact, the first post on Good Math/Bad Math was a critique of a JPANDS paper that used some of the worst statistics that I've ever seen published. That's bad - the paper is appearing in a very low-credibility journal with a history of not carefully reviewing statistical analysis. That's certainly not enough
to justify ignoring the paper - but the quality of the journal is a valid consideration. A paper about
this topic that appears in a prestigious cancer or epidemiology journal has more credibility than
a paper that appears in a journal known for publishing garbage. It, quite naturally, brings
to mind the question "Why publish this work in a non-MEDLINE indexed, low quality journal?" Like I said, it's not enough to ignore the paper, but it does raise red flags right away: this is a paper where you're going to have to give the data and its analysis a very careful read.
So, on to the paper. What the paper does is select a set of potential risk factors for breast cancer, and then compare the incidence of those risk factors in a group of populations with the incidence of
breast cancer in those same populations. That's a sort-of strange approach. At best, that approach can
show a statistical correlation, but it's going to be a weak one - because it doesn't maintain any link
between individuals with risk factors and the incidence of disease. In general, you use
a correlative study like that when you can't associate risk factors and incidences with specific
individuals. The author does address this point: he says that it's difficult for epidemiologists to
obtain information about whether a particular woman had an abortion. So that addresses that criticism, but
the fact remains that it's going to be much harder to establish a causal link rather than a correlative link using this methodology.
To try to build a model, he selects a list of 7 risk factors: abortion, higher age at first live
birth, childlessness, number of children, breastfeeding, hormonal contraceptive use, and hormone
replacement therapy. This list raises some red flags. It omits a large number of well-known risk factors
which could easily outweigh the factors that are included in the list: smoking, alcohol, genetic risk,
race. (Orac has more to say about that.) But what's also important to notice is that these factors are
not independent. The number of women who breastfeed is, obviously, strongly correlated with the
number who've had children. The women who have a large number of children are much more likely to have
their first child at a younger age than the women who had only one or two children. And it ignores
important correlative factors: higher income women tend to have fewer children, later age at first birth,
and higher rates of breastfeeding. This list looks fishy.
But what comes next is where things just totally go off the rails. He takes the 7 risk factors,
and using information from public health services, does a linear regression of risk factor versus cancer incidence over time. If the linear regression doesn't produce a strong positive correlation, he throws it away. The fact that this means that he's asserting that well-known and well-supported
correlations should be discarded as invalid isn't even mentioned. But what's worse is, it's clearly quite
On page two, he shows a graph of the data for "mean age of first live birth" plotted against breast
cancer risk. How does he assemble the graph for the linear regression? For each year, he takes the
complete set of women born that year. Then he computes the average age of first birth for all women born
that year, and tries to correlate it with the breast cancer incidence for women born in that year. That's
ridiculous. It is a completely unacceptable and invalid use of statistics. Anyone who's
even taken a college freshman course in stats should know that that is absolutely ridiculous. It's very
deliberately ignoring independence from other variables, in obviously foolish ways. I just don't even
know how to mock this, because it's so off-the-wall ridiculous.
There's another obvious problem with the whole methodology, which pales in comparison to
the dreadful way that they selected data. But I'll mention it anyway. Linear regression and the correlation
coefficient measure how well a linear relationship matches the data. They don't test for
anything else. But there are numerous correlative and/or causal relationships that don't show a
simple linear relationship. For example, if you look at alcohol consumption plotted against
various diseases, there's often an initial decrease in risk, which bottoms out and is followed by a large increase in risk. There are often threshold effects, where something doesn't start
to have an impact until beyond a minimum threshold. And so on. There's a lot more to things than
just linear correlation. But all that the author considers is linear correlation. He gives no reason
for that, and makes no attempt to justify it. It's just presented as if it's beyond question.
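To make the point about non-linear relationships concrete, here's a small Python sketch. Everything in it is an assumption for illustration: the numbers are made up (they are not from the paper), the `pearson` helper is my own, and the U-shaped curve is a deliberately symmetric stand-in for the "risk drops, bottoms out, then climbs" pattern described above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical exposure levels 0..10 (made-up, not real data).
dose = list(range(11))

# Made-up U-shaped risk: falls to a minimum at dose 5, then rises again.
# Risk is completely determined by dose, just not linearly.
u_shaped_risk = [(d - 5) ** 2 for d in dose]

# A straight-line relationship, for comparison.
linear_risk = [2 * d + 1 for d in dose]

r_u = pearson(dose, u_shaped_risk)
r_line = pearson(dose, linear_risk)
print(f"U-shaped: r = {r_u:.3f}   straight line: r = {r_line:.3f}")
```

On this toy data the straight line gets a coefficient of 1, while the U-shaped curve gets a coefficient of essentially zero, even though dose determines risk exactly in both cases. A screening rule of "throw away anything without a strong linear correlation" would discard the U-shaped factor.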
Based on those linear regressions, he totally discards everything without a strong linear correlation
as being irrelevant factors that don't need to be included in the model. That
leaves him with only two factors: fertility (number of live births) and abortion. So then, once
again building on the assumption that linear relationships are the only things that matter, he says
that they can model the breast cancer incidence via a simple linear equation:
$Y_i = a + b_1 x_{1i} + b_2 x_{2i} + e_i$
In this, $Y_i$ is the breast cancer incidence in the group of women of age $i$;
$x_{1i}$ is a measure of the number of abortions; $x_{2i}$ is a measure of fertility.
They then do another linear regression using this equation to come up with coefficients for
the two measured quantities. The coefficient for fertility is -0.0047, with a 95% confidence interval
ranging from -0.0135 to +0.0041. In other words, according to their measure, fertility - the rate
of live birth - is not a significant factor in breast cancer rates compared to abortion.
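As a quick sanity check on those numbers, here's a tiny Python sketch. It takes the upper bound of the interval to be positive (+0.0041) and assumes the usual symmetric interval around the point estimate; the variable names are mine, not the paper's.

```python
# Reported fertility coefficient and its 95% confidence interval.
# Assumption: the upper bound is +0.0041, and the interval is symmetric
# around the point estimate (the usual construction for a regression coefficient).
estimate = -0.0047
ci_low, ci_high = -0.0135, 0.0041

midpoint = (ci_low + ci_high) / 2      # should reproduce the point estimate
half_width = (ci_high - ci_low) / 2    # margin of error on either side
straddles_zero = ci_low < 0 < ci_high  # True means zero is a plausible value

print(f"midpoint = {midpoint:.4f}, half-width = {half_width:.4f}")
print("interval contains zero:", straddles_zero)
```

The midpoint matches the reported coefficient, and the margin of error is nearly twice the size of the estimate itself. An interval that contains zero means the data can't distinguish this coefficient from no effect at all.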
Right there, we can stop looking at the paper. When a mathematical model generates an
incredibly ridiculous result, something which is in direct and blatant contradiction with
the known data, you throw the model right out the window, because it's worthless. The notion that
abortion as a risk factor for breast cancer completely dwarfs the reduction in risk after
childbirth - when we know that having children causes a dramatic decrease in the risk of
breast cancer - is unquestionably wrong. If it were true, what it would mean is that the
number of cases of breast cancer among women who had no children but had an abortion (which,
from what I can estimate from data from a variety of websites is somewhere around 15%) is
so high that it can completely dwarf the risk reduction among women who did
have children (>80%). If that were the case, it would be incredibly obvious in the statistics
of breast cancer rates - you'd have a small sub-population causing an inordinately huge
portion of the breast cancer rates. We know that things like that are easily visible: that's how
we discovered the so-called "breast cancer genes" - a small group of women were
dramatically more likely to have breast cancer than the population at large.
So we've got a model which doesn't fit reality. What a real scientist does when this happens
is to say "Damn, I was wrong. Back to the ol' drawing board", and try to find a new model that
does fit with reality.
But not this intrepid author. He tries to handwave his way past the fact that his model is
wrong, by saying "The coefficient of fertility is rather small, with the 95% confidence interval straddling zero. Some improvement in breastfeeding may be offsetting fertility decline." No, sorry, you can't say "My mathematical model has absolutely no relation with reality, but that's probably because one of the factors that I excluded is probably important, and so now I'm going to go on pretending that the model works."
The model is wrong. Invalid models do not produce valid results. Stop. Do not pass go. Do not collect $200. Do not get your paper published in a decent journal. Do get laughed at by people who aren't clueless jackasses.
At this point, we can see just why this paper appeared in a journal like JPANDS. Because it's
crap that's just attempting to justify a political position using incredibly sloppy math; math so
bad that a college freshman should be able to see what's wrong with it. But for the "reviewers" at JPANDS, apparently a college freshman level of knowledge of statistics isn't necessary for reviewing
a paper on statistical epidemiology.
Like I told the high school physics class I taught: The difference between a scientist and a crank is that a scientist can admit a mistake and throw away a theory.
I guess this paper just gives another example of that.
MarkCC wrote: "So what's the source? Well, it's published in JPANDS. JPANDS in a terrible journal."
It's quite fascinating you mention this, because I just received last week a mass mailing spearheaded by Frederick Seitz, pushing a petition denying the existence of global warming. It included as proof a January, 2000 op-ed in the Wall Street Journal, and a 2007 paper in - you guessed it - JPANDS.
The JPANDS mission statement includes:
"...a commitment to publishing scholarly articles in defense of the practice of private medicine, the pursuit of integrity in medical research...Political correctness, dogmatism and orthodoxy will be challenged with logical reasoning, valid data and the scientific method."
The same link also says they have a new focus and name (from Medical Sentinel) and "We have eliminated the news capsules that tend to be old news by the time a quarterly journal is published, and reduced the political commentary in favor of more articles of a scientific nature, particularly if they are relevant to contemporary policy debates."
They do have double-blind peer review so just because you criticize them here doesn't rule out your chances of getting published. :)
They have published on cancer/abortion and vaccine/autism links in the past.
It is the journal of the AAPS (which is a very political organization):
I had to stop reading here:
When a woman is nulliparous, an induced abortion has a greater
carcinogenic effect because it leaves breast cells in a state of
interrupted hormonal development in which they are more
Seems to me one round of birth control pills after an abortion should, given the logic in the quote, "reset" the breast cells to their quiescent state.
"Seems to me" is bad science. There are plenty of things that seem to make sense, but which are wrong. So doing a study like this to examine the question makes sense.
The problem with the paper isn't that they're examining that hypothesis. The problem is *how* they examine it.
There've been other studies that examine that question, and their results have been, pretty uniformly, that there's no statistically significant difference.
What this author does is slap together a piss-poor analysis to create the result he wants, without regard for what the data actually says.
Don't forget Simpson's Paradox. This is a rather common problem when dealing with aggregated data in epidemiology and economics, and one of its fingerprints is a disconnect between known individual relationships and what you see in aggregate relationships.
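For anyone who hasn't seen Simpson's Paradox in action, here's a minimal Python sketch. The two groups and all the numbers are invented purely for illustration, and the `pearson` helper is a hand-rolled assumption rather than anything from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two made-up subgroups. Inside each one, y goes DOWN as x goes up.
group_a_x, group_a_y = [1, 2, 3], [6, 5, 4]
group_b_x, group_b_y = [6, 7, 8], [11, 10, 9]

r_a = pearson(group_a_x, group_a_y)
r_b = pearson(group_b_x, group_b_y)

# Pool the groups and forget which point came from where.
pooled_x = group_a_x + group_b_x
pooled_y = group_a_y + group_b_y
r_pooled = pearson(pooled_x, pooled_y)

print(f"group A: r = {r_a:.2f}, group B: r = {r_b:.2f}, pooled: r = {r_pooled:.2f}")
```

Within each group the relationship is perfectly negative, but the pooled data shows a strong positive correlation, because group B sits higher on both axes. Aggregate-only data can point in the opposite direction from every individual-level relationship underneath it.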
The paper is rather vague on the data collection methods, but note that the plots have varying time axes (1926-1950, 1923-1968, 1926-1954) and the regression is only based on a further subset of 15 data points in table 1 out of 25 in Figure 3.
The response variable is roughly monotone in time (figure 3), so any predictor that is also monotone is going to have a kick-ass correlation coefficient, no causality needed.
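Here's a rough Python sketch of that effect. The three series are invented (none of this is the paper's data); the only thing they share is that they all rise over the same span of years. The `pearson` helper is my own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

years = list(range(1926, 1951))  # 25 hypothetical cohort years

# Made-up "response": rises steadily with cohort year.
incidence = [50 + 2.0 * (y - 1926) for y in years]

# Two made-up "predictors" with no causal link to the response.
# They are simply monotone over the same period.
slow_riser = [math.sqrt(y - 1925) for y in years]
fast_riser = [(y - 1925) ** 2 for y in years]

r_sqrt = pearson(slow_riser, incidence)
r_sq = pearson(fast_riser, incidence)
print(f"slow riser: r = {r_sqrt:.3f}, fast riser: r = {r_sq:.3f}")
```

Both predictors come out with correlation coefficients above 0.9 against the response, purely because all three series trend upward over time.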
It gets even worse. This kind of study violates the learn-confirm paradigm by model-building and conclusion-drawing with the same set of data, and no attempt at any sort of validation of the model.
Nor is there any sort of well-known model-building approach, such as stepwise regression, even attempted. (Ok, ok, so this is second year statistics, but still, easily accessible.)
"I just don't even know how to mock this"
I'll help you..."it's not even wrong".
Thank you for posting this information. Unfortunately, the paper will be quoted by those groups who want to show "proof" that you should "be fruitful and multiply" and "not have abortions". I can respect moral arguments for not having abortions, but the paper doesn't provide scientific support for it.
Given the proposed model (leaving the breast cells in a state of "interrupted hormonal development") I can't see what the difference would be between an abortion and a miscarriage, except that the latter would generally be far more common, at least during early pregnancy. That wouldn't necessarily make his argument false (though it's pretty obvious it is, for other reasons) but it would suggest that focusing on the politically hot abortion issue would be a bit suspect when the real effect would be considerably more general than that. Why isn't he out there terrorizing women who just had miscarriages instead, or as well?
Now *that* is a rhetorical question.
"It's very deliberately ignoring independence from other variables"
Did you mean a different word than "independence" there?
Analysis problems aside, when you get a result that flies in the face of accepted understanding, you have to say "well, either I'm completely wrong, or I've really found something worth investigating", and the next place to go is to try and figure out where you went wrong (since that's much more likely).
Only when you can't find any alternative explanation for what you find do you start looking at what has gone before and try to figure out why it might be wrong.
In either case (your results don't fit because you're wrong or they don't fit because you're right and everyone else is wrong), you have a lot of further work to do before you can publish.
The first response to a weird result should not be "publish!".
[Then he computes the average age of first birth for all women born that year, and tries to correlate it with the breast cancer incidence for women born in that year. ]
I'd say a high schooler shouldn't make this mistake. He wants to imply something about causality from an average? Averages come as mathematical information, not physical causes as basically everyone knows.
[But all that the author considers is linear correlation. He gives no reason for that, and makes no attempt to justify it. It's just presented as if it's beyond question.]
Of course, the math gives him a linear correlation. Math must speak the truth, so the linear correlation must indicate something real. (Please notice the sarcasm here.)
[In other words, according to their measure, fertility - the rate of live birth - is not a significant factor in breast cancer rates compared to abortion.]
Never mind that live birth causes massive chemical changes in women's bodies. Especially, in the few days/week after children leave the womb.
[The notion that abortion as a risk factor for breast cancer completely dwarfs the reduction in risk after childbirth - when we know that having children causes a dramatic decrease in the risk of breast cancer - is unquestionably wrong. ]
Strange, I pre-thought it would work the other way around. Nonetheless, I did at least think that having children would change the risk in breast cancer... as obviously having children yields massive chemical and biological and physical changes. How can anyone think otherwise?
I just don't even know how to mock this, because it's so off-the-wall ridiculous.
The top three states in home ownership are West Virginia, Mississippi, and Alabama. Clearly, poverty is a risk factor for home ownership.
Not absurd enough?
Okay. To return to the topic of the paper being discussed, there's also a negative correlation between UFO sightings and abortion. Clearly, the saucer people are pro-life.
This paper presents a cohort analysis, a rather basic tool for comparing populations. The unit of analysis is the cohort and not the elements comprising the cohort. The meaningfulness of a predictor (such as years of nulliparity) depends on how common it is to the members of a cohort and the linearity of the effect on the outcome. It is difficult to project cohort results down to members of the cohort (which is what the author wishes to do) unless the members of the cohort have common exposure, which is not the case here. He's not aggregating a linear relationship.
Some passing technical notes: (statistical) independence is not required for regression analysis. That's part of what the inv(X`X)X` term adjusts for in a regression formula. The coefficients are partial coefficients, adjusted for the other regressors. It also means his marginal tests are irrelevant to the partial contribution of a regressor (the coefficient in the presence of all other regressors). What he should have looked at was the partial coefficients for his seven predictors. Of course with a sample size of 15, very little is going to be "significant". I'd guess that he originally tried that, went "oops" and started hunting for something publishable.
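The marginal-versus-partial distinction in the note above can be sketched in plain Python. This is only an illustration under stated assumptions: the data is made up so that y depends on x2 alone, x1 is merely correlated with x2, and the `solve` and `ols` helpers are my own small implementation of the normal equations (X'X) b = X'y, not the paper's procedure.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]  # design matrix X
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data: y is driven entirely by x2; x1 just tends to move with x2.
x2 = [1, 2, 3, 4, 5, 6]
x1 = [1, 3, 3, 5, 5, 7]
y = [1 + 2 * v for v in x2]

marginal = ols([x1], y)      # x1 on its own: [intercept, slope]
partial = ols([x1, x2], y)   # x1 adjusted for x2: [intercept, b1, b2]

print("marginal slope for x1:", round(marginal[1], 3))
print("partial coefficients (b1, b2):", [round(c, 3) for c in partial[1:]])
```

On its own, x1 appears to have a substantial slope, but once x2 is in the model its partial coefficient drops to zero (up to rounding). Screening predictors one at a time by their marginal fit says little about what each contributes when the others are present.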
Linear correlation coefficients pick up on all sorts of monotone trends quite nicely. As Mark notes, they botch non-monotone ones.
There is nothing unusual or wrong about looking at cohort data in a scatter plot. That's how you pick up non-linear relationships. He's trying to cram three plots into one in the paper (X vs. Y, X vs Cohort Year, and Y vs Cohort Year). A scatterplot matrix would have been more informative here, as well as highlighting the shifting sample sizes.
As I stated in a comment on Dr. Orac's blog, the notion that the anti-abortionists have any interest in scientific integrity is piffle. Their only interest is in making abortion illegal and lying to achieve that objective is considered perfectly acceptable. It's called lying for Joshua of Nazareth.
Before we talk about who's being biased and selective, I think it's worth actually looking into what breast cancer is at the molecular level.
That's an easy-to-understand, recent paper by one of the giants in breast cancer research - Dr. Russo et al., and a fact sheet by Breast Cancer Prevention Institute:
Thanks for publishing this. I don't have even second-year statistics so every little bit helps. Conclusion: the author and the journal have an axe to grind.
I think you mean "JPANDS is a terrible journal."
One of the big risk factors identified demographically by, of all people, Adelle Davis (the author of popular books about nutrition and vitamins half a century ago), was the amount of fat in the diet before menarche. She correlated breast cancer rates with countries and noted that girls who moved to high-fat countries (e.g. the U.S.) from low-fat countries (e.g. Japan) tended to keep the good statistics. I did not go to the original papers to see if she was reporting them accurately. But I'd be very interested to see someone follow up on her hypothesis.