Gibberish

I don't read Greg Easterbrook, for roughly the same reason I don't read anything else in the sports pages. When I want to get the experience of bulky men straining themselves trying to exceed their innate abilities, I watch C-SPAN.

I was reminded of why I don't read Easterbrook by a comment that Brad DeLong quotes from a post by Matt Yglesias, in which Matt quotes Easterbrook saying that the Lancet's study of excess mortality in Iraq is "silly" and that it:

absurdly estimates that since March 2003 exactly 654,965 Iraqis have died as a consequence of American action. … It's gibberish.

Does the Lancet actually assert that "exactly 654,965 Iraqis have died as a consequence of American action"? Of course not. Is it gibberish? Judge for yourself. As SciBling Tim Lambert points out, the authors wrote:

We estimate that as of July, 2006, there have been 654,965 (392,979-942,636) excess Iraqi deaths as a consequence of the war, which corresponds to 2.5% of the population in the study area.

Understanding why that is different from what Easterbrook wrote turns out to matter in much of our lives. How we handle uncertainty in measured numbers matters, and matters more and more as time goes by. Most importantly, once you understand the basics of statistics, you can spot lies told with statistics pretty easily.

I've TAed stats classes and worked with other students to try to help them understand how stats works. The key problem in understanding statistics is not the math, it's the mindset.

In statistics, you are always dealing with two numbers. The number people tend to focus on is what we call a measure of location. A batting average, the estimate of 654,965 excess deaths above, the 40% of poll respondents saying they'll vote for Jim Ryun – these are all measures of location; they place you somewhere in numerical space.

The example I'll work with here is the geographical center of the United States, a point a little west of where I am now. That is the point on which a map of the country could be balanced. In some sense, it describes where the United States is, and you could use its latitude and longitude to compare the US with other countries.

No one, of course, would claim that the United States exists only at its geographical centroid, the average position of the nation. That's where the second number comes in: a measure of variability (statisticians usually call it a measure of spread or of dispersion). In fact, it's hard to think of a measure of variability that people bandy about the way they do averages.

With our map example, we might choose to indicate the size of the nation by talking about the maximum (or minimum) distance from the edge of the country to the center described above. We might compare a baseball player's batting averages in several games to assess whether he is streaky or consistent in his batting. Margins of error indicate the error associated with a poll's estimate of public opinion.
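To make the two kinds of numbers concrete, here is a minimal sketch in Python with made-up coordinates (nothing below is real geographic data): a measure of location, the centroid of a handful of points, and two measures of spread, the maximum and the typical distance of the points from that centroid.

```python
import math
import statistics

# Made-up points standing in for a map outline (x, y pairs).
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 5.0)]

# Measure of location: the centroid, i.e. the average position.
cx = statistics.mean(x for x, _ in points)
cy = statistics.mean(y for _, y in points)

# Measures of spread: how far the points scatter around that centroid.
distances = [math.dist((x, y), (cx, cy)) for x, y in points]
max_dist = max(distances)
std_dist = statistics.stdev(distances)

print(f"centroid (location): ({cx:.2f}, {cy:.2f})")
print(f"max distance from centroid (spread): {max_dist:.2f}")
print(f"standard deviation of distances (spread): {std_dist:.2f}")
```

The centroid alone tells you where the points sit; only the second pair of numbers tells you how far they sprawl around it.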

What Easterbrook did wrong was look only at the first number, latch onto false precision in the estimate of excess mortality, and ignore the measure of dispersion.

The range the authors quoted is what's called a 95% confidence interval. Roughly speaking, it covers the values of true excess mortality that are consistent with the data the researchers actually collected; for values outside that range, data like theirs would be very unlikely. Even more important than the measure of location (the 655,000 figure) is that the interval does not include 0. That means that if there had actually been no excess mortality since the invasion and occupation, it would be practically impossible to get the data the researchers actually observed.

That's a conclusion that follows from statistical hypothesis testing. We can reject the hypothesis of zero excess mortality. We cannot reject the hypothesis that an excess 950,000 people died, nor that an excess 400,000 died. We know, based on our understanding of the variability of the data, that the number of excess deaths since the invasion lies somewhere in that range, and that it is most likely right around 655,000. The variability matters here.
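The Lancet team derived their interval from a cluster-sample analysis; the sketch below uses only the published figures plus a crude normal approximation that I'm assuming purely for illustration, not the study's actual method. It shows how "the interval excludes zero" translates into rejecting the hypothesis of zero excess deaths.

```python
import math

estimate = 654_965
lower, upper = 392_979, 942_636   # published 95% interval

# Back out a rough standard error from the interval's half-width
# (a normal-approximation shortcut; the real interval is asymmetric).
se = (upper - lower) / (2 * 1.96)

# Test the null hypothesis of zero excess deaths.
z = estimate / se
p = math.erfc(z / math.sqrt(2))   # two-sided p-value under a normal model

print(f"z = {z:.2f}, two-sided p = {p:.1e}")
print("interval includes zero?", lower <= 0 <= upper)
```

The exact z and p values here are artifacts of the approximation; the point is that zero lies far outside the interval, while 400,000 and 950,000 do not.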

Comments

Actually, this does raise the question of why they reported so many more decimal places than they had significant figures. Shouldn't it be, like, 650,000 (400,000 - 950,000)? That's just as meaningful, and less likely to obscure the wide range of potential variation ...

By Scott Simmons (not verified) on 26 Oct 2006 #permalink

I agree with Scott here. I haven't read the paper, but did their data really allow them to express that many significant figures in the numbers?

Note: The U.S. has three geographic centers, depending on whether you count just the lower 48, the 49 continental states, or all 50. The reference point for determining latitude and longitude was Meades Ranch, located about 12 miles north of Lucas. At least it was until 1983, when the reference datum changed from one anchored at that point to an Earth-centered one.

As an actuary, I deal with this issue a lot. There is nothing wrong, statistically, with expressing the data the way the Lancet people did. After all, these guys are well-trained in statistical methods. Trust me, there is no way any layman is going to spot a legitimate error they haven't already considered. This isn't a hard science measure, which is where all this talk of significant digits comes from IIRC. Statistics is a whole different ball game.

On the other hand, politically, I have suggested to my actuarial brethren that we ought not to express our findings in this manner for exactly this reason - it leads laymen to draw false conclusions and levy inappropriate criticisms our way. Better to round the figures so that anyone can tell they are estimates. Had the Lancet reported the mean as 655,000, a lot of these criticisms would not have arisen.

But let's be honest too. Many criticisms of statistical findings, here and elsewhere, are born of an agenda for a certain result, not of an agenda for good statistics. Most of the critics of the Lancet study simply don't like the result, and are going to criticise it any way they can.


I am by no means qualified to render any kind of serious criticism of their use of statistics, both because, as I say, I haven't read the paper, and because I'm not really familiar with the nuances of the field. The numbers themselves, though, are rather striking just because of their apparent exactitude.

You mention it's not a hard science measure, but it is a measure of some kind. Even the apparently simple act of counting bodies has an associated error, especially in an active war zone.

It's just that numbers like those tend to take aback those of us who are used to using sig figs all the time.

I hate significant figures. They were multiplying 2.6 excess deaths per 1,000 by an estimate of population. That estimate was given to the units place, so it's certainly justifiable to give their estimate to the same precision.

The significant figures are a lot less relevant here given that there's a more precise measure of internal variability. Sig figs reflect computational accuracy, but the issue here is variability in the population being sampled.
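As a toy illustration of how that kind of arithmetic produces exact-looking figures, here's a sketch with invented numbers (the rate, its interval, and the population below are placeholders, not the study's inputs): a per-1,000 rate carried to one decimal, multiplied by a population quoted to the person, yields an estimate that looks precise to the last person unless you round it.

```python
# Invented numbers for illustration only (not the study's inputs).
rate_per_1000 = 25.1              # hypothetical excess deaths per 1,000
rate_low, rate_high = 15.0, 36.1  # hypothetical 95% interval for that rate
population = 26_112_353           # hypothetical population, quoted to the person

def excess_deaths(rate):
    """Scale a per-1,000 rate up to the whole (hypothetical) population."""
    return round(rate / 1000 * population)

print(f"point estimate: {excess_deaths(rate_per_1000):,}")   # precise to the person
print(f"interval: {excess_deaths(rate_low):,} - {excess_deaths(rate_high):,}")
print(f"rounded for a press release: {round(excess_deaths(rate_per_1000), -3):,}")
```

The unrounded figure carries the full precision of the population input even though the rate driving it has a wide interval, which is exactly how a number like 654,965 ends up in print.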

Unfortunately the press release omitted those measures of variability. It began "As many as 654,965 more Iraqis may have died since hostilities began in Iraq in March 2003 than would have been expected under pre-war conditions...," when something like "Approximately 655,000 (between 400,000 and 950,000) more Iraqis..." would probably have better illustrated that this was an estimate.

For what it's worth, the researchers probably didn't write that press release.

Fair points Josh. I think culturally most people are programmed to think of estimates as having few digits and maybe some trailing 0's. When taken out of context (for instance when the variability is excised), the point that the number is an estimate can easily get lost.

Just like the other day when the time (to the nearest minute) was announced for America's population to hit 300,000,000. Some reporters said something about the number being an estimate, yet still emphasized the exact tick and tock for that magic number to come.

Anyone who gets their information from sportscasters really needs to change the channel. They have to be, on average, the most unskilled, unprepared professionals there are (with notable exceptions, as with any group). Many of them can't even be bothered to learn the rules of the games they announce, or the players' names and positions. Was I the only one who noticed that many of them didn't even seem to know the BCS computers don't use margin of victory? That's like a vet not knowing what species your dog is.

Sorry to rant somewhat off topic, but please. Sportscasters giving opinions on statistical analysis? Let's hire janitors to build our buildings while we're at it.

Well, it should be pointed out that Easterbrook isn't your typical sportswriter. His "day job" is as a writer and editor at The New Republic, but he writes a weekly NFL column for ESPN.com.