Seed Media Group

Profile

Tim Lambert (deltoidblog AT gmail.com) is a computer scientist at the University of New South Wales.


Robert Chung on David Kane

Category: LancetIraq
Posted on: August 30, 2007 2:47 PM, by Tim Lambert

Boosted from comments. Robert Chung writes:

David Kane wrote:

Anyway, it seems clear to me now that you are bluffing

Me, bluffing about knowing how to calculate a CMR? Ouch, that hurts.

David, what a fascinating example of hubris. You do not know how to do something, so you conclude that no one else can either. However, that something "seems clear to you" has, once again, led you down the wrong path -- though for you this seems about par for the course.

As you ought to have known long ago, we are clearly not "in the same boat." The reason you ought to have known this long ago is that you have had in your possession the proof of what I have been saying -- but with the blinders you're wearing you couldn't see it. 20 months ago, I showed a graph with the cluster CMRs; more remarkably, 14 months ago and then again one month ago, both times in response to your own requests, I pointed you to my code in which can be found the "magic formula" for calculating the pre- and post-invasion CMRs. Perhaps you missed it since the calculations were cryptically and misleadingly labeled "pre-invasion CMR" and "post-invasion CMR"? I leave getting the overall CMR from the cluster CMRs to you as an exercise.

I, and others, have warned you that you have been confounding the estimates of CMR and the estimates of the CIs around those estimates. You keep saying that my estimates of the CMRs and excess mortality depend on bootstrapping. They do not. The proof is in the code you ignore. You keep saying that Roberts' estimates of excess mortality depend on normality. They do not. Despite your exegesis of the rest of the article, the proof is at the bottom of the left hand column on page 3, where the CMR calculation is given. Look at it, and please (please!) recognize that it does not depend on normality.

So this is what it comes down to: the estimates of excess mortality don't depend on normality, but your argument does, and there is no evidence that Roberts and Garfield made that assumption. You have done this even though there is no evidence for it and, in fact, there is evidence against it. Your argument is a phantom argument. There is nothing there. This is what Tim Lambert meant when he said that all you've shown is that assuming normality for the CI including Falluja is wrong.

David, there are legitimate criticisms of the Roberts and Burnham articles. Yours isn't one of them. Your paper is trash, and you're hurting yourself. Do the right thing. Write Malkin and Fumento and tell them you didn't know what you were talking about. Tell them you apologize for the exploded heads. You can even tell them you're working on yet another crazy argument. You don't have to tell them that you accused a demography professor of not knowing how to calculate a CMR.

Comments

#1

I've been curious about a common element that adds a little zest to these controversies even if the controversies are phony or ginned up. Can anyone explain to me why scientists are seemingly reluctant to provide the underlying code in their studies? Is it a research/IP thing? Doesn't it make peer review more difficult?

Most importantly, is this the 500,000th comment?

Posted by: slickdpdx | August 30, 2007 5:27 PM

#2

Tim,

Are you sure that commentator "Robert" is Robert Chung and that he wants his identity revealed here? I have no reason to doubt your claim, but, unless I have missed something, "Robert" has never revealed his last name before and has provided his code/graphics at anonymous sites. Clarification from you, "Robert" or Robert Chung is welcome.

"Robert"

By the way, my claim is not that you are bluffing about how to calculate a CMR in general. Anyone can look up the formula for that. My claim is that you can't show us the code which produces the answer that is reported in L1.

In fact, let me call you out again. You failed to quote my entire sentence -- very rude behavior in the Deltoid community. I wrote:

Anyway, it seems clear to me now that you are bluffing, that you can't demonstrate the steps that the L1 authors went through to provide, say, the pre-invasion CMR of 5.0 (95% CI 3.7 -- 6.3).

What line in your code produces these numbers using the normal approximation, as the L1 authors used? Nothing that I can see . . .

Posted by: David Kane | August 30, 2007 5:57 PM

#3
Are you sure that commentator "Robert" is Robert Chung and that he wants his identity revealed here? I have no reason to doubt your claim, but, unless I have missed something...

That's a bit like Custer asking his aide, "are you SURE those are Sioux warriors slaughtering my troops?"

Posted by: dhogaza | August 30, 2007 6:55 PM

#4

"Can anyone explain to me why scientists are seemingly reluctant to provide the underlying code in their studies? Is it a research/IP thing?"

Of course. That and trying to avoid the myriad questions on how to actually build the code which will inevitably follow unless you have taken the time to make your code distributable. And if you have taken the time to make your code distributable it will certainly already be publicly available.

"Doesn't it make peer review more difficult?"

Absolutely not. If you're at the point where you feel the need to check someone's work in detail you need to write your own implementation of the algorithms they claimed to use. Starting with their code is just a lazy, half-assed way to verify work you don't trust. If you can't write your own implementation of the algorithms you aren't qualified to check the code either.

Posted by: Another PS | August 30, 2007 7:59 PM

#5

Can anyone explain to me why scientists are seemingly reluctant to provide the underlying code in their studies? Is it a research/IP thing?

Of course. That and trying to avoid the myriad questions on how to actually build the code which will inevitably follow unless you have taken the time to make your code distributable.

I disagree on both counts. Claiming that your software does something without providing the code is akin to claiming that you proved something without providing the proof, saying that you want to keep it secret or saying that it is not in a "distributable" state. It should be unacceptable, but for some reason it is.

Doesn't it make peer review more difficult?

Absolutely not. If you're at the point where you feel the need to check someone's work in detail you need to write your own implementation of the algorithms they claimed to use. Starting with their code is just a lazy, half-assed way to verify work you don't trust. If you can't write your own implementation of the algorithms you aren't qualified to check the code either.

Again, I disagree. Without the code there are often many implementation details that are left ambiguous and could significantly impact the results.

Robert (Chung?:-) did provide code - the Iraq mortality study authors should have done the same, but they would be the exception if they did.

Posted by: Sortition | August 30, 2007 8:16 PM

#6

"Without the code there are often many implementation details that are left ambiguous and could significantly impact the results."

There are valid reasons for not giving out one's computer code (not least the fact that such code often takes considerable time and effort to write, and competitive advantage in one's future research may depend on it), and the argument for doing so in all cases is simply not convincing -- at least not to me.

While it is often true that there can be ambiguities without the actual code, this depends a great deal on what the code does and how complex it is. Straightforward statistics can be -- and is -- done with a large variety of different computer codes with the very same results for all practical purposes.

If the documentation for the algorithm and its implementation are good enough, someone should be able to reproduce the results no matter how complex the program is. That's really not as hard to do as some people make it sound.

Also, if one is checking the implementation of the algorithm itself, it is always best NOT to use the same code. That way, one can increase the likelihood that one will catch computer coding errors. In other words, one can make sure that the algorithm was properly implemented.

Of course if you use exactly the same code you should get the same result! If you don't, there is something very seriously amiss.

So what if someone can reproduce the results with the same code? Big deal. Other than catch gross errors, what does that really accomplish? Not much, I'd have to say.

Posted by: JB | August 30, 2007 9:13 PM

#7

Floundering desperately, David Kane wrote:

Are you sure that commentator "Robert" is Robert Chung and that he wants his identity revealed here? I have no reason to doubt your claim, but, unless I have missed something

You've missed something. Quelle surprise, eh? I guess we can add those posts to the list of things you've missed like, for instance, how to calculate a CMR. I'm pretty comfortable with who I am and I don't think I hide it. Many Deltoid regulars have known it for a while. Besides, exactly who I am is pretty irrelevant--it's funny but irrelevant. What's funny and relevant is that you have had the actual evidence before you for so long, both from me and from the Roberts article. No matter who I am, the fact remains that despite all your blustering I, in fact, do know how to calculate a CMR and you, in fact, do not. I think most people would agree that actually understanding mortality rates well enough to calculate them is, um, you know, probably kinda important if you're thinking of critiquing a mortality study.

In fact, let me call you out again. You failed to quote my entire sentence -- very rude behavior in the Deltoid community.

What, you want to call me out again? Dude, you still got marks on you. Don't you want to let the swelling go down a little or dab on some Bactine or something? David, altering the claim after the fact and then acting as if it was there all the time is way past rude--it's deceptive. Worse, it's stupidly pointless deception: anyone can go back and see what you've done. The original claim was about the CMR. You added the rest afterward. Desperation breeds stupidly pointless behavior. Don't be desperate. It's unbecoming.

Posted by: Robert | August 30, 2007 11:39 PM

#8

This is so much fun that I want to invite others to play. As Tim has kindly noted before, I collected and cleaned up the (released) data from L1 in a handy R package which you can download from CRAN. Once you do, you can do stuff like this:

library("lancet.iraqmortality")
data(lancet1)
mean(lancet1$pre.mort.rate)
[1] 5.3
mean(lancet1$post.mort.rate)
[1] 14

In words, if you just take the simple cluster mean of pre and post CMR, you do not get the same estimate as reported in L1. Why not? Good question! I do not know how to replicate the results reported in L1 and do not think that Robert Chung, despite being a professor of demography, can do it either. That is our dispute. (The estimate reported in L1 is 5.0 for pre-war and 12.3 for post-war.)

If Robert can reproduce those numbers, he should prove it instead of just flaunting his credentials.

And, he might be able to! If he can, we would all learn something. Science progresses by such small steps.

But I bet he can't . . . .

Note that I am not implying bad faith on the part of the L1 authors on this point. The calculation they performed was (I believe) a reasonable one which made use of more data than they have actually released. Robert can't replicate it, not because he is stupid, but because no one can.

No one has replicated the results from L1 using the same methods as the authors use.

Posted by: David Kane | August 30, 2007 11:45 PM

#9

Just a reminder to those of you working in non-epidemiology scientific fields that building epidemiological models requires an extensive process of model-building that is not automated. The code is not a "program" as such, but a series of instructions and judgements by the epidemiologist, often including digressions to produce graphs and the like. Disputes from the code will tend not to happen; but unresolvable disputes about the model-building decisions will almost always happen. Often the code may not be particularly accessible - in SPSS for example, residual graphs are almost always constructed from menu options, so it is impossible for the "code" to be sufficient for the model-building process.

This is in essence what David Kane is doing without seeing the code of L1 - he takes issue with the exclusion of Fallujah, which is a model-building decision taken after judging the outcome of the starting model (e.g. examining leverages, etc). There are no confounders in the model that I am aware of, so the rest of the model-building process is trivial and not open to dispute.

Epidemiologists pretty much have to assume that the code is irrelevant, and tackle these "operator" decisions (e.g. by emailing the author to ask "did you consider this variable and if so how").

In fact if an epidemiologist sent me the code for a model, and I could run that code and get the final model without any intervention or checking by me, I would consider the model to be dodgy straight away. That means they have used an automated model selection procedure, which is straightaway suspicious for anything but the most regular of data.

Not that it matters in this case, since David Kane hasn't given any evidence that he could understand any code he was sent by the L1 authors.

Posted by: SG | August 31, 2007 12:00 AM

#10

JB,

So you are presenting two cases:

  1. The code is simple. If the code is simple, why not just publish it and resolve any potential ambiguities? It seems that this is the case of the Lancet Iraq studies.

  2. The code is complex. If the code is complex and contributes significantly to the results in the paper then it should be considered an essential part of the paper, and must be published even if that would reduce the "competitive advantage" of the author. Science is supposed to be about sharing your information - you have to have a very good reason to withhold information. If you want to keep your competitive advantage, just don't publish.

Posted by: Sortition | August 31, 2007 12:18 AM

#11
And, he might be able to! If he can, we would all learn something. Science progresses by such small steps.

And yet, you wouldn't put any effort into reversing the damage done by your right-wing bloggy/media buddies by saying something as simple as "I was wrong! Please, tell all your readers and listeners that I was wrong!"

Damage done. That's the point. "Oh, I have a result that matches my political bias, I'm going to tout it as being the truth!" without regard as to whether or not you know what the hell you're talking about.

We know you don't care, David...

Posted by: dhogaza | August 31, 2007 12:24 AM

#12

David Kane sniveled:

I do not know how to replicate the results reported in L1 and do not think that Robert Chung, despite being a professor of demography, can do it either. That is our dispute. (The estimate reported in L1 is 5.0 for pre-war and 12.3 for post-war.)

If Robert can reproduce those numbers, he should prove it instead of just flaunting his credentials.

And, he might be able to! If he can, we would all learn something. Science progresses by such small steps.

But I bet he can't . . . .

Oho! A bet! Excellent! What will you bet? How about what I suggested earlier? You write to Malkin and Fumento and tell them you don't really know what you're talking about?

Posted by: Robert | August 31, 2007 12:55 AM

#13

Robert,

I am sorry but arguing with you is getting boring. The only claim on this topic that I have ever made is that no one, including you, has been able to replicate the CMR estimates published in L1. I'll make it again. L1 estimates 5.0 for pre-war CMR and 12.3 for post-war CMR. Use the data that Tim provides and show us the R code which produces those numbers. You can't do it. (Admittedly, the comment which seems to have upset you is unclear.)

I never claimed that you, or anyone else, can't calculate a CMR in general. You, or anyone else, can since the formula is trivial. I did question your bona fides to lecture me on the topic. Alas, despite being a professor, you have failed to act like one on this thread.

Posted by: David Kane | August 31, 2007 7:52 AM

#14

ffs,

David:

pre.cmr <- 12000*sum(lancet1$pre.deaths)/sum(lancet1$pre.person.months)
pre.cmr
[1] 4.993758

satisfied?

furthermore

post.cmr <- 12000*sum(lancet1$post.deaths)/sum(lancet1$post.person.months)
post.cmr
[1] 12.30867

satisfied?

I did it with your data frame (I renamed it lancet1 for some stupid reason). When you say you can't replicate the L1 CMR, what exactly do you mean?

Posted by: SG | August 31, 2007 8:47 AM

#15
How about what I suggested earlier? You write to Malkin and Fumento and tell them you don't really know what you're talking about?

Since neither of those people will care nor publish a correction, how about additionally requiring him to post that letter here?

Posted by: dhogaza | August 31, 2007 9:32 AM

#16

On behalf of the many people who don't use R, I am pleased to confirm that I have been able to replicate SG's figures using an Excel spreadsheet.

Will that do, David, or should I be demanding that Microsoft release their source code?

Posted by: Kevin Donoghue | August 31, 2007 10:12 AM

#17

Sortition:

perhaps you neglected to read my very next line (after the simple/complex part)

"If the documentation for the algorithm and its implementation are good enough, someone should be able to reproduce the results no matter how complex the program is. That's really not as hard to do as some people make it sound."

I've done software engineering (over a decade) and programming (going on 30 years) long enough to understand that with proper documentation, it is quite possible to reproduce the same output (though the actual code may be quite different).

If that were not the case, most statistics packages would give different answers for the very same input.

The key element with science is that you provide enough information that someone who is "skilled in the art" can repeat the results. That does not mean you have to give them every last detail. In fact, in most cases, the assumption is made that the person reading your paper is going to have enough background in the area to understand the basic ideas and steps without listing every one of them like you would have to do with a complete novice.

Posted by: JB | August 31, 2007 11:27 AM

#18

SG wrote:

[snip]

Damn you, SG. I was hoping to hustle David into a bet.

For everyone else, SG's calculation is exactly what any epidemiologist, biostatistician, or demographer would have done; it's what Roberts and Garfield must have done. David Kane was calculating an unweighted mean of the cluster CMRs thinking that would get him an overall mean. That only works when the cluster sizes are all the same. In this case, the cluster sizes aren't very different -- but they're just different enough that anyone doing careful analysis needs to take it into account. David doesn't do careful analysis.
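The difference between the two averages can be sketched in a few lines of Python. The three clusters below are hypothetical numbers chosen for easy arithmetic, not the Lancet data, and the function names are made up for illustration. The pooled rate (total deaths over total person-months, as in SG's calculation) equals the person-month-weighted mean of the cluster rates, while the unweighted mean drifts away from it as soon as cluster sizes differ.

```python
# Hypothetical clusters (NOT the Lancet data): deaths and person-months observed.
clusters = [
    {"deaths": 2, "person_months": 3000},
    {"deaths": 1, "person_months": 4000},
    {"deaths": 3, "person_months": 5000},
]


def cluster_cmr(deaths, person_months):
    """Crude mortality rate in deaths per 1,000 people per year."""
    return 12000 * deaths / person_months


def pooled_cmr(clusters):
    """Total deaths over total person-months, scaled to per 1,000 per year."""
    total_deaths = sum(c["deaths"] for c in clusters)
    total_pm = sum(c["person_months"] for c in clusters)
    return 12000 * total_deaths / total_pm


def unweighted_mean_cmr(clusters):
    """Simple mean of the cluster rates, ignoring cluster size."""
    rates = [cluster_cmr(c["deaths"], c["person_months"]) for c in clusters]
    return sum(rates) / len(rates)


def weighted_mean_cmr(clusters):
    """Mean of the cluster rates, weighted by person-months in each cluster."""
    total_pm = sum(c["person_months"] for c in clusters)
    weighted_sum = sum(
        cluster_cmr(c["deaths"], c["person_months"]) * c["person_months"]
        for c in clusters
    )
    return weighted_sum / total_pm


if __name__ == "__main__":
    print("pooled:    ", pooled_cmr(clusters))
    print("weighted:  ", weighted_mean_cmr(clusters))
    print("unweighted:", unweighted_mean_cmr(clusters))
```

With equal-sized clusters all three functions would agree, which is why the unweighted mean only "works when the cluster sizes are all the same."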

Kevin has checked this with Excel so others can, too. You may want to use this file, which is already in .csv format and can be read directly into Excel.

  1. Is the properly calculated pre-invasion CMR = 5? Yup.
  2. Is the post-invasion CMR = 12.3? Yup.
  3. Is the pre-invasion CMR excluding Falluja = 5.1? Yup.
  4. Is the post-invasion CMR excluding Falluja = 7.9? Yup.

You can calculate the excess mortality including Falluja: 17.8 months, 24.4 million people:

(12.3 - 5)/1000 * (17.8/12) * 24400000 = 264000

To calculate the excess mortality excluding Falluja, do the same thing as above but remember that in excluding Falluja, you're only estimating for 32/33rds of the country:

(7.9 - 5.1)/1000 * (17.8/12) * 24400000 * (32/33) = 98000

The relative risk including Falluja: (12.3/5) = 2.5

The relative risk excluding Falluja: (7.9/5.1) = 1.5

No assumptions about bootstrapping. No assumptions about normality, or any other sampling distribution. All of the estimates reported in the Roberts article, replicated.
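For anyone who would rather check the arithmetic above in code than in a spreadsheet, here is a minimal Python sketch. It uses only the rounded figures quoted in this comment (CMRs of 5, 12.3, 5.1 and 7.9, 17.8 months, 24.4 million people, 32/33 coverage when Falluja is excluded); the function and parameter names are invented for illustration, and the results are point estimates only, with no confidence intervals.

```python
def excess_deaths(post_cmr, pre_cmr, months, population, coverage=1.0):
    """Excess deaths from pre/post CMRs given in deaths per 1,000 per year.

    coverage is the fraction of the population the estimate applies to
    (32/33 when one of the 33 clusters is excluded).
    """
    return (post_cmr - pre_cmr) / 1000 * (months / 12) * population * coverage


def relative_risk(post_cmr, pre_cmr):
    """Ratio of post-invasion to pre-invasion mortality rate."""
    return post_cmr / pre_cmr


MONTHS = 17.8
POPULATION = 24_400_000

including = excess_deaths(12.3, 5.0, MONTHS, POPULATION)
excluding = excess_deaths(7.9, 5.1, MONTHS, POPULATION, coverage=32 / 33)

if __name__ == "__main__":
    # Rounded to the nearest thousand, as in the text.
    print("excess deaths including Falluja:", round(including, -3))
    print("excess deaths excluding Falluja:", round(excluding, -3))
    print("relative risk including Falluja:", round(relative_risk(12.3, 5.0), 1))
    print("relative risk excluding Falluja:", round(relative_risk(7.9, 5.1), 1))
```

Nothing in these functions refers to a sampling distribution; that only enters when confidence intervals are put around the estimates.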

David, once again, you have shown that you are eager, determined, self-confident, clueless, misguided, and incompetent. Your entire argument is built on: "I can't figure it out, so no one can; since no one can figure it out, why bother asking anyone else?" David, you're spanked. You're drubbed, whupped, and schooled. You deserve all of it. You need to read this.

One more thing: "Michael Fumento! Michelle Malkin! Tim Curtin! Shannon Love! Can you hear me? Your boy took a hell of a beating! Your boy just took one hell of a beating!"

Posted by: Robert | August 31, 2007 11:27 AM

#19

"Alas, despite being a professor, you have failed to act like one on this thread."

What does it mean to "act like a professor"?

Do all professors act (behave) the same when subjected to the same external forces?

or, put another way, are professors more like apples? Or like electrons?

Is there an "Uncertainty Principle" for professors?

These are very important questions.

Posted by: JB | August 31, 2007 11:49 AM

#20

So, it's that simple and absurd? Kane simply didn't know he had to adjust for cluster size?

I've been loosely following this argument, without bothering to dive in and look at the data myself. In this thread, it's been clear that Robert Chung already knew what the issue was, and had the same results as Lancet, or he would not have been baiting Kane so strongly. I was looking forward to seeing what the issue was, and expecting something interesting and perhaps even a bit subtle, something from which I might learn a bit about demography.

But - a failure to consider weighting, with different cluster sizes?

I'm just a poor biologist, mathematically acceptable but no more, trained through linear algebra, fought my way successfully through p-chem and stat methods, spent my time in SAS on a Vax-VMS, just enough training to know that I ALWAYS want to confirm any complex analysis with a competent statistician - but even I am startled, befuddled, bemused - astounded, actually - that anyone who considers himself competent to make this kind of attempted critique could make, and defend without thought that he might be wrong, that kind of basic error.

Posted by: Lee | August 31, 2007 12:12 PM

#21

David -- just stay down!

Posted by: jre | August 31, 2007 12:16 PM

#22

I am impressed someone can do the calculations easily in Excel, but also, what's wrong with R? I got R to help a scientist friend overseas (I couldn't help debug stuff with it till I knew how to use it). It's a great system, geared excellently to "checking out" and "checking in" large data sets. And it's free!

Posted by: Marion Delgado | August 31, 2007 12:17 PM

#23

Sortition wrote:

"I disagree on both counts. Claiming that your software does something without providing the code is akin to claiming that you proved something without providing the proof, saying that you want to keep it secret or saying that it is not in a "distributable" state. It should be unacceptable, but for some reason it is."

and

"Again, I disagree. Without the code there are often many implementation details that are left ambiguous and could significantly impact the results.

Robert (Chung?:-) did provide code - the Iraq mortality study authors should have done the same, but they would be the exception if they did."

I guess we have a fundamentally different understanding of what a journal article should be. The point of an article ought to be to report particular results, not to claim that some code does 'X'. That's what software companies do, not scientists.

In the course of reporting your results you need to give enough information so someone could reproduce your results. If there are ambiguities that could significantly affect the results you haven't really given enough information to reproduce the results, have you? I would further claim that in general you should give the minimum amount of information needed to reproduce your results or your paper turns into a description of your coding practices and other methodology rather than a discussion of your results.

Posted by: Another PS | August 31, 2007 12:28 PM

#24

Lee said: "So, it's that simple and absurd? Kane simply didn't know he had to adjust for cluster size?"

I think this is precisely what the person who invented the term "cluster-fuck" had in mind.

Posted by: JB | August 31, 2007 12:35 PM

#25

JB,

I've done software engineering (over a decade) and programming (going on 30 years) long enough to understand that with proper documentation, it is quite possible to reproduce the same output (though the actual code may be quite different).

This is certainly true since it is a tautology. The problem is that we may have a hard time agreeing what constitutes "proper documentation." If the code is provided, then there is no room (or at least much less room) for disagreement. I have seen many published papers which left enough details undocumented to allow significant manipulation of the results.

But even if we accept your claim that publishing the code is not always necessary, I cannot see what damage would be done by always doing so. Without a good reason not to publish the code, it seems best to always publish the code, even if sometimes it may not be necessary.

Posted by: Sortition | August 31, 2007 12:43 PM

#26

"The point of an article ought to be to report particular results, not to claim that some code does 'X'. That's what software companies do, not scientists."

I never claimed that a scientific paper should just "claim that some code does X".

Perhaps you are not familiar with the term "Documentation" as it applies to computer software, but if such documentation is done properly, it tells you all you need to know to reproduce the same output for a given set of inputs.

Say I write a paper that claims I have a method for finding the hypotenuse of a right triangle given its two legs (contrived, I'll admit, but it serves to illustrate my point).

To demonstrate the result, I can either

1) provide the computer source code that does it

2) give the result for one right triangle and tell how one can reproduce the same result for that triangle (and other right triangles) -- ie, by "documenting" the algorithm (ie, Pythagorean theorem) and its implementation.

Actually, I don't even need to document the implementation in the above example. Anyone who knows anything about computer programming at all should be able to reproduce the result from the Pythagorean theorem alone.

For scientific purposes, 1 and 2 are equivalent, (though #2 admittedly takes more work on the part of the person trying to repeat the experiment)

The only difference between that simple example and more complex problems is the detail that is required in the documentation. But that does not mean it is not possible. In fact, it is done at software houses every day throughout the world. If documentation is good (complete, accurate), it is all that is needed.
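As an illustration of the triangle example, here is a small Python sketch. The first function is written only from the documented algorithm (the Pythagorean theorem); the second simply calls the standard library's math.hypot, standing in for "the author's code". The function names are hypothetical, and this says nothing about how well the argument carries over to more complex programs.

```python
import math


def hypotenuse_from_theorem(a, b):
    """Implemented from the documentation alone: c = sqrt(a^2 + b^2)."""
    return math.sqrt(a * a + b * b)


def hypotenuse_from_library(a, b):
    """Stand-in for the original author's code: the standard library routine."""
    return math.hypot(a, b)


if __name__ == "__main__":
    for legs in [(3, 4), (5, 12), (1, 1)]:
        independent = hypotenuse_from_theorem(*legs)
        original = hypotenuse_from_library(*legs)
        # The two implementations share no code, so agreement is a real check.
        print(legs, independent, original, math.isclose(independent, original))
```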

Posted by: JB | August 31, 2007 12:57 PM

#27

To demonstrate the result, I can either

1) provide the computer source code that does it

2) give the result for one right triangle and tell how one can reproduce the same result for that triangle (and other right triangles) -- ie, by "documenting" the algorithm (ie, Pythagorean theorem) and its implementation.

Again: I have seen many papers that claim to do 2), but leave enough details out so that what they actually do becomes significantly ambiguous. You will probably claim that those papers were not well written, which may be true, but they were published nonetheless.

Again: I do not see any reason not to require both 1) and 2) - do you?

Posted by: Sortition | August 31, 2007 1:12 PM

#28

I will certainly agree that saying something can be done in practice does not mean it will be done. In fact, that is a major problem with far too many software projects -- that the documentation does not adequately describe the program.

But there are certainly no guarantees of anything in life. :)

I already provided the reason why I think people should be able to keep their source code private. It really is a matter of competitive advantage.

If a scientist -- Stephen Wolfram, for example -- puts years into a software project like Mathematica and then uses it to calculate results for a scientific paper, does that mean he has to provide the source code for his Mathematica program? (Note I am talking about the underlying source code for Mathematica.)

I think not, but I think we probably have an unresolvable disagreement on this -- and it all boils down to a matter of opinion anyway.

Posted by: JB | August 31, 2007 1:52 PM

#29

It really is a matter of competitive advantage.

So the idea is that you give proper documentation so that your competitors could reproduce your work, but you don't give the code so that it is not too easy for them to do it?

To me, this seems like a nasty hybrid between science and business. It also appears to encourage writing deliberately vague documentation to make the lives of the competitors even harder. If to you this approach makes sense, then I guess we will indeed have to agree to disagree.

Posted by: Sortition | August 31, 2007 2:29 PM

#30

Marion Delgado,

One attraction of using Excel rather than R is that the data is already available as a spreadsheet here. Les Roberts made it available to David Kane - to whom all credit for taking the trouble of chasing after it and passing it on to Tim Lambert. It's a pity he didn't have Robert Chung by his side to show him how to use it! Incidentally, for those who don't like paying Microsoft for spreadsheet software, there is always Open Office.

I'm sure R is well worth learning. But the spreadsheet is surely the simplest tool for showing that David Kane's critique really doesn't amount to much. There is no point demanding to see the code used by researchers to obtain estimates if all the code has to do is simple arithmetic. Obviously the calculation of the bootstrapped CIs around the estimates is another matter; it would be interesting to know exactly how that was done. But it's hard to see that anything important turns on it, except for the people dsquared aptly calls percentile fetishists, who will no doubt feel the earth move if they can find some semi-plausible algorithm which squeezes 2.5 percent of the excess-death CI below zero, even if the upper 2.5 percent limit goes to the stratosphere.

Posted by: Kevin Donoghue | August 31, 2007 2:44 PM

#31

So the idea is that you give proper documentation so that your competitors could reproduce your work, but you don't give the code so that it is not too easy for them to do it?

That is precisely it.

Telling someone (even in detail) how to do something is a far cry from handing them the source code that allows them to immediately start where you are and improve on it.

Like business, science is competitive, in case you had not noticed.

Actually, there is another major advantage (to the actual science) that I alluded to above: independent calculations (coding) are better than dependent ones.

If someone writes their own code to verify your results, it is much more likely that coding errors will be caught. There is a very famous example of a very involved computer calculation in physics that got an answer that was NOT consistent with QED Theory. A lot of people spent a lot of time scratching their heads (years) wondering why the theory did not agree with the experiments, until it was realized that the groups that had done the calculation and come up with the same answer (supposedly independently) had actually shared their work at a critical point.

They all made the same error, which would not have been the case had they each done the calculation from scratch.

So yes, there is a major advantage to be had from doing that.

The other advantage -- and this is actually an advantage to the person trying to reproduce the results -- is that there is a chance they will notice something that the first experimenter did not, or at least come to a fuller understanding of the problem.

I must say that I really do think it is a matter of laziness more than anything else when it comes to demands for providing all source code.

Posted by: JB | August 31, 2007 2:52 PM

#32

Like business, science is competitive, in case you had not noticed.

Only too well (if by science you mean academic activity as it is happening in reality). But it shouldn't be that way and doesn't have to be that way, at least not to the extent it is.

Somehow, when teaching science we always emphasize the collaborative and open nature of the activity. The ideal, it seems, is very different from reality.

Posted by: Sortition | August 31, 2007 3:11 PM

#33

Sortition -

Since your questions about releasing code were interesting and valid, please read JB's replies, because I think he answered them completely. Releasing code could make it less likely to catch the inevitable errors.

And, (at the risk of piling on), thanks to all for making your excellent education of Mr. Kane so very clear.

Posted by: Mark Shapiro | August 31, 2007 3:15 PM

#34

[David Kane was calculating an unweighted mean of the cluster CMRs thinking that would get him an overall mean]

I just threw up a little bit in my mouth.

On the more interesting subject, I am on Team Sortition. In general, more code ought to be made available. On the other hand I do agree that it shouldn't be part of the peer review process for the reasons JB mentions - if you're checking someone else's work you shouldn't be using their code, for by and large the same reason that you will never learn anything from a textbook that has all the answers to the problem sets in the back.

Posted by: dsquared | August 31, 2007 5:19 PM
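
The error quoted at the top of the comment above is easy to show with numbers. The sketch below uses made-up cluster figures (not the survey's data) and hypothetical variable names. It assumes the usual definition of a crude mortality rate as deaths divided by person-time at risk. Under that definition the overall rate is total deaths over total exposure, which equals the exposure-weighted mean of the cluster rates and generally differs from their unweighted mean.

```python
import math

# Hypothetical data, invented for illustration (NOT the survey's data):
# (deaths, person-months of exposure) for each cluster.
clusters = [(2, 1200.0), (0, 900.0), (5, 1500.0), (1, 400.0)]

def rate_per_1000_per_year(deaths, person_months):
    """Crude mortality rate in deaths per 1000 people per year."""
    return deaths / person_months * 12 * 1000

cluster_cmrs = [rate_per_1000_per_year(d, pm) for d, pm in clusters]

# Unweighted mean: every cluster counts equally, whatever its exposure.
unweighted_mean = sum(cluster_cmrs) / len(cluster_cmrs)

# Pooled rate: total deaths over total exposure.
total_deaths = sum(d for d, _ in clusters)
total_pm = sum(pm for _, pm in clusters)
pooled_cmr = rate_per_1000_per_year(total_deaths, total_pm)

# Exposure-weighted mean of the cluster rates: algebraically the same thing.
weighted_mean = sum(r * pm for r, (_, pm) in zip(cluster_cmrs, clusters)) / total_pm

assert math.isclose(pooled_cmr, weighted_mean)
print(f"unweighted mean of cluster CMRs: {unweighted_mean:.1f}")
print(f"pooled (exposure-weighted) CMR:  {pooled_cmr:.1f}")
```

The two figures agree only when every cluster contributes the same person-time, which a real survey will rarely deliver.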

#35

Since your questions about releasing code were interesting and valid, please read JB's replies, because I think he answered them completely.

I beg to differ. I feel that my points regarding papers which produce results that are ambiguous due to missing details in the specifications of algorithms have not been properly addressed. Saying "then the authors should have put more details in" is just wishful thinking, not a solution.

BTW, I always make a point of reading closely what the people who respond to me write, even though sometimes I get the feeling that my efforts are not reciprocated.

Releasing code could make it less likely to catch the inevitable errors.

On the contrary - it is usually claimed that one of the advantages of open source software is that having many people view the code makes it more likely that bugs would be caught.

If I develop my code according to your documentation and I discover that my results differ from yours, it would be very difficult to discover if I have a bug, you have a bug, both of us have bugs, or (the most likely situation) I simply made a few design decisions regarding certain details that are different from your decisions.

Posted by: Sortition | August 31, 2007 5:32 PM

#36
If I develop my code according to your documentation and I discover that my results differ from yours, it would be very difficult to discover if I have a bug, you have a bug, both of us have bugs, or (the most likely situation) I simply made a few design decisions regarding certain details that are different from your decisions.

Whereas if you run the same code, you'll get the exact same numbers and never know there was a problem.

Posted by: pough | August 31, 2007 6:28 PM

#37

"On the contrary - it is usually claimed that one of the advantages of open source software is that having many people view the code makes it more likely that bugs would be caught."

My experience is that visual inspection of source code is actually not a very good way to find bugs. This is because source code is usually not very well documented and scientists in particular (not computer scientists but other ones) are notorious for writing spaghetti code that uses gotos and other such niceties that make it virtually impossible to follow.

Much easier to follow higher level documentation of what the code does (or at least of what it is supposed to do). BTW, there is also an advantage to forcing scientists to provide documentation for their code in that it may increase the chances that they find errors in their own implementation.

What it comes down to is this:

If two groups do a calculation independently (using the same methodology, algorithms, etc, but different coding) and get the same answer, the likelihood increases that they have at least done the coding right. The algorithm could still be faulty of course, but presumably if they have provided that in the documentation, someone can also check that.

On the other hand, if they get different answers, then that is a flag that there is a problem, of course. More investigation is then required to determine what the problem is. Clarification on the part of the original investigator may be required at that point. This is really not any different from the way science has always worked (ie, before computers came onto the scene).

I have seen this argument about showing source code many times before and one thing has always puzzled me. Perhaps it is because I was trained in science some time ago, but when I was at university learning to write scientific papers, I learned to describe my methods and materials so that someone with a reasonable understanding of the subject might repeat my experiment.

For some reason, that standard seems to have changed. Now it seems to have become "Do everything for the next guy -- so he/she does not have to do anything except start the program and write down the numbers that come out."

Robert Chung seems to have done that for David Kane above. I find it absurd that someone would have to do that for a researcher at a University like Harvard.

Posted by: JB | August 31, 2007 6:40 PM
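
The cross-check described in the comment above (same method, different coding, compare the answers) can be sketched in a few lines. The statistic, the data and the function names here are illustrative assumptions, chosen only because a sample variance can be coded in two genuinely different ways. Agreement suggests both codings are right. As the comment notes, it says nothing about whether the method itself is sound.

```python
import math

def variance_two_pass(xs):
    """Sample variance, coded the textbook way: mean first, then deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_welford(xs):
    """Sample variance, coded independently with Welford's one-pass update."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)

def results_agree(a, b, rel_tol=1e-9):
    """Two codings rarely match bit for bit, so compare within a tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)

data = [4.0, 7.0, 13.0, 16.0]  # hypothetical measurements
v1 = variance_two_pass(data)
v2 = variance_welford(data)

if results_agree(v1, v2):
    print("independent implementations agree:", v1)
else:
    print("mismatch, investigate both codings:", v1, v2)
```

A mismatch here would not say which version is wrong, only that at least one of them needs a second look.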

#38

pough:

Whereas if you run the same code, you'll get the exact same numbers and never know there was a problem.

Why would you just run the same code on the same dataset? That has already been done and reported on. The idea in providing code is to enable other people to examine the working of the algorithm and to apply it to other datasets.

JB:

i learned to describe my methods and materials so that someone with a reasonable understanding of the subject might repeat my experiment.

I don't know your work - it may be up to the standards of excellence you put up (although, a priori, you do seem overly self-confident here). There are, however, many papers which are not up to those standards. In those cases I need to see the code to understand exactly what was done.

You have given two reasons for not releasing code:

  1. To maintain competitive advantage.

  2. To force others to repeat the coding work as a way to verify correctness of the results.

I find both of these arguments to be anti-scientific. The first is a way to handle adversaries or do business, not science. If we accept the reasoning in the second argument, we might as well never publish any results at all since some people may accept those results at face value rather than examine them for errors. We can similarly argue that if we don't publish results, we force others to duplicate it and in that way verify it.

Posted by: Sortition | August 31, 2007 8:11 PM

#39

most people don't release their code, because they want to clean it up, before somebody sees it. most short code, written by an individual, will contain ZERO documentation and several inelegant constructs, that need real work to be replaced.

most people are busy these days, so they simply don't find time to invest into working code.

most people will provide the code, if personally asked by a person with reasonable interest.

it would be nice, if more code was available, but it is not realistic to hope for it.

Posted by: sod | August 31, 2007 9:48 PM

#40

most people don't release their code, because they want to clean it up, before somebody sees it.

most people are busy these days, so they simply don't find time to invest into working code.

it would be nice, if more code was available, but it is not realistic to hope for it.

It is simply a matter of making it a requirement for publication. People find the time to handle all the other requirements of publication - I see no reason why this would be any different.

Posted by: Sortition | August 31, 2007 10:40 PM

#41

Sortition and dsquared make the case for openness, but I think it is trumped by the need for independent verification. Don't we need experiments to be run by different people, in different times and places, to be confident that the conclusions are robust? JB's example of the error being propagated in QED is cautionary.

Also, what happens as code is modified by others? Whose is it? Who is responsible for errors and updates? It could easily become distracting.

Posted by: Mark Shapiro | August 31, 2007 10:45 PM

#42
It is simply a matter of making it a requirement for publication. People find the time to handle all the other requirements of publication - I see no reason why this would be any different.

Well, first you'd have to show ... 1. Utility. Hand waving, so far (and I manage an open source project) 2. Career protection, and yes, this is very important in a world where tenure, or pre-tenure hiring at top universities, is competitive-based. You can say "science shouldn't be like this" or - as is hinted above "scientists shouldn't care (i.e. scientists shouldn't try to get the best job at the best $$$ they can)". Change the structure of science hire/fire tenure/non-tenure policies, then maybe individual scientists will work as you think they should work.

Posted by: dhogaza | September 1, 2007 12:11 AM

#43

Well, first you'd have to show ... 1. Utility.

The prime utility, as I have stated several times, is removing ambiguity regarding what exactly is going on. As I have stated several times, I have seen many papers where the description of the procedure is far too short on details to remove ambiguities on several significant issues. Additional scrutiny of the code for bugs is a secondary benefit.

Hand waving, so far

Is this a way to have a discussion?

#2. Career protection

I don't really see what is the problem here. Requiring to publish code is not qualitatively different than requiring disclosure of many other details of the work being published - requirements which are standard practice.

As I have stated several times, you might as well suggest that divulging proofs of theorems risks your career because it lets the competition know too much - if they want to know the proofs, they should get off their lazy behinds and figure out the proofs by themselves.

Posted by: Sortition | September 1, 2007 3:09 AM

#44

1) I thank SG for replicating the CMR estimates for L1 and showing us all how he did it. This is how science is supposed to work! Someone (like SG) who knows something explains it to someone (like me) who doesn't.

2) I thank Robert Chung for replicating the excess death estimates for L1. I think that Robert's attempts to bait me into a bet were not how a professor ought to act, but opinions may differ on that score. It was because I thought that these estimates could be replicated that I declined to be trapped. But, to learn something new, I am always ready to be ridiculed, so ridicule away.

3) But we still have a problem! No one has replicated the confidence intervals for these estimates. Can anyone do so? I do not think that it is possible with the data that the L1 authors have released, but I have been wrong before.

Posted by: David Kane | September 1, 2007 7:03 AM

#45

And just to be clear that I am not the only puzzled reader, I'll note that sensible Kevin Donoghue wrote:

I'm sure R is well worth learning. But the spreadsheet is surely the simplest tool for showing that David Kane's critique really doesn't amount to much. There is no point demanding to see the code used by researchers to obtain estimates if all the code has to do is simple arithmetic. Obviously the calculation of the bootstrapped CIs around the estimates is another matter; it would be interesting to know exactly how that was done.

Now, it is my understanding that the confidence intervals for the CMRs were not done with a bootstrap but with a normal approximation, in essence, whatever the standard STATA command spits out. I do not know if the excess death estimates involved the bootstrap. I think that they did not, that the bootstrap was only used for the relative risk confidence intervals.

Is there someone in the Deltoid community who can answer Kevin's question? He (and I!) would appreciate it.

Posted by: David Kane | September 1, 2007 7:19 AM

#46

David, it's time for you to do what you love demanding of others, and show us your code. Exactly what code did you use to calculate CMRs, which failed to replicate the results from the paper? Did you, perchance, get values of 5.3 and 13.7? If not, what did you get? You claimed above to know how to calculate the CMR (it's "trivial") so why couldn't you replicate it?

Show us the code.

Posted by: SG | September 1, 2007 7:19 AM

#47

And btw, David, your point 1) has not got anything to do with how "science is supposed to work". What has gone on here is how first year students are supposed to learn. A simple formula in a textbook, applied in a simple calculation package (or in this case, on a piece of paper), and the correct answer obtained.

You didn't even look at a textbook and now you claim that we are "all" learning something? And having been shown that everything you claim can't be replicated can be, you still insist on us proving to you that the CIs are accurate?

Posted by: SG | September 1, 2007 7:30 AM

#48

Kevin Donoghue wrote:

One attraction of using Excel rather than R is that the data is already available as a spreadsheet here.

Thanks for that reminder, Kevin. Here's something cool: cells T37 and U37 on the 'All data' sheet contain the pre- and post-invasion CMRs, and cells T36 and U36 on the 'without Falluja' sheet contain the CMRs without Falluja. As formulas, not as values, so not only can you see the values but you can also see how those values were calculated.

Hmmm. David has charged that Roberts et al. knew including Falluja would expand the CI to include zero, so they suppressed that information in their article and just focused on the "without Falluja" results.

However, in this case, David converted that spreadsheet into an R package, recommended that everyone use his package, but suppressed the one line that showed the overall CMRs. Then he insisted that "no one knows how they did it, and no one has ever been able to replicate it."

So, which is it? Is David really a dishonorable fraud who willfully made knowingly deceptive statements after manipulating the data, or is he just an incompetent braggart with Dunning-Kruger syndrome? I vote for the latter, but then I'm a generous guy.

Posted by: Robert | September 1, 2007 7:36 AM

#49

Excellent comments! I do, indeed, try to practice what I preach. Unfortunately, I am travelling right now, so the full answer will need to wait till Tuesday, but in the meantime I can offer the following.

1) You can download the latest version of my R package from here. I believe that this package includes the spreadsheet exactly as I downloaded it from Deltoid. Shame on me for not looking closely for formulas in the cells as Robert points out. The package includes both the pdf of (that version of) my paper along with the .Rnw document which produced it. This document (in Sweave format) lists every formula used. The package itself includes every function. You can replicate every detail to your heart's content.

2) But that version of the package is not the same as the one which supports either the paper as Tim so kindly posted it or the paper as I presented it at ASA or the current version of the paper. Once I get back to the office, I will immediately post a version of the .Rnw for the paper as Tim published. (I think that will be easy to do; I just hope that I have an appropriate notation in Subversion, my source control system.) If I don't have that version easily accessible, I definitely have the ASA version, which is almost the same in all respects.

3) Given a couple of days, I will get the paper and the package into a format that can be updated on CRAN. (Without the latest version of the package, it may not be easy to replicate what is going on in the .Rnw file.) This is not as easy as it sounds since the newest version of the package includes all sorts of data from Jon Pedersen, some of which I can distribute and some of which I can't. So I need to be careful about that. Perhaps Tim will even be kind enough to host a new version of the pdf (which is much cleaned up and improved after our previous endless thread on the topic).

Posted by: David Kane | September 1, 2007 8:03 AM

#50

Robert,

You wrote:

Hmmm. David has charged that Roberts et al. knew including Falluja would expand the CI to include zero, so they suppressed that information in their article and just focused on the "without Falluja" results.

However, in this case, David converted that spreadsheet into an R package, recommended that everyone use his package, but suppressed the one line that showed the overall CMRs. Then he insisted that "no one knows how they did it, and no one has ever been able to replicate it."

I have just checked the R package linked above and, indeed, I did distribute the entire Excel spreadsheet including the cells you reference. I "suppressed" nothing.

And, for the record, I believe that no one has replicated the confidence intervals for L1 and it is, obviously, those confidence intervals that are the focus of my paper.

Can you replicate those confidence intervals?

Posted by: David Kane | September 1, 2007 8:12 AM

#51

Shame on me for not looking closely for formulas in the cells as Robert points out.

hm. a pretty weak excuse. your paper uses the term CMR exactly 118 times. but you didn't look at how it's calculated in the data you examine? (again, it was LABELED CMR!)

Given a couple of days, I will get the paper and the package into a format that can be updated on CRAN.

i have some doubts that people are interested in more of your stuff. are you spreading out the news among right wing bloggers, that your results should be taken with a grain of salt? after it turned out you had some deficiencies in knowledge on the subject?

doesn't this event slightly change the approach you should take to the criticism your paper received here? by the same people who educated you on this subject now?

Posted by: sod | September 1, 2007 8:16 AM

#52

sod asks:

doesn't this event slightly change the approach you should take to the criticism your paper received here? by the same people who educated you on this subject now?

No. Interestingly enough, I think