Because Abi asked me to, I’m going to discuss the fascinating case of the Hellinga retractions. Since this is another case where there is a lot to talk about, I’m going to take it in two parts. In the first part, working from the Chemical & Engineering News article by Celia Henry Arnaud (May 5, 2008), I’ll focus on the common scientific activity of trying to build on a piece of published work. What happens when the real results seem not to fit with the published results? What should happen?
In part 2, drawing from the Nature news feature by Erika Check Hayden (May 15, 2008), I’ll consider this case in the context of scientific collaborations — both within research groups and between research groups. In light of the differentials in experience and power (especially between graduate students and principal investigators), who is making the crucial scientific decisions, and on the basis of what information?
But let’s start with the papers [3,4] that came out of the research group of Homme W. Hellinga, professor of biochemistry at Duke University.
In the original papers, Hellinga and coworkers claimed to use computational methods to design proteins called NovoTIMs that catalyze the same reaction catalyzed by the enzyme triosephosphate isomerase (TIM). Although the reported kinetic values for the best NovoTIM weren’t as efficient as the natural TIM enzyme, the work offered hope that scientists would someday be able to design proteins capable of catalyzing any reaction. (40)
The papers were published. And, as one would hope, the papers were read by other biochemists. (Ideally, the successful communication of results includes someone on the receiving end.)
One of the readers was biochemist John P. Richard at SUNY-Buffalo. Richard studies the natural TIM enzyme and was interested in nailing down why the designed ones displayed lower activity. Figuring this out might shed light on the sorts of relations between structure and function that keep biochemists up at night.
Richard and his coworkers used the method published by Hellinga’s group to make NovoTIM in bacteria and then purify it. But the protein they isolated had much higher kinetic values than the Hellinga team had reported for the best of its designed enzymes, Richard says. He and his coworkers suspected that this activity resulted from contamination with wild-type TIM from the bacteria used to produce NovoTIM.
When they used a different purification strategy to isolate the expressed protein, the protein they harvested showed no TIM activity. (40)
Richard’s findings suggested that Hellinga’s published findings were in error. Richard passed this information on to Hellinga, as well as to the journals that published the papers. Alerted to Richard’s results, Hellinga’s lab did the same experiments again, this time using Richard’s purification strategy, and they got the same results Richard did. Hellinga then retracted the published papers (the Science paper on Feb. 1, 2008, the Journal of Molecular Biology paper on Feb. 23, 2008).
At this point in the story, you may be thinking to yourself that this is a perfect example of science working as it should. We have scientists communicating their results, scientists taking account of the results of others, scientists trying to build further knowledge from reported results, and scientists communicating with each other when reported results don’t hold up. Published work is not standing encased in Lucite, but instead is checked, rechecked, and corrected.
As Merton might say, organized skepticism, baby!
Now, no scientist wants to publish results that, on closer examination, don’t hold up, but the default assumption is that this sort of thing is probably an honest mistake. Building new knowledge on the frontiers of what we already know, you may not know enough about the system you’re studying to have your techniques perfectly refined or your controls perfectly designed. As well, there is the eternal question of how certain you must be to publish what you’ve found. If you wait for absolute certainty, you’ll never publish, so you have to make a judgment call about the line that defines “certain enough”. Indeed, getting results out sooner rather than later means that others can use them sooner — and having other scientists in your community engaging with the same systems and problems might well speed up the process of finding problems with your initial results.
Of course, as biochemists like Richard know all too well, it costs those others (like him) significant time and resources to try to repeat published findings and discover that they are not reproducible. If you started with the published finding in order to pursue some other scientific question that assumed its reliability, you’ve also discovered that you can’t do the project you set out to do. And there is no great career payoff for demonstrating that another scientist’s published result is wrong.
Still, all research is a gamble, and you might just chalk this up to the risks inherent in participating in a self-correcting community-wide project of knowledge-building.
However, Richard and others were concerned that this expenditure of time and money had not uncovered an “honest mistake”. There were aspects of the published results that didn’t make sense in the context of the type of proteins presumed to be in the experimental system. The puzzling details were described by UC-Berkeley biochemist Jack F. Kirsch:
“The retraction only admitted to contamination with wild-type enzyme,” Kirsch tells C&EN. “That doesn’t explain the very low KM values that they reported for NovoTIM.” KM, also known as the Michaelis constant, is a reaction parameter that defines the substrate concentration at which the reaction reaches half its maximal velocity. If the wild-type enzyme is the contaminant, Kirsch points out, “it’s very hard to think of a way you could get a KM value that is much lower than that of the wild-type enzyme.”
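Kirsch’s kinetic argument can be illustrated with a small sketch (a minimal model with made-up numbers, assuming simple Michaelis-Menten behavior — none of these values come from the papers): if all the activity in a preparation comes from a trace of wild-type enzyme, the kinetics read off the data recover the wild-type KM, just with a scaled-down Vmax. Contamination can change the apparent maximal velocity, but it cannot push the apparent KM below the wild-type value.

```python
def mm_velocity(s, vmax, km):
    """Michaelis-Menten rate law: v = Vmax * [S] / (KM + [S])."""
    return vmax * s / (km + s)

# Illustrative, hypothetical parameters (not values from the papers):
WT_KM, WT_VMAX = 0.5, 100.0   # wild-type enzyme: KM in mM, Vmax in arbitrary units
FRACTION = 0.02               # small wild-type contamination in the "NovoTIM" prep

def contaminated_sample(s):
    # All observed activity comes from the wild-type contaminant;
    # the designed protein itself contributes nothing.
    return mm_velocity(s, FRACTION * WT_VMAX, WT_KM)

def apparent_km(v, s_max=1000.0, step=0.001):
    """Read off KM as the substrate concentration giving half of the plateau velocity."""
    plateau = v(s_max)            # velocity at saturating [S], approximately Vmax
    s = step
    while v(s) < plateau / 2.0:   # scan upward to the half-maximal point
        s += step
    return s

print(apparent_km(contaminated_sample))  # close to 0.5, the wild-type KM
```

However small the contaminating fraction, the half-maximal point stays at the wild-type KM; only the plateau height shrinks. A reported KM much lower than wild-type is therefore hard to explain by contamination alone, which is exactly Kirsch’s point.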
In his letter [to Science, published online March 10, 2008], Kirsch further pointed out that some of the reported results would only make sense if the design had actually succeeded. For example, Hellinga’s team reported that as each of the three critical active-site residues in the designed protein was replaced with alanine, the mutants became less active. Double mutants are less active than single mutants, and triple mutants are the least active of all. “That’s not what you would expect if there was random wild-type contamination,” Kirsch says. (40-41)
How to make sense of the incorrect data reported in the Hellinga papers? Could these data possibly have come from NovoTIM contaminated with the wild-type enzyme? That was hard to reconcile with the reported KM values. Were the experimental measurements badly controlled, or carelessly taken? Had researchers in the Hellinga group thrown out good data because they didn’t support the result they were expecting (and kept bad data because they did)?
Could the data depart from the experimental reality (as Richard and coworkers found it) because they were made up?
Compounding the mystery of what could have been going on in the Hellinga group’s experiments to produce the data reported in the two papers, University of Illinois, Urbana-Champaign biochemist John A. Gerlt, along with Richard, argued that even if the NovoTIM hadn’t been contaminated, the results the Hellinga group seems to have expected weren’t to be expected at all — at least, not on the basis of the structural features of the protein Hellinga’s group had designed.
TIM catalyzes the interconversion of dihydroxyacetone phosphate (DHAP) and D-glyceraldehyde 3-phosphate (GAP). However, the NovoTIM protein, as Hellinga’s group described it, abstracts a proton that results in the formation of L-GAP. In a letter to Hellinga, Richard explained the situation in more detail: “Any protein designed to catalyze suprafacial transfer of the pro-S hydrogen of DHAP via a cis-enediol[ate] intermediate would form L-GAP and would, therefore, not show activity using the standard enzymatic assay for isomerization of DHAP. This is because L-GAP is not a substrate for the coupling enzyme [glyceraldehyde 3-phosphate dehydrogenase] used in the assay for isomerization of DHAP.” Therefore, even if the designed protein had worked, the assays shouldn’t have given a positive response. (41)
The fit between enzyme and substrate is crucial for enzyme activity, and is sometimes described as a “hand in glove” relationship. The assays Hellinga’s papers reported using would use the “coupling enzyme” to track the production of D-GAP (as a way to measure the activity of TIM in converting DHAP to D-GAP). However, the NovoTIM would convert DHAP to L-GAP — the left hand to D-GAP’s right hand. And, the “left hand” (L-GAP) wouldn’t fit properly into the “coupling enzyme” glove that was specific for the “right hand” (D-GAP).
How, in other words, could the NovoTIM have yielded any activity at all with this assay? If the only thing in the samples with enzyme activity was the wild-type enzyme from the bacteria, why didn’t it show the characteristic TIM activity in the assay?
All of these issues seemed to deepen the mystery of the initially published and now retracted results. Why should the assays described in the papers have given the reported results? Aside from the wild-type TIM contaminating them, what else was in those samples?
Hellinga himself seems not terribly curious about the answers to these questions.
Hellinga tells C&EN that he decided not to address such questions because he believes the fact that the designed enzyme ultimately didn’t show TIM activity made many other points moot. “I didn’t see a reason to go into the design methodology because the experiment clearly didn’t work,” he says. “By inference, obviously the design was wrong.” (41)
Manifestly, there are reasons to be curious about the withdrawn results. How else could researchers — in Hellinga’s research group or in other research groups — avoid similar mistakes in future experiments? Mightn’t following up on some of these puzzles lead to the discovery of something unexpected? (Perhaps the methods described didn’t result in the synthesis of the target protein but yielded something else with interesting properties.)
Publishing your results is sharing your questions, your results, and any puzzle that may arise from them with your scientific community. Once you’ve shared them, it’s no longer just a question of what your original goals happened to be when you initiated the project. As the scientist who put those results into the conversation, you’re now on the hook to help answer the community’s questions about the results you published. This makes Hellinga’s apparent “moving on” attitude toward the legitimate puzzles arising from his results seem a little off.
That attitude cannot help but highlight another legitimate question around these results: can we trust you? Is yours a research group that runs good experiments, collects reliable data, and makes accurate reports to the rest of the scientific community? When the reports don’t hold up, will you be accountable to the community to figure out what went wrong so that we all will benefit from that information?
In part 2, when we look at the ways collaboration between scientists were at work in producing, and then toppling, these results, there will be more to say about trust and accountability.
[1] Celia Henry Arnaud, “Enzyme Design Papers Retracted,” Chemical & Engineering News, 86(18), 40-41 (May 5, 2008). All quotations in the post are from this article, with page numbers given parenthetically.
[2] Erika Check Hayden, “Designer Debacle,” Nature, 453, 275-278 (May 15, 2008).
[3] Dwyer, M.A., Looger, L.L., and Hellinga, H.W., Science, 304, 1967-1971 (2004).
[4] Allert, M., Dwyer, M.A., and Hellinga, H.W., Journal of Molecular Biology, 366, 945-953 (2007).