Public Data, Publishing, Priority, and Public Health

Update/clarification: I want to clarify something critical. This is not about picking on a researcher or a country. It very well could have happened in the U.S. or anywhere else. Neither I nor you, the reader, have any idea about the internal constraints these groups experience, or what was communicated to government officials. To the extent that data sharing didn't occur due to concerns over publication, this represents an instance where the publication process--and the import attributed to it--interfered with the need for rapid release. That's the key point, not assigning blame to individuals or countries. Take the personal criticism and jingoism somewhere else.

Update II: See this comment. A draft assembly of the 2011 strain was released by the group mentioned in the Nature article. This is my error.

Yesterday, I discussed what I thought the implications of the rapid data release are for genomic epidemiology. I also promised a rant about the race to publication surrounding the O104:H4 E. coli outbreak in Germany--and who am I to disappoint? Before donning my ranty pants, however, it's worth recognizing the importance of the rapid release of both raw data and genomic assemblies, first by BGI, and then by HPA. That public release, more than anything, provided useful and timely information to scientists about this outbreak.

OK, with that out of the way (for now), I got my ranty pants on. Let's revisit Marian Turner's Nature news article (italics mine):

The collaborative atmosphere that surrounded the public release of genome sequences in the early weeks of this year's European Escherichia coli outbreak has turned into a race for peer-reviewed publication....

The LB226692 and 01-09591 genomes were sequenced using an Ion Torrent PGM sequencer from Life Technologies of Carlsbad, California.... The authors say that their publication is the first example of next-generation, whole-genome sequencing being used for real-time outbreak analysis. "This represents the birth of a new discipline -- prospective genomics epidemiology," says Harmsen. He predicts that this method will rapidly become routine public-health practice for outbreak surveillance.

But Harmsen's group was pipped to the publishing post by Rolf Daniel and his colleagues at the University of Göttingen in Germany, who published a comparison of the sequence of two isolates from the outbreak with the 55989 strain in Archives of Microbiology on 28 June. Harmsen says that this competition is why his group did not release the 2001 strain sequence before today's PLoS One publication.

Both groups say that their genomic sequencing and analysis were conducted independently. But their findings don't really differ from sequence analyses that other scientists were simultaneously documenting in the public domain, following the release, on 2 June, by China's BGI (formerly known as the Beijing Genomics Institute) of a full genome sequence of the outbreak strain -- also generated using Ion Torrent sequencing. These scientists say that there is very little information in either publication that was not previously available on their website. "The crowd-sourcing efforts arrived at almost all of the scientific conclusions about the strain comparisons first," says Mark Pallen from the University of Birmingham, UK, "so we're surprised and disappointed that these findings are not referred to in these papers."

Leaving aside the issues of priority and recognition, the critical thing is that these papers provided no understanding of the outbreak while it was happening. The early release of data (even before it reached the NIH/NCBI repository) by BGI and then HPA along with the assemblies did. If there are heroes in all of this, BGI and HPA are.

All of the analysis that helped us understand what this strain is took place weeks before publication. At this point, publication is just about keeping score.

(An aside: if groups are delaying publication because they're trying to provide not a rapid public health response but very high-quality data, such as improved genome assemblies or SNP verification, to inform basic research related to the outbreak, that is different. But the above quote is clear that the delay wasn't about data quality or even other issues, such as funders' stipulations.)

Claiming you're first has as much to do with how rapidly journal editors and their staffs respond, and with reviewers' requests for changes, as it does with any scientific ability: most genome centers can bang out a couple of pretty good bacterial genomes very quickly, along with assemblies and annotations, if they're so inclined. Comparing lists of genes and making some good figures isn't that hard either--hell, bloggers did that. In their own spare time. In this particular instance, claiming your group is 'first' is as ridiculous as those "baby on board" car signs, which imply that the vehicle's owners were the first people to invent screwing without using birth control. As the kids used to say, big whoop.

While publication is enshrined as the pinnacle of scientific communication (although that might be changing), in this case, it was pretty much irrelevant--and it appears to have slowed data release. Worse, the race to publication means that, rather than collaborating and standardizing the data analysis (e.g., having the same data processing), the larger scientific community will be analyzing slightly different genomes due to processing, unless someone wants to rework everything from the beginning (if that's even possible). This is very helpful [/snark].

Finally, let's look at this again:

But Harmsen's group was pipped to the publishing post by Rolf Daniel and his colleagues at the University of Göttingen in Germany, who published a comparison of the sequence of two isolates from the outbreak with the 55989 strain in Archives of Microbiology on 28 June. Harmsen says that this competition is why his group did not release the 2001 strain sequence before today's PLoS One publication.

Wow. I could get really nasty, but I'll just speculate that if I were a German citizen and had read that, I would be unhappy. With forty people dead, and the possibility of a massive lawsuit from Spain, to worry about publication? To worry about coming in second? Jeepers. But the outbreak didn't kill any of my countrymen, so I'll leave any fury to the Germans. (Update: Unfortunately, I am wrong--it did kill one U.S. citizen)

(Of course, maybe one simply believes that one's research doesn't really matter. Or something.)

The point is not to call out any one person or group, but a system that didn't work so well. Any system that encourages and fosters this behavior in the face of a public health crisis needs a serious rethink. We should be seriously reconsidering what publication means in the context of a rapidly moving health crisis--and what that tells us about our current system of scientific communication.


FYI, "In the United States, six confirmed cases of STEC O104:H4 infections have been identified. Among these six cases, one death has been reported in an Arizona resident who traveled to Germany before becoming ill."
http://www.cdc.gov/ecoli/2011/ecolio104/

Dear Mad Mike,

please know the facts and read the papers before stating uninformed opinions. We were the first to release a draft assembly to the public, early on June 3rd via NCBI (http://www.ncbi.nlm.nih.gov/nuccore/334717079). BGI was about 23h after us, but did not add more information. On June 6th BGI released Illumina single reads that had much higher coverage but, again, gave essentially not much more information. HPA released their 454 mate-pair data to the public on June 10th, and that was a major improvement. Furthermore, it was my partners (Helge Karch and Alexander Mellmann) who came up, already on May 30, with a rapid test (based on stx2, terD, rfbO104, and fliC H4) via press release and protocols distributed over the web (later published in Lancet Infect. Dis.). It was this test that helped, from a microbiology point of view, to trace the source of the outbreak, i.e. sprouts.
One final comment about collaborative computing, or crowd-sourcing as they call it. Saying everything was already known when we published is simply an exorbitance (and cannot be true, because we and others had never before released the sequence of the historic isolate). We, BGI, and HPA allowed them to "play around" by releasing our data. It took us 8 reviewer answers before we published in PLoS; that takes some time! Furthermore, who takes responsibility for the "crowd-sourcing" content? Who brings all the pieces of different quality together? What about sustainability of the data published in some blogs? The community approach was an interesting phenomenon, but many issues need to be solved before this might become a valid approach in the future! This time it contributed nothing at all to solving public health issues.

By Dag Harmsen (not verified) on 27 Jul 2011 #permalink

Dag,

You are correct and I was in error: an assembly, which I believe had 7x average coverage and was mapped against the 55989 reference, was released June 2.

I assume then, that you were misquoted or taken out of context in the Nature story?

"Any system that encourages and fosters this behavior in the face of a public health crisis needs a serious rethink."
I have nothing to add about this particular instance, but would like this thinking applied to all medical research:
Failure to cough up your data slows progress, and so is a disservice to patients.
Making me add you as a co-author to see your data is extortion.
Agreeing to such extortion is unethical, as it rewards the slime to continue in their slimy ways. So resist.