Genomics: Speed Kills, and Not in a Good Way (On SNPs and Genomic Epidemiology)

I probably shouldn't read two excellent and critical genomics posts back to back. But this post by Daniel MacArthur about the issues surrounding the Ion Torrent sequencing of a human genome got me thinking about this post by Marian Turner about the pettiness surrounding publication of the outbreak strain genomes (italics mine):

The collaborative atmosphere that surrounded the public release of genome sequences in the early weeks of this year's European Escherichia coli outbreak has turned into a race for peer-reviewed publication.

A paper published in PLoS One today, by Dag Harmsen from the University of Münster, Germany, and his colleagues, contains the first comparative analysis of the sequence of this year's E. coli outbreak strain (called LB226692 in the publication) and a German isolate from 2001 (called 01-09591), which was held by the E. coli reference laboratory at the University of Münster, headed by Helge Karch. The scientists also compared the two strains with the publicly available genome of strain 55989, isolated in central Africa in the 1990s.

The LB226692 and 01-09591 genomes were sequenced using an Ion Torrent PGM sequencer from Life Technologies of Carlsbad, California (see 'Chip chips away at the cost of a genome'). The authors say that their publication is the first example of next-generation, whole-genome sequencing being used for real-time outbreak analysis. "This represents the birth of a new discipline -- prospective genomics epidemiology," says Harmsen. He predicts that this method will rapidly become routine public-health practice for outbreak surveillance.

I want to leave the whole issue of public domain analysis and the race to publication for another post, and focus on outbreak surveillance (tomorrow, I'll put on my ranty pants). I agree that a technology like Ion Torrent can provide a rapid overview of what the outbreak organism is. After all, when I argued that we had seen something very similar to this organism before (despite WHO and others' claims to the contrary), I used Ion Torrent data. But there were problems with that dataset that I identified at the time while looking at less than 0.1% of the data (~3,500 bp out of more than 5 million bases).

So to make that claim, you have to define genomic epidemiology as "really fast sequencing." But that's just, well, really fast genome sequencing. (And assembly, annotation, and comparing two lists of genes). To me, genomic epidemiology means being able to differentiate very closely related strains in order to track the spread of the outbreak. And that's a very different beast. To do that, we would typically* need to identify SNPs--single nucleotide polymorphisms, which are changes in the smallest units of DNA. This is very hard, and to date, I haven't read any papers related to the O104:H4 outbreak where that's been done.
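
To make "identify SNPs" concrete, here's a minimal sketch of what that comparison boils down to; the function name and the toy sequences are mine, purely for illustration (real analyses work from read alignments against a reference, not two tidy pre-aligned strings):

```python
# A toy SNP caller: given two sequences already aligned to the same length,
# report every position where they disagree.
def naive_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch between two aligned sequences."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned to the same length"
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# Two made-up, nearly identical "genomes" differing at a single position:
strain_1 = "ATGGCGTACGTTAGC"
strain_2 = "ATGGCGTACATTAGC"
print(naive_snps(strain_1, strain_2))   # [(9, 'G', 'A')]
```

The catch is that a naive comparison like this calls every difference a SNP, sequencing and assembly errors included, which is where verification comes in.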

By "been done", I mean that these SNPs have been verified (ideally, through manual sequencing)**. Because if you don't do that, you'll have problems:

Even with a finished genome (the highest quality), depending on whom you talk to and how paranoid (or confident) you are, the error rate is between 1 in 100,000 and 1 in 1,000,000 per base (the smallest subunit of DNA; e.g., "A", "T", etc.). That sounds good until you realize that a typical E. coli genome is around five million bases long. That means that if you were to sequence the same exact genome twice, your two sequences should differ by ten (in the best case) to 100 changes ('SNPs', short for single nucleotide polymorphism). Keep in mind, this is the exact same organism.
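
The arithmetic behind that "ten to 100" is quick to check; here's the back-of-the-envelope version, using only the numbers quoted above (nothing from the actual outbreak data):

```python
# Back-of-the-envelope using the quoted numbers: a ~5 Mb E. coli genome and a
# residual per-base error rate of 1 in 1,000,000 (best case) to 1 in 100,000.
genome_size = 5_000_000

for error_rate in (1e-6, 1e-5):
    errors_per_genome = genome_size * error_rate
    # Comparing two independently determined sequences of the same organism
    # roughly doubles the count, since each copy contributes its own errors.
    spurious_differences = 2 * errors_per_genome
    print(f"error rate {error_rate:.0e}: ~{errors_per_genome:.0f} errors per genome, "
          f"~{spurious_differences:.0f} spurious 'SNPs' between two copies")
```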

Ewan Birney describes a host of related issues (italics mine):

1. No large scale dataset is perfect - and many are at some distance from perfection. This is true of everything from assemblies through to gene sets to transcription factor binding sites to phenotypes generated in cohorts. For people who come from a more focused, single area of biology, where in a single experiment you can have mastery of every component if desired, this ends up being a bit weird - whenever you dig into something you'll find some miscalls, errors or general weirdness. Welcome to large scale biology.

2. When you do predictions, favour specificity over sensitivity. Most problems in genomics are bounded by the genome/gene set/protein products, so the totally wide "capture everything" experiment (usually called "a genome-wide approach") has become routine. It is rarely the case (though not never) that one wants a high sensitivity set which is not genome-wide. This means for prediction methods you want to focus on specificity (i.e., driving down your error rate) as long as one is generating a reasonable number of predictions (>1,000, say) and, of course, cross-validating your method.

3. When you compare experiments you are comparing the combination of the experiment and the processing. If the processing was done in separate groups, in particular with complex scripting or filtering, expect many differences to be solely due to the processing.

4. Interesting biology is confounded with artefacts (1). Interesting biology is inevitably about things which are outliers or form a separate cluster in some analysis. So are artefacts in the experimental process or bioinformatics process - everything from biases towards the reference genome, to correlations of signals actually being driven by a small set of sites.

5. Interesting biology is confounded with artefacts (2). There is a subset of the above which is so common as to be worth noting separately. When you have an error rate - and everything has an error rate due to point 1 - the errors are either correlated with the biological classification (see point 2) or uniform. Even when they are uniform, you still get misled, because often you want to look at things which are rare - for example, homozygous stop codons in a whole genome sequencing run, or lack of orthologs between species. The fact that the biological phenomenon you are looking for is rare means that you enrich for errors.

6. Interesting biology is usually hard to model as a formal data structure and one has to make some compromises just to make things systematic.
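
Point 5, in particular, is worth a toy calculation. The numbers below are entirely mine and purely illustrative, not taken from any outbreak dataset, but they show why a uniform error rate still swamps you once the thing you're screening for is rare, and why point 2 pushes toward specificity:

```python
# Entirely made-up numbers, just to show the shape of the problem: a uniform
# error rate applied to a whole genome versus a handful of genuinely rare sites.
genome_size = 5_000_000
error_rate  = 1e-4      # assumed per-base error in an unverified draft assembly
true_sites  = 20        # assumed number of genuinely interesting (rare) sites
mimic_rate  = 0.1       # assume 10% of errors happen to look like the signal

false_sites = genome_size * error_rate * mimic_rate   # 50 spurious calls
precision   = true_sites / (true_sites + false_sites)
print(f"{false_sites:.0f} false vs {true_sites} true calls -> "
      f"only {precision:.0%} of the 'interesting' list is real")
```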

The point is, if we want to track this outbreak, we need verified SNPs. We also need to control for different centers' assembly processes as part of that verification, if we are to use those data*** (as I noted above, short of 'finishing' the genomes, which would take weeks or months per genome, every center will be slightly 'wrong').

Finally, we need more strains. I'm not sure how two or four strains tell us anything about the spread of the disease (pretty sure they don't).
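
Once you do have verified SNPs from a decent number of isolates, the epidemiology starts with something as mundane as pairwise comparisons. Here's a toy sketch, with made-up ten-base "genomes" standing in for SNP profiles (none of the names or sequences correspond to real isolates):

```python
from itertools import combinations

# Hypothetical isolates; the sequences stand in for verified SNP profiles.
isolates = {
    "patient_1": "ACGTACGTAA",
    "patient_2": "ACGTACGTAG",
    "patient_3": "ACGTTCGTAG",
    "reference": "ACGAACGTCC",
}

def snp_distance(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

for name_a, name_b in combinations(isolates, 2):
    print(f"{name_a} vs {name_b}: {snp_distance(isolates[name_a], isolates[name_b])} SNPs")
```

With two or four genomes there's essentially nothing to compare; with dozens of isolates sampled over time and place, those distances can start to say something about transmission.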

So to date, I don't think we've done genomic epidemiology of this outbreak. We have done rapid sequencing--genomic identification (and who should get the credit for that will be discussed tomorrow. RANTY PANTS!). That rapid sequencing has been a critical public health response, but I wouldn't call it genomic epidemiology.

*You might get lucky and find some large scale change, such as gain or loss of genes, acquisition of a phage or plasmid, and so on. Usually, the biology doesn't cooperate.

**You could throw another sequencing technology at it too, and keep only the SNPs called by both platforms (the intersection of the two call sets, rather than their union). This gets to the sensitivity vs. specificity part of Birney's discussion of genomic data. Of course, all high-throughput sequencing technologies have their faults...
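
To make that trade-off concrete, here's a toy example with two invented call sets (positions only; these aren't real outputs from any platform):

```python
# Invented SNP positions from two hypothetical sequencing platforms;
# neither set is real data from the outbreak genomes.
tech_a = {1204, 55310, 128877, 990421, 3100578}
tech_b = {1204, 55310, 128877, 2046113}

union        = tech_a | tech_b   # everything either platform called: sensitive, inherits both error sets
intersection = tech_a & tech_b   # only calls both platforms agree on: specific, may drop real SNPs

print(f"union ({len(union)} calls):        {sorted(union)}")
print(f"intersection ({len(intersection)} calls): {sorted(intersection)}")
```

The union maximizes sensitivity but carries both platforms' errors; the intersection gives up some real SNPs in exchange for specificity, which is the point when the goal is verification.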

***The projects have released their data, so this can be done, at least on the informatics/assembly side.


So has the potential for genomics in epidemiology been oversold? The sheer gee-whiz factor of the technology is undeniable, but it still came down to old-fashioned shoe leather to trace the contamination back to its source. At the end of the day, it usually comes down to misplaced feces, where it's occurring, and how to prevent it, regardless of what strains may be present. It's also important to keep in mind that tort liability exposure has a substantial impact on industry food-safety efforts, and that the extent of that exposure depends critically on the weight of evidence required to establish a link in court between human and food isolates.

I don't think the potential for genomics in epidemiology has been oversold. But there is a key point that's important to recognize, and that I think sometimes gets lost in the era of fast sequencing, administrative claims databases, Google Correlate, and the like:

- Data is not science

Data is a component of science, but a sequence, or even the comparison of two sequences, is no more sound, grounded, and well-produced epidemiology than an unanalyzed cohort dataset or a raw time series.