There's No Such Thing As Perfect Genome Sequence

I recently was in a conversation with a collaborator who isn't in the genomics biz, and said collaborator remarked that there was a lot of online criticism of the quality of the genomic data that has been generated for the E. coli O104:H4 outbreak isolates. I've been following it very closely (not surprised by that, are you?), and I'm not sure what the collaborator was referring to. On some blog, in some comment, there probably is criticism, but these are the intertoobz: that sort of thing happens.

But then it dawned on me that much of what appears to be 'criticism' is probably just a realistic assessment of the quality of genomic data. To those not in the genomics biz, it probably looks pretty bad. For instance, in my first post about E. coli O104:H4, I described the possibility of errors:

In the outbreak strain, the icd allele matches icd136 exactly; however, the genome sequence lacks the last two bases. Given that the genome assembly is in over 3,000 pieces ('contigs'), I think this is missing data, not biology.

...In the outbreak strain, the recA allele differs from recA7 by one insertion. "Jan 91" has a sequence of AAAA, while the outbreak strain has a sequence of "AAAAA" (below, it's recorded as "aAAAA" to indicate the difference). With Ion Torrent (and other high throughput sequencing technologies), when you have 'runs' of the same nucleotide, such as "AAAA", it's not unusual for a base to be added or deleted, which could yield a 'false' "AAAAA." This could be sequencing error, but I can't rule out a real insertion (i.e., an extra A that's real).
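To make the homopolymer caveat in that quote concrete, here is a minimal Python sketch that checks whether two sequences differ only in the length of their same-nucleotide runs. The function names and the flanking bases in the example are made up for illustration; only the "AAAA" versus "AAAAA" run comes from the quote. A hit says the difference is consistent with this kind of sequencing error, not that it is one.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse a sequence into (base, run_length) pairs, e.g. 'AAAAG' -> [('A', 4), ('G', 1)]."""
    return [(base, len(list(run))) for base, run in groupby(seq.upper())]


def homopolymer_length_differences(reference, query):
    """Return runs whose length differs, or None if the sequences differ in any other way.

    Each returned item is (base, reference_run_length, query_run_length).
    An empty list means the two sequences are identical.
    """
    ref_runs = run_length_encode(reference)
    query_runs = run_length_encode(query)
    # If the order of bases differs once runs are collapsed, there is a
    # substitution or a more complex change, not just a run-length change.
    if [base for base, _ in ref_runs] != [base for base, _ in query_runs]:
        return None
    return [
        (base, ref_len, query_len)
        for (base, ref_len), (_, query_len) in zip(ref_runs, query_runs)
        if ref_len != query_len
    ]


# Hypothetical flanking sequence around the run described in the quote.
reference_fragment = "GCTAAAAGC"   # run of four A's
outbreak_fragment = "GCTAAAAAGC"   # run of five A's
print(homopolymer_length_differences(reference_fragment, outbreak_fragment))
```

A difference flagged this way still needs independent evidence (another isolate, another sequencing technology) before anyone calls it a real insertion.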

This isn't criticism of any scientist's skill, but simply an acknowledgement of the limitations of the data. Some types of sequencing yield certain 'stereotypical' errors. In any scientific discipline where you generate large (or humongous) datasets, there's no conceivable way to manually check all of the data for these errors--instead, you have to acknowledge the data's limitations.

Even with a finished genome (the highest quality), depending on whom you talk to and how confident (or paranoid) you are, the error rate is between 1 in 100,000 and 1 in 1,000,000 per base (the smallest subunit of DNA; e.g., "A", "T", etc.). That sounds good until you realize that a typical E. coli genome is around five million bases long. That means, in the best case scenario, if you were to sequence the same exact genome twice, your two sequences should differ by ten to 100 changes ('SNPs', short for single nucleotide polymorphisms)*. Keep in mind, this is the exact same organism.
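For anyone who wants to check that arithmetic, here is a back-of-the-envelope sketch in Python. The function and variable names are mine, and it assumes, as the footnote does, that errors in the two sequences are independent and essentially never land on the same base.

```python
def expected_differences(genome_length, bases_per_error):
    """Expected mismatches between two independent sequences of the same genome.

    bases_per_error is the error rate written as "1 error per N bases".
    Each sequence carries its own errors, so the two are expected to differ
    at roughly twice the per-sequence error count.
    """
    errors_per_sequence = genome_length / bases_per_error
    return 2 * errors_per_sequence


GENOME_LENGTH = 5_000_000  # approximate size of a typical E. coli genome

for bases_per_error in (1_000_000, 100_000):
    differences = expected_differences(GENOME_LENGTH, bases_per_error)
    print(f"1 error per {bases_per_error:,} bases: about {differences:.0f} differences")
```

The two rates reproduce the ten-to-100 range quoted above.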


Actually, not so much. We can still make sense of the data. In one comment thread I saw a couple of weeks ago (and I don't remember where it is), one commenter claimed that sequencing multiple isolates from the O104:H4 outbreak was akin to doing a survey of sequencing error. After all, we expect maybe 20 to 30 real changes at the most during the course of the outbreak, which is no more than the number of differences sequencing error alone could produce (unless, of course, this is really more than one outbreak, but even then there shouldn't be that many changes). So how do we find the signal in the noise?

That is precisely why we must sequence multiple isolates from the outbreak. If we find the same SNP in two or more genomes (e.g., most have a T at a given position but two have an A), that lends much more validity to that SNP. And if we find several SNPs traveling together in the same subset of strains, then that suggests those strains have inherited those SNPs via common descent (TEH DARWINISMZ!!)**, and that they're not errors. Admittedly, unique SNPs--those that are only seen in one strain--are hard to pin down. But these data are useful.
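Here is a toy Python sketch of that logic, under some loud assumptions: the isolates are already aligned to the same coordinates, the isolate names and sequences are invented, and "shared" simply means the minority base shows up in two or more isolates. Real pipelines also weigh read depth and base quality, which this ignores.

```python
from collections import Counter, defaultdict


def classify_snps(alignment):
    """Split variant calls into shared SNPs and singletons.

    alignment: dict of isolate name -> aligned sequence (all the same length).
    Returns (shared, singletons); each maps (position, base) -> frozenset of
    the isolates carrying that minority base. Positions are 0-based.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    shared, singletons = {}, {}
    for pos in range(length):
        column = {name: alignment[name][pos] for name in names}
        counts = Counter(column.values())
        if len(counts) < 2:
            continue  # every isolate agrees at this position
        majority_base = counts.most_common(1)[0][0]
        for base in counts:
            if base == majority_base:
                continue
            carriers = frozenset(n for n, b in column.items() if b == base)
            target = shared if len(carriers) >= 2 else singletons
            target[(pos, base)] = carriers
    return shared, singletons


def linked_groups(shared):
    """Group shared SNPs that occur in exactly the same set of isolates."""
    groups = defaultdict(list)
    for (pos, _base), carriers in shared.items():
        groups[carriers].append(pos)
    return {c: sorted(p) for c, p in groups.items() if len(p) >= 2}


# Invented toy alignment: five isolates, eight aligned positions.
isolates = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGT",
    "iso3": "ACGTACGT",
    "iso4": "AAGTACTT",
    "iso5": "AAGTACTA",
}
shared, singletons = classify_snps(isolates)
print("shared:", shared)
print("singletons:", singletons)
print("traveling together:", linked_groups(shared))
```

In this toy data, the changes at positions 1 and 6 appear in both iso4 and iso5, so they look inherited, while the change at position 7 appears only in iso5 and could be either a real mutation or an error.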

Anyway, no genome sequence, even a finished one, is perfect. But we can still do good science, even as we recognize the flaws in the data.

*The expected number of differences is double the per-genome error count because each sequence differs from the 'real' genome by the error rate, and we assume the same error doesn't occur in both sequences (if errors occur at random, this is very unlikely).

**Alternatively, you can believe in a capricious deity who likes fucking with us, especially when people are dying from infectious disease.
