There's No Such Thing As a Perfect Genome Sequence

By mikethemadbiologist on July 11, 2011.

I recently was in a conversation with a collaborator who isn't in the genomics biz, and said collaborator remarked that there was a lot of online criticism of the quality of the genomic data that has been generated for the E. coli O104:H4 outbreak isolates. I've been following it very closely (not surprised by that, are you?), and I'm not sure what the collaborator was referring to. On some blog, in some comment, there probably is criticism, but these are the intertoobz: that sort of thing happens.

But then it dawned on me that much of what appears to be 'criticism' is probably just a realistic assessment of the quality of genomic data. To those not in the genomics biz, it probably looks pretty bad. For instance, in my first post about E. coli O104:H4, I described the possibility of errors:

In the outbreak strain, the icd allele matches icd136 exactly; however, the genome sequence lacks the last two bases. Given that the genome assembly is in over 3,000 pieces ('contigs'), I think this is missing data, not biology.

...In the outbreak strain, the recA allele differs from recA7 by one insertion. "Jan 91" has a sequence of AAAA, while the outbreak strain has a sequence of "AAAAA" (below, it's recorded as "aAAAA" to indicate the difference). With Ion Torrent (and other high throughput sequencing technologies), when you have 'runs' of the same nucleotide, such as "AAAA", it's not unusual for a base to be added or deleted, which could yield a 'false' "AAAAA." This could be sequencing error, but I can't rule out a real insertion (i.e., an extra A that's real).

This isn't criticism of any scientist's skill, but simply an acknowledgement of the limitations of the data. Some types of sequencing yield certain 'stereotypical' errors. In any scientific discipline where you generate large (or humongous) datasets, there's no conceivable way to manually check all of the data for these errors--instead, you have to acknowledge the data's limitations.

Even with a finished genome (the highest quality), depending on whom you talk to and how ~~paranoid~~ confident you are, the error rate is between 1 in 100,000 and 1 in 1,000,000 per base (the smallest subunit of DNA; e.g., "A", "T", etc.). That sounds good until you realize that a typical E. coli genome is around five million bases long. That means that even in the best case scenario (a finished genome), if you were to sequence the same exact genome twice, your two sequences should differ by ten to 100 changes ('SNPs', short for single nucleotide polymorphisms)*. Keep in mind, this is the exact same organism.
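To make that arithmetic concrete, here is a minimal Python sketch. It assumes a round five-million-base genome and errors that land independently in each of the two sequences; the function name and constants are mine, purely for illustration.

```python
GENOME_LENGTH = 5_000_000  # approximate E. coli genome size in bases (assumed round number)


def expected_differences(genome_length, bases_per_error):
    """Expected differences between two sequences of the same genome.

    bases_per_error: one error is expected every this many bases.
    Each sequence carries its own errors, so the two sequences differ
    from each other by roughly twice the per-sequence error count.
    """
    errors_per_sequence = genome_length / bases_per_error
    return 2 * errors_per_sequence


# The two ends of the quoted range: 1 error in 1,000,000 and 1 in 100,000 bases.
for bases_per_error in (1_000_000, 100_000):
    diffs = expected_differences(GENOME_LENGTH, bases_per_error)
    print(f"1 error per {bases_per_error:,} bases -> about {diffs:.0f} differences")
```

The factor of two is the same doubling explained in the first footnote.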

ZOMG!!! TEH GENOMZ R FALSE?!?

Actually, not so much. We can still make sense of the data. In one comment thread I saw a couple of weeks ago (and I don't remember where it is), one commenter claimed that sequencing multiple isolates from the O104:H4 outbreak was akin to doing a survey of sequencing error. After all, we expect maybe 20 to 30 changes at the most during the course of the outbreak, which is fewer than the number of differences we'd expect from sequencing error alone (unless, of course, this is really more than one outbreak, but even then there shouldn't be that many changes). So how do we find the signal in the noise?

That is precisely why we must sequence multiple isolates from the outbreak. If we find the same SNP in two or more genomes (e.g., most have a T at a given position but two have an A), that lends much more validity to that SNP. And if we find several SNPs traveling together in the same subset of strains, then that suggests those strains have inherited those SNPs via common descent (TEH DARWINISMZ!!)**, and that they're not errors. Admittedly, unique SNPs--those that are only seen in one strain--are hard to pin down. But these data are useful.
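Here is a rough Python sketch of that logic, not what any sequencing group actually runs. It assumes the isolate genomes are already aligned to the same coordinates (a big assumption for draft assemblies), and the sequences, isolate names, and function names are all made up for illustration.

```python
from collections import Counter, defaultdict


def classify_snps(aligned):
    """Split variant positions into shared (2+ isolates) and unique (1 isolate).

    aligned: dict mapping isolate name -> sequence; all sequences the same length.
    Returns two dicts mapping position -> frozenset of isolates carrying the
    minority base at that position.
    """
    names = sorted(aligned)
    length = len(aligned[names[0]])
    shared, unique = {}, {}
    for pos in range(length):
        column = {name: aligned[name][pos] for name in names}
        counts = Counter(column.values())
        if len(counts) == 1:
            continue  # every isolate agrees here: not a SNP
        majority_base = counts.most_common(1)[0][0]
        carriers = frozenset(n for n, base in column.items() if base != majority_base)
        if len(carriers) >= 2:
            shared[pos] = carriers
        else:
            unique[pos] = carriers
    return shared, unique


def group_by_carriers(snps):
    """Group SNP positions that occur in exactly the same set of isolates."""
    groups = defaultdict(list)
    for pos, carriers in sorted(snps.items()):
        groups[carriers].append(pos)
    return dict(groups)


# Hypothetical toy alignment: five isolates, ten bases each.
toy = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAC",
    "iso3": "ACGTACGTAC",
    "iso4": "ATGTACCTAC",
    "iso5": "ATGTACCTAA",
}

shared, unique = classify_snps(toy)
print("SNPs traveling together:", group_by_carriers(shared))
print("Unique SNPs (harder to trust):", unique)
```

In the toy data, positions 1 and 6 (counting from zero) turn up together in iso4 and iso5, which is the pattern common descent predicts, while position 9 appears only in iso5 and could just as easily be a sequencing error.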

Anyway, no genome sequence, even a finished one, is perfect. But we can still do good science, even as we recognize the flaws in the data.

*The expected number of differences between the two sequences is double the expected number of errors per sequence: each sequence differs from the 'real' genome by its own set of errors, and we assume the same error doesn't occur in both sequences (if errors occur at random, this is very unlikely).

**Alternatively, you can believe in a capricious deity who likes fucking with us, especially when people are dying from infectious disease.
