Could you repeat that? Please speak louder.

In our series on why $1000 genomes cost $2000, I raised the issue that the $1000 genome is a value based on simplistic calculations that do not account for the costs of confirming the results. Next, I discussed how errors are a natural consequence of the many processing steps required to sequence DNA and why results need to be verified. In this and follow-on posts, I will discuss the four ways (oversampling, technical replicates, biological replicates, and cross-platform replicates) that results can be verified, as recommended by Robasky et al. [1].

The game Telephone teaches us how a message changes as it is quietly passed from one person to another. In Telephone, mistakes in messages come from poor hearing, memory failures, and misunderstanding. By analogy, mistakes in DNA sequence results come from weak signals, artifacts introduced through the many preparation steps, and data processing. In Telephone terms, the four verification methods for identifying and correcting errors can be expressed as:

  1. Oversampling: “please speak louder”
  2. Technical replicates: “please repeat that”
  3. Biological replicates: “please repeat that again”
  4. Cross-platform replicates: “get a second opinion”

This post will focus on oversampling.

Oversampling, just as it sounds, is the process of collecting data from a single sample such that the total amount of data is greater than the amount needed to measure the sample once; in other words, we collect redundant measurements so that the signal outweighs the noise. In DNA sequencing, the unit of data is a base (A, C, G, or T). The size of a sample is the total number of bases in its genome or in the regions being sequenced. For example, the human genome has approximately six billion base pairs (6 Gbp) in 46 chromosomes (22 paired autosomes plus the sex chromosomes). However, the haploid genome, 3 Gbp, is often used as the measure of genome length. An accepted practice for sequencing human genomes is to cover the genome 30 times (30x), so at least 90 Gb of data are needed to achieve this coverage. The number of individual sequence reads required is the total number of bases divided by the read length. Presently, Illumina reads of 150 bases are the most common, so in this example 600 million reads would be needed to achieve 30x coverage.
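As a rough sanity check, the arithmetic in the preceding paragraph can be written out in a few lines of Python. This is only a sketch using the example numbers above; actual read lengths and coverage targets vary by project and platform.

```python
# Coverage arithmetic from the example above (all values are the
# illustrative numbers used in the text, not project requirements).
genome_length = 3_000_000_000   # haploid human genome, ~3 Gbp
target_coverage = 30            # 30x oversampling
read_length = 150               # typical Illumina read length, in bases

total_bases_needed = genome_length * target_coverage   # 90 billion bases (90 Gb)
reads_needed = total_bases_needed // read_length       # 600 million reads

print(f"Bases needed: {total_bases_needed:,}")
print(f"Reads needed at {read_length} bases/read: {reads_needed:,}")
```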

Another term for coverage is depth. In order to detect variation in DNA sequences, the reads must be aligned to a reference sequence. Visually, the aligned reads are presented as overlapping rectangles or, when viewed at the sequence level, as individual lines. Reads mapping to common areas are stacked on top of one another; hence, each point of the genome has a depth of coverage equal to the number of aligned bases that cover that point.

A visual representation of randomly distributed reads aligned to a reference sequence. Each rectangle is one read. A variant base is shown in blue. The bases highlighted in red are most likely errors. For additional technical discussion, see: http://gatkforums.broadinstitute.org/discussion/2541/screenshot-info-snp-visible-in-igv-but-is-not-called-by-unifiedgenotyper. The full-size image can be viewed at http://postimg.org/image/pxdxoyl5d/full

The basics of oversampling are simple, but how does it help verify results and reduce error? If each position within a read has some chance that the base determined for that position is an error, and the errors occur randomly, then additional measurements will ensure that most of the data have the correct base. To illustrate, if we collect 10 reads covering a position and the random error rate is 10%, then, on average, nine of the ten bases observed at that position will be correct.
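A quick simulation makes this concrete. The sketch below assumes errors are independent, uniformly random substitutions (real instruments are not this well behaved) and takes a simple majority vote across 10 observations of each position:

```python
import random

random.seed(42)
BASES = "ACGT"
ERROR_RATE = 0.10    # assumed random per-base error rate
DEPTH = 10           # 10 reads covering each position
POSITIONS = 100_000  # number of positions to simulate

correct = 0
for _ in range(POSITIONS):
    true_base = random.choice(BASES)
    # Simulate DEPTH independent observations of this position.
    observed = []
    for _ in range(DEPTH):
        if random.random() < ERROR_RATE:
            observed.append(random.choice([b for b in BASES if b != true_base]))
        else:
            observed.append(true_base)
    # Consensus call: the most frequently observed base.
    consensus = max(set(observed), key=observed.count)
    correct += (consensus == true_base)

print(f"Consensus accuracy at {DEPTH}x with {ERROR_RATE:.0%} errors: {correct / POSITIONS:.4%}")
```

Under these idealized assumptions, nearly every consensus call is correct; the rest of this post is about why real errors are not this well behaved.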

If it seems that 10x coverage, or possibly even less, should be sufficient, why is 30x a common standard? The answer is that DNA sequencing errors are not distributed uniformly over the length of a read. The ends of a read tend to have more errors (lower quality) and the middles of reads have fewer errors (higher quality).

The fewest errors will occur if a genome is oversampled by the middle portions of reads. Achieving this goal requires that each read is obtained from a different place within the genome. Using the above example, a 10x oversampling would start a new 150-base read every 15 bases if the reads were spaced evenly. After the first 10 reads, the positions they share would be covered 10 times, and, continuing the even spacing, we would have a uniform 10x depth across the entire genome. That is, if we could evenly sample the genome every 15 bases; once again, statistics and the laws of physics make sure this won't happen.
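The even-spacing thought experiment can also be sketched in code. The toy genome length below is an assumption chosen only to keep the example small; the read length and spacing are the values from the paragraph above.

```python
# Evenly spaced 150-base reads starting every 15 bases give every interior
# position a depth of exactly 150 / 15 = 10 (toy genome length is arbitrary).
genome_length = 3_000
read_length = 150
spacing = 15

depth = [0] * genome_length
for start in range(0, genome_length - read_length + 1, spacing):
    for position in range(start, start + read_length):
        depth[position] += 1

# Skip the ramp-up and ramp-down at the ends of the toy genome.
interior = depth[read_length : genome_length - read_length]
print(f"Interior depth ranges from {min(interior)} to {max(interior)}")  # 10 to 10
```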

In random sampling, events occur with different frequencies. In 1988, Eric Lander and Michael Waterman developed models for mapping DNA based on random sampling. Briefly, if a 500 million base pair genome is sequenced to 10x coverage, roughly 20,000 bases will be missed. For a 3 Gbp genome, even more bases will be missed. And, when regions of lower coverage are factored in (to achieve uniform oversampling), a greater total depth of coverage is needed. Of course, these numbers are based on mathematical models. When biases related to DNA fragmentation, PCR, or cloning are considered, coverage needs to be increased further. Thus 30x coverage is an accepted standard, but it is not a standard that everyone agrees on, for the reasons stated and for other nuances related to DNA structure.
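Those expectations come from a simple result of the Lander-Waterman model: under purely random sampling, the fraction of the genome left with zero coverage is approximately e^(-c), where c is the average coverage. A minimal sketch, using the genome sizes from the examples above:

```python
import math

def expected_missed_bases(genome_length, coverage):
    """Expected number of bases with zero coverage under the
    idealized Lander-Waterman random-sampling model."""
    return genome_length * math.exp(-coverage)

for genome_length in (500_000_000, 3_000_000_000):
    for coverage in (10, 30):
        missed = expected_missed_bases(genome_length, coverage)
        print(f"{genome_length / 1e9:.1f} Gbp at {coverage}x coverage: "
              f"~{missed:,.0f} bases expected to be missed")
```

At 10x this reproduces the roughly 20,000 missed bases quoted above for a 500 Mbp genome; at 30x the idealized model predicts essentially no missed bases, which underscores that the practical case for 30x rests on the non-random biases mentioned above.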

In summary, oversampling is good for reducing errors that occur in a random fashion. However, systematic errors that result from local base composition or instrument-created artifacts will persist even when data are oversampled. Thus, other types of verification are needed.

References and further reading

[1] Robasky, K., Lewis, N. E., & Church, G. M. (2014). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics. DOI: 10.1038/nrg3655

Genome Sequencing Theory: http://en.wikipedia.org/wiki/DNA_sequencing_theory - Provides an overview of random sequencing, the Lander-Waterman model, and other references.

 

