Previously, I introduced the idea that the $1000 genome has not been achieved because it is defined in simplistic terms that ignore many aspects of data completeness and verification. In that analysis, I cited a recent perspective by Robasky, Lewis, and Church to present concepts related to the need to verify results and the general ways in which this is done. In this and the next few posts I will dig deeper into the elements of sequence data uncertainty and discuss how results are verified.
First, we need to understand that sequence data always contain errors because the order of bases in DNA molecules is determined through indirect means. Since individual molecules are too small to be visualized directly, we need to detect their presence by creating signals that can be amplified and measured. A simple way to create these signals is to increase the number of molecules that are measured, but we can also increase the sensitivity of measurement through dyes and other means. For example, when restriction sites are mapped in DNA, the DNA molecules are digested with a restriction enzyme, the resulting fragments are separated by electrophoresis in agarose gels, and the bands are visualized by staining with ethidium bromide, which fluoresces under ultraviolet light. We can detect approximately 10 nanograms of DNA, or approximately nine billion molecules of a 1000 bp fragment, because each 1000 bp molecule can bind several hundred molecules of ethidium bromide, which fluoresces roughly 20 times brighter when bound to DNA. If the DNA molecules are labeled with radioactive chemicals, sensitivity can be increased further. This more sensitive detection method picks up the presence of uncut molecules and shows us that not everything gets cut as expected. Chemistry is never perfect. Moreover, as signals are amplified, or detection becomes more sensitive, artifacts due to contamination and statistical variations in chemical processes create noise that can mask true events.
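The nine-billion-molecule figure is easy to check with a back-of-the-envelope calculation. The sketch below uses the standard approximation of about 650 g/mol average mass per base pair of double-stranded DNA; the function name is mine, for illustration:

```python
AVOGADRO = 6.022e23  # molecules per mole
BP_MASS = 650.0      # approximate g/mol per base pair of double-stranded DNA

def molecule_count(nanograms: float, length_bp: int) -> float:
    """Approximate number of dsDNA molecules of a given length in a sample."""
    grams = nanograms * 1e-9
    molar_mass = length_bp * BP_MASS  # g/mol for the whole fragment
    return grams / molar_mass * AVOGADRO

# 10 ng of a 1000 bp fragment:
print(f"{molecule_count(10, 1000):.2e}")  # on the order of 9e9 molecules
```

So a faint, barely detectable band in an ethidium bromide-stained gel still represents billions of molecules, which is why more sensitive methods are needed to see rare species such as uncut template.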
Thus, errors are due to the non-deterministic nature of chemistry, the sensitivity of the measurement system, and the specificity of detecting signal instead of noise. In DNA sequencing, all of these factors contribute to errors. With some exceptions, current sequencing processes are based on DNA replication and have four primary components: 1) a DNA template, the molecule that will be “read,” 2) a primer, a short (20-40 bases) synthetic DNA molecule that hybridizes to a known sequence within the template to create a starting point for the new strand of DNA, 3) a polymerase enzyme that builds a complementary DNA molecule from instructions in the template, and 4) nucleotide triphosphates (the bases), the building blocks that form the new molecule. The bases also include nucleotides that are modified to create observable signals.
The above process mimics the biochemical process of DNA replication. At its finest detail, it is a series of high-speed chemical reactions that complete with varying degrees of efficiency, determined by highly local environments created by the structure of, and interactions between, the bases in the template, the polymerase, and the relative concentrations of the bases being incorporated. Fundamentally, each chemical reaction is governed by the laws of physics, which means that no reaction works with 100% certainty. Indeed, some mutations are simply due to replication errors that occur when chromosomes are duplicated. If it were not for biology’s elaborate error-correcting enzymes, the spontaneous mutation rate between parents and offspring would be much higher than is observed. Some pathogens use these replication errors to their advantage, mutating to evade immune systems and develop resistance to drugs.
From the above discussion, we can see that DNA sequencing cannot be perfect. Since sequencing reactions can only approximate biological processes, they will have greater error rates than the native system. The modified nucleotides needed to amplify signals, non-native reaction environments, and other factors all contribute to increased error rates. In some cases these errors occur randomly; in others they are systematic. That is, the local sequence of the template affects how a modified base is incorporated, or theoretical assumptions about the reaction and chemical detection method do not adequately model what happens in the real world. Thus, each kind of sequencing method can have a unique error profile, and generating highly accurate data requires that the sequencing process be deeply characterized and the results verified, either by repeating the process or by getting the same result in a different way. Finally, as in our restriction enzyme digest example, a certain number of DNA template molecules are needed to observe signal, and contaminants or biases introduced during DNA purification and amplification steps add further errors.
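A simple binomial model shows why repeating the process suppresses random errors. The sketch below is illustrative only, not any platform's actual error model: it assumes a hypothetical 1% per-base error rate, treats reads as independent, and pessimistically assumes all erroneous reads agree on the same wrong base, so a miscall requires a majority of reads to be in error:

```python
from math import comb

def miscall_probability(error_rate: float, depth: int) -> float:
    """Probability that a majority of `depth` independent reads are
    wrong at a given base, assuming all errors agree (worst case)."""
    majority = depth // 2 + 1
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate)**(depth - k)
        for k in range(majority, depth + 1)
    )

for depth in (1, 3, 5):
    print(depth, miscall_probability(0.01, depth))
# 1 read: 1e-2; 3 reads: ~3e-4; 5 reads: ~1e-5
```

The catch is the independence assumption. Systematic errors violate it: if every read of a template makes the same context-dependent mistake, no amount of depth helps, which is why confirming a result by a different method matters.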
It is important to note that these errors occur at low levels. Sequencing does work, and the large majority of the data are accurate. However, when results have potentially significant clinical implications, or suggest novel scientific insights, further verification is required. This verification can take several forms, which will be discussed over the next few posts.
Robasky, K., Lewis, N. E., & Church, G. M. (2014). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics. DOI: 10.1038/nrg3655