Previously, I introduced the idea that the $1000 genome has not been achieved because it is defined in simplistic terms that ignore many aspects of data completeness and verification. In that analysis, I cited a recent perspective by Robasky, Lewis, and Church (2014) to present concepts related to the need to verify results and the general ways in which this is done. In this and the next few posts, I will dig deeper into the elements of sequence data uncertainty and discuss how results are verified.
First, we need to understand that sequence data always contain errors because the order of bases in DNA molecules is determined through indirect means. Since molecules are too small to be easily visualized in direct ways, we need to detect their presence by creating signals that can be amplified and measured. A simple way to create these signals is to increase the number of molecules that are measured. But we can also increase the sensitivity of measurement through dyes and other means. For example, when restriction sites are mapped in DNA, the DNA molecules are digested with a restriction enzyme, and the resulting fragments are separated by electrophoresis in agarose gels and stained with ethidium bromide so the bands can be visualized by fluorescence under ultraviolet light. We can detect approximately 10 nanograms of DNA, or approximately nine billion molecules of a 1000 bp fragment, because each 1000 bp molecule can bind several hundred molecules of ethidium bromide, which fluoresces 20 times brighter when it is bound to DNA. If the DNA molecules are labeled with radioactive chemicals, sensitivity can be increased further. This more sensitive detection method picks up the presence of uncut molecules and shows us that not everything gets cut as expected. Chemistry is never perfect. Moreover, as signals are amplified or detection becomes more sensitive, artifacts due to contamination and statistical variations in chemical processes create noise that can mask true events.
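The nine billion figure can be checked with a short calculation. The sketch below assumes an average mass of about 660 g/mol per base pair of double-stranded DNA, a common approximation that is not stated in the text above; the function and constant names are only illustrative.

```python
AVOGADRO = 6.022e23           # molecules per mole
GRAMS_PER_MOL_PER_BP = 660.0  # ASSUMPTION: average mass of one base pair of double-stranded DNA


def molecules_in_sample(mass_ng: float, length_bp: int) -> float:
    """Estimate how many double-stranded DNA fragments of a given length are in a sample."""
    mass_g = mass_ng * 1e-9                          # nanograms to grams
    molar_mass = length_bp * GRAMS_PER_MOL_PER_BP    # grams per mole of fragment
    return mass_g / molar_mass * AVOGADRO


n = molecules_in_sample(mass_ng=10, length_bp=1000)
print(f"{n:.2e} molecules")
```

Under that assumption, 10 ng of a 1000 bp fragment works out to roughly nine billion molecules, which matches the figure in the example.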
Thus, errors are due to the non-deterministic nature of chemistry, the sensitivity of the measurement system, and the specificity of detecting signal instead of noise. In DNA sequencing, all of these factors contribute to errors. With some exceptions, the current sequencing processes are based on DNA replication and have four primary components: 1) a DNA template, the molecule that will be “read,” 2) a primer, a short (20-40 bases) synthetic DNA molecule that hybridizes to a known sequence within the template to create a starting point for the new strand of DNA, 3) a polymerase enzyme to build a complementary DNA molecule from instructions in the template, and 4) nucleotide triphosphates (the bases), the building blocks that form the new molecule. The bases also include nucleotides that are modified to create observable signals.
The above process mimics the biochemical process of DNA replication. At its finest level of detail, it is a series of high-speed chemical reactions that complete with varying degrees of efficiency, as determined by highly local environments created by the structure of, and interactions between, the bases in the template, the polymerase, and the relative concentrations of the bases being incorporated. Fundamentally, each chemical reaction is governed by the laws of physics. That means no reaction can work with 100% certainty. Indeed, some mutations are simply due to replication errors that occur when chromosomes are duplicated. If it were not for biology's elaborate error-correcting enzymes, the spontaneous mutation rate between parents and offspring would be much higher than is observed. Some pathogens are able to use these replication errors to their advantage by mutating to evade immune systems and develop resistance to drugs.
From the above discussion, we can see that DNA sequencing cannot be perfect. Since sequencing reactions can only approximate biological processes, they will have greater error rates than the native system. The modified nucleotides needed to amplify signals, non-native reaction environments, and other factors contribute to increased error rates. In some cases these errors occur randomly, and other times they are systematic. That is, the local sequence of the template affects how a modified base is incorporated, or theoretical assumptions about the reaction and chemical detection method do not adequately model what happens in the real world. Thus, each kind of sequencing method can have a unique error profile. Generating highly accurate data requires that the sequencing process be deeply characterized and that the results be verified, either by repeating the process or by getting the same result in a different way. Finally, as in our restriction enzyme digest example, a certain number of DNA template molecules are needed to observe signal. Contaminants and mistakes introduced during DNA purification or amplification steps can add further errors.
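To make the difference between random and systematic errors concrete, here is a minimal sketch of why repeating a measurement helps with one but not the other. Everything in it is an assumption for illustration: the error rates are made-up values, reads are treated as independent, the result is decided by a simple majority vote over an odd number of reads, and the function names are hypothetical. It also pessimistically assumes that all wrong reads agree on the same wrong base.

```python
from math import comb


def consensus_error(p: float, n: int) -> float:
    """Probability that a majority of n independent reads are wrong at one position.

    p is the per-read random error rate (hypothetical value); n should be odd.
    """
    k_min = n // 2 + 1  # smallest number of wrong reads that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def error_with_systematic(p: float, s: float, n: int) -> float:
    """Add a systematic error that occurs with probability s and affects every read alike."""
    return s + (1 - s) * consensus_error(p, n)


# ILLUSTRATIVE values only: 1% random error per read, 0.1% systematic error.
for n in (1, 3, 5):
    print(n, consensus_error(0.01, n), error_with_systematic(0.01, 0.001, n))
```

In this toy model the random component shrinks quickly as reads are added, while the total never drops below the systematic rate. That floor is the reason a result sometimes has to be confirmed in a different way rather than simply repeated.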
It is important to note that errors occur at low levels. Sequencing does work, and the large majority of the data are accurate. However, when the results have potentially significant clinical implications, or suggest novel scientific insights, further verification is required. This verification can take several forms and those will be discussed over the next few posts.
Robasky, K., Lewis, N. E., & Church, G. M. (2014). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics. DOI: 10.1038/nrg3655