Getting an accurate genome sequence requires collecting the data at least twice, argue Robasky, Lewis, and Church in their recent opinion piece in Nature Reviews Genetics [1].
The DNA sequencing world kicked off 2014 with an audacious start. Andrew Pollack ran an article in the New York Times implying that 100,000 genomes will be the new norm in human genome sequencing projects [2]. The article focused on a collaboration between Regeneron and Geisinger Health in which they plan to sequence the exomes (the ~2% of the genome that encodes proteins and some non-coding RNA) of 100,000 individuals. In addition to this project, several others were cited in the article.
Next, at the annual JP Morgan investor conference, Illumina claimed they can achieve the $1000 genome with their new sequencing instrument, the HiSeq X Ten. Ten is the magic number because you must buy ten, at $1 million per instrument, to have the opportunity for $1000 genomes. Illumina claims their $1000 cost includes sample prep and amortization costs. The folks at the AllSeq blog estimate that the total investment is really $72 million, since it will take 72,000 genomes, collected over four years, to achieve the amortized cost of $1000 per genome.
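For the curious, that arithmetic works out roughly as follows. The throughput figure below is my own assumption for illustration; the instrument price and four-year window come from the announcements described above.

```python
# Back-of-the-envelope sketch of the HiSeq X Ten "$1000 genome" amortization.
# Values marked ASSUMED are illustrative, not official Illumina or AllSeq figures.

instruments = 10                   # minimum purchase for the X Ten
price_per_instrument = 1_000_000   # USD per instrument
instrument_outlay = instruments * price_per_instrument       # $10 million up front

genomes_per_year = 18_000          # ASSUMED throughput for ten instruments at capacity
years = 4                          # amortization window in the AllSeq estimate
total_genomes = genomes_per_year * years                      # 72,000 genomes

advertised_cost = 1_000            # USD per genome, sample prep and amortization included
total_investment = total_genomes * advertised_cost            # ~$72 million

print(f"Instrument outlay: ${instrument_outlay:,}")
print(f"Genomes needed:    {total_genomes:,}")
print(f"Total investment:  ${total_investment:,}")
```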
Unfortunately, the above estimates are based on getting data from samples that are sequenced only once. Therein lies the rub. According to Robasky and team, sequencing genomes with high accuracy requires that they be sequenced, minimally, in duplicate. While some sequencing technologies claim error rates as low as one in 10 million bases, a six-billion-base genome sequence will still contain hundreds of false positive variants at that rate, and more commonplace error rates push that number into the thousands. Several aspects of the sequencing process contribute to these errors, including purifying DNA, preparing DNA for sequencing, collecting sequence data, and comparing the resulting data to the reference sequence to identify variants (bases that differ between sample and reference).
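A quick calculation puts those numbers in perspective (the per-base error rates below are illustrative assumptions, not figures from the paper):

```python
# Expected number of erroneous base calls for a diploid human genome
# (~6 billion bases of sequence) at different per-base error rates.
genome_bases = 6_000_000_000

for label, error_rate in [
    ("best-case claim (1 in 10 million bases)", 1e-7),
    ("more typical platform (1 in 1 million bases, ASSUMED)", 1e-6),
]:
    expected = genome_bases * error_rate
    print(f"{label}: ~{expected:,.0f} erroneous calls")

# Roughly 600 at the best-case rate, and thousands at more commonplace rates.
```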
The authors then explain that some errors occur through random statistical variation (stochastic errors) while others arise from systematic biases in the different processes, and they propose that collecting data in replicate is a cost-effective way to reduce errors. Indeed, a current standard of practice is to confirm variants observed by massively parallel next-generation sequencing (NGS) by sequencing small regions containing the variant with capillary electrophoresis (Sanger sequencing). This is an expensive approach because it requires that individual regions be isolated and sequenced by more laborious methods. As NGS costs drop, however, labor-intensive confirmation methods become less attractive, and replicates become more feasible.
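The intuition behind replication is easy to see with a toy model: a stochastic miscall has to recur at the same position, with the same wrong base, in an independent replicate, while a systematic bias shows up every time. Here is a minimal sketch assuming independent replicates and a uniform error rate; it is not the thresholding method described in the paper.

```python
# Toy model: how much does requiring agreement between two independent
# replicates reduce stochastic false positives?
p = 1e-6                      # ASSUMED per-base probability of a random miscall
genome_bases = 6_000_000_000  # roughly, for a diploid human genome

single_run = genome_bases * p            # expected random miscalls in one run
# Probability that both replicates make the same wrong call at the same
# position: roughly p * (p / 3), since there are three possible wrong bases.
duplicate_concordant = genome_bases * p * (p / 3)

print(f"Single run:            ~{single_run:,.0f} stochastic false positives")
print(f"Concordant duplicates: ~{duplicate_concordant:.4f}")

# Systematic errors (e.g., library prep or alignment biases) recur in both
# replicates, so they are not removed by replication alone.
```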
The paper describes four different kinds of assay replication: read depth (oversampling), technical, biological, and cross-platform, and discusses their strengths and weaknesses in terms of error reduction and cost. The authors also describe the kinds of errors that have been observed. However, many of these observations predate current technology, and published analyses of current error sources are lacking. Some issues continue to exist, others may have been solved, and new ones are likely, so labs establishing sequencing services, especially in clinical arenas, need strategies for identifying and reducing errors in their data. Finally, the authors make an additional important point: errors related to data processing, limitations in read (collected sequence) length, and incomplete reference materials cannot be addressed by replicates alone. New technological solutions will be needed.
So, what does this mean?
First, DNA sequencing costs (as tracked by the National Human Genome Research Institute [NHGRI]), which have held constant at about $6000/genome since April 2012, may drop with Illumina's new instruments. However, obtaining a $1000 genome requires a volume-based model and a significant investment in optimization. Even then, obtaining highly accurate, high-confidence data will require replicates, perhaps several kinds, that differ in purpose and range in cost. Second, to truly understand sequence variation in the human population, very large studies, on the order of 100,000 individuals, are needed. Third, the current definition of a $1000 genome is simply the cost of collecting a human genome's worth of data, without deep data analysis. This practice results in genome sequences that are neither whole nor understood. The good news is that, despite the substantial investment (around $100 million) required, sequencing a hundred thousand genomes costs only a fraction of what the first genome did (~$2.7 billion). Perhaps large-scale exome sequencing projects [2] seeking to understand the relationship between variation in very small regions of sequence and its impact on health and disease can now consider more realistic genome-wide approaches.
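For scale, the cost comparison above works out roughly like this (the genome count and per-genome price are the figures quoted above; replicates and analysis are not included):

```python
# Rough cost of a 100,000-genome study versus the first human genome.
cost_per_genome = 1_000            # USD, the advertised HiSeq X Ten price
genomes = 100_000                  # scale of the studies discussed above
first_genome_cost = 2_700_000_000  # USD, approximate cost of the Human Genome Project

study_cost = genomes * cost_per_genome
fraction = 100 * study_cost / first_genome_cost
print(f"100,000 genomes: ~${study_cost:,} ({fraction:.1f}% of the first genome's cost)")
```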
References:
[1] Robasky, K., Lewis, N. E., & Church, G. M. (2014). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics. DOI: 10.1038/nrg3655
[2] Pollack, A. (2014, January 13). Aiming to push genomics forward in new study. The New York Times.
Thanks for this insightful analysis, Sandra! While I am a co-first author on the NRG article, I'm not the corresponding author, so I hope you will nonetheless allow me to clarify and comment:
Claims have been made about technologies that can reduce error rates to as low as one in several million bases; however, even such a low rate requires researchers to sift through hundreds (not thousands) of false positives. More commonplace technologies have published error rates that are orders of magnitude higher, yielding thousands of what I will call "sources of variation" in the over 3 billion base pairs from a typical whole human genome. These sources may be from biologically-relevant somatic variation which obfuscates the genotype, or they may be from error. Either way, if the goal is genotyping, then the analyst typically will threshold at a p-value derived from base-calling scores. The alternative thresholding mechanism offered in the paper also accounts for other sources of error that might not be quantified by base-calling scores, and it is a method that can be used for a replicate-set from any platform. The implication is that perhaps three 20x replicates are better than a single 60x run. I anticipate the day that single-cell sequencing becomes commonplace, at which point perhaps one can use replicates to distinguish between somatic variation and sequencer error, but until then, deep sequencing will remain the most common method for finding somatic variation.
Hi Kimberly, thanks for the comment and additional details.