$1000 Genomes for $2000

By finchtalk on January 27, 2014.

Getting an accurate genome sequence requires that you collect the data at least twice, argue Robasky, Lewis, and Church in their recent opinion piece in Nat. Rev. Genetics [1].

The DNA sequencing world kicked off 2014 with an audacious start. Andrew Pollack ran an article in the New York Times implying that 100,000 genomes will be the new norm in human genome sequencing projects [2]. The article focused on a collaboration between Regeneron and Geisinger Health in which they plan to sequence the exomes (the ~2% of the genome that encodes proteins and some non-coding RNA) of 100,000 individuals. In addition to this project, several others were cited in the article.

Next, at the annual JP Morgan investor conference, Illumina introduced its new sequencing instrument, the HiSeq X Ten, and claimed it can achieve the $1000 genome. Ten is the magic number because you must buy ten, at $1 million/instrument, to have the opportunity for $1000 genomes. Illumina claims their $1000 cost includes sample prep and amortization costs. The folks at the AllSeq blog estimate that the total investment is really $72 million since it will take 72,000 genomes, collected over four years, to achieve the amortized costs of $1000 per genome.
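As a rough check of the amortization arithmetic, here is a minimal sketch that uses only the figures quoted above. It assumes the AllSeq estimate refers to 72,000 genomes, which is the count at which $72 million works out to $1000 each. The variable names are illustrative.

```python
# Figures quoted in the post. The 72,000-genome count is an assumption,
# chosen because $72 million / $1000 per genome = 72,000 genomes.
instruments = 10
price_per_instrument = 1_000_000   # USD
total_investment = 72_000_000      # USD, AllSeq estimate
genomes = 72_000                   # assumed
years = 4

instrument_outlay = instruments * price_per_instrument
cost_per_genome = total_investment / genomes
genomes_per_instrument_year = genomes / (years * instruments)

print(f"Instrument outlay: ${instrument_outlay:,}")
print(f"Amortized cost per genome: ${cost_per_genome:,.0f}")
print(f"Genomes per instrument per year: {genomes_per_instrument_year:,.0f}")
```

Under these assumptions, each of the ten instruments would need to turn out genomes continuously for four years to reach the headline price.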

Unfortunately, the above estimates are based on getting data from samples that are sequenced only once. Therein lies the rub. According to Robasky and team, sequencing genomes with high accuracy requires that they be sequenced, minimally, in duplicate. While some sequencing technologies claim they can produce data with errors as low as one in 10 million bases, a six-billion-base genome sequence will still contain thousands of false positive variants. Several aspects of the sequencing process contribute to this error, including purifying DNA, preparing DNA for sequencing, collecting sequence data, and comparing the resulting data to the reference sequence to identify variants (bases that differ between sample and reference).
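The scale of the problem depends strongly on the per-base error rate. The sketch below is a back-of-the-envelope estimate only: it assumes errors are independent and occur at a uniform rate across the genome, which real data do not guarantee. The function name and the two higher error rates are hypothetical, included to show how quickly the count grows.

```python
def expected_false_positives(genome_bases, bases_per_error):
    """Expected erroneous base calls, assuming one error per
    `bases_per_error` bases, independent and uniform."""
    return genome_bases / bases_per_error

genome_bases = 6_000_000_000  # six billion bases, as quoted above

# 10 million is the best-case rate quoted above; the others are hypothetical.
for bases_per_error in (10_000_000, 1_000_000, 100_000):
    n = expected_false_positives(genome_bases, bases_per_error)
    print(f"1 error in {bases_per_error:,} bases: about {n:,.0f} false positives")
```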

The authors explain that some errors occur through random statistical variation (stochastic) while others occur because of systematic biases in the different processes, and propose that collecting data in a replicated fashion is a cost-effective way to reduce errors. Indeed, a current standard of practice is to confirm variants observed by massively parallel next generation sequencing (NGS) by sequencing small regions containing the variant using capillary electrophoresis (Sanger). This is an expensive approach because it requires individual regions be isolated and sequenced in more laborious ways. As NGS sequencing costs drop, however, labor-intensive confirmation methods become less attractive, and replicates become more feasible.
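To see why replicates help with stochastic errors in particular, consider a simplified model (an illustration, not the authors' method). Each replicate miscalls a given base independently with probability p, and a variant is accepted only if every replicate reports it. The chance that all replicates err at the same position is then at most p raised to the number of replicates. The error rate used here is hypothetical.

```python
def expected_concordant_errors(genome_bases, error_rate, replicates):
    """Upper bound on the expected number of positions where every
    replicate errs, assuming errors are independent between replicates."""
    return genome_bases * error_rate ** replicates

genome_bases = 6_000_000_000
error_rate = 1e-5  # hypothetical per-base stochastic error rate

single = expected_concordant_errors(genome_bases, error_rate, replicates=1)
duplicate = expected_concordant_errors(genome_bases, error_rate, replicates=2)

print(f"Single run: about {single:,.0f} erroneous positions")
print(f"Duplicate, requiring agreement: about {duplicate:.1f} erroneous positions")
```

The independence assumption is the key limitation: systematic errors recur in every replicate of the same kind, so requiring agreement does not remove them.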

The paper describes four different kinds of assay replication methods: read depth (oversampling), technical, biological, and cross-platform, and discusses their strengths and weaknesses in terms of error reduction and cost. The authors also describe the kinds of errors that have been observed. However, relative to technical advancements, these observations are out of date and published analyses of current error sources are lacking. Some issues continue to exist, others may have been solved, and new ones are likely, so labs establishing sequencing services, especially in clinical arenas, need to have strategies to identify and reduce errors in the data. Finally, the authors make an additional important point: errors related to data processing, limitations of read (collected sequence) length, and completeness of reference materials cannot be addressed by replicates alone. New technological solutions will be needed.

So, what does this mean?

Human sequencing costs

First, DNA sequencing costs (as tracked by the National Human Genome Research Institute [NHGRI]), which have held constant at about $6000/genome since April 2012, may drop with Illumina’s new instruments. However, obtaining a $1000 genome requires a volume-based model and a significant investment in optimization. Even then, obtaining highly accurate, high-confidence data will require replicates, perhaps several kinds, that differ in purpose and range in cost. Second, to truly understand sequence variation in the human population, very large studies, on the order of 100,000 individuals, are needed. Third, the current definition of a $1000 genome is simply the cost of collecting a human genome’s equivalent worth of data without deep data analysis. This practice results in genome sequences that are neither whole nor understood. The good news is that, despite the substantial investment (around $100 million) required, sequencing a few hundred thousand genomes is only a fraction of the cost of the first genome (~$2.7 billion). Perhaps large-scale exome sequencing projects [2] seeking to understand the relationship between very small regions of sequence variation and their impact on health and disease can now consider more realistic genome-wide approaches.
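The cost comparison above can be sketched with simple arithmetic. This is an illustration under a loud assumption: that sequencing in duplicate simply doubles the per-genome cost, ignoring any savings from shared sample handling or any extra analysis cost. All other figures come from the text.

```python
cost_per_genome = 1000             # USD, the headline figure
replicates = 2                     # assumed: duplicates double the cost
individuals = 100_000              # study size discussed above
first_genome_cost = 2_700_000_000  # USD, approximate cost of the first genome

cost_per_individual = cost_per_genome * replicates
study_cost_single = individuals * cost_per_genome
study_cost_duplicate = individuals * cost_per_individual
fraction_of_first = study_cost_duplicate / first_genome_cost

print(f"Per individual, in duplicate: ${cost_per_individual:,}")
print(f"100,000 genomes, single pass: ${study_cost_single:,}")
print(f"100,000 genomes, in duplicate: ${study_cost_duplicate:,}")
print(f"Share of first-genome cost: {fraction_of_first:.1%}")
```

Even with every genome sequenced twice, a 100,000-person study stays well under a tenth of what the first genome cost.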

References:

[1] Robasky, K., Lewis, N. E., & Church, G. M. (2014). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics DOI: 10.1038/nrg3655

[2] Pollack, A. (2014, January 13). Aiming to push genomics forward in new study. The New York Times.


Thanks for this insightful analysis, Sandra! While I am a co-first author on the NRG article, I'm not the corresponding author, so I hope you will nonetheless allow me to clarify and comment:

Claims have been made about technologies that can reduce error rates to as low as one in several million bases; however, even such a low rate requires researchers to sift through hundreds (not thousands) of false positives. More commonplace technologies have published error rates that are orders of magnitude higher, yielding thousands of what I will call "sources of variation" in the over 3 billion base pairs from a typical whole human genome. These sources may be from biologically relevant somatic variation, which obfuscates the genotype, or they may be from error. Either way, if the goal is genotyping, then the analyst typically will threshold at a p-value derived from base-calling scores. The alternative thresholding mechanism offered in the paper also accounts for other sources of error that might not be quantified by base-calling scores, and it is a method that can be used for a replicate-set from any platform. The implication is that perhaps three 20x replicates are better than a single 60x run. I anticipate the day that single-cell sequencing becomes commonplace, at which point perhaps one can use replicates to distinguish between somatic variation and sequencer error, but until then, deep sequencing will remain the most common method for finding somatic variation.

Hi Kimberly, thanks for the comment and additional details.
