The exciting thing about the recent technological advances in genomics is that we have a massive amount of data. The terrifying thing about the recent technological advances in genomics is that we have a massive amount of data. A while ago, I brought this up in the context of bacterial genomics:
Most of the time, when you read articles about sequencing, they focus on the actual production of raw sequence data (i.e., ‘reads’). But that’s not the rate-limiting step. That is, we have now reached the point where working with the data we generate is far more time-consuming…
So, from a bacterial perspective, genome sequencing is really cheap and fast–in about a year, I conservatively estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000-$6,000). We are entering an era where the time and money costs won’t be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.
Well, we’ve now reached the point where human genomes–which are about 1,000 times larger than bacterial genomes–are hitting the same wall:
“There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data,” says Canadian cancer researcher John McPherson, feeling the pain of NGS [next generation sequencing] neophytes left to negotiate “a bewildering maze of base calling, alignment, assembly, and analysis tools with often incomplete documentation and no idea how to compare and validate their outputs. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 analysis price tag.”
“The cost of DNA sequencing might not matter in a few years,” says the Broad Institute’s Chad Nusbaum. “People are saying they’ll be able to sequence the human genome for $100 or less. That’s lovely, but it still could cost you $2,500 to store the data, so the cost of storage ultimately becomes the limiting factor, not the cost of sequencing. We can quibble about the dollars and cents, but you can’t argue about the trends at all.”
There are a couple of issues wrapped up here:
1) Data storage. It’s not just holding onto the finished data; it also includes the ‘working memory’ needed while processing and manipulating the data.
2) Analysis needs. You have eleventy gajillion genomes. Now what? Many of the analytical methods use ‘N-squared’ algorithms: that is, a ten-fold increase in data requires a 100-fold increase in computation. And that’s optimistic. Since I don’t see Moore’s law catching up to genomics, well, ever, barring a revolutionary breakthrough, we need to simplify and strip down a lot of analysis methods.
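To make the ‘N-squared’ point concrete, here’s a minimal Python sketch of an all-against-all comparison, which is the shape many of these analyses take. The k-mer/Jaccard distance is just an illustrative stand-in (not any particular tool’s method), and the ‘genomes’ are random toy sequences; the point is the N*(N-1)/2 pairwise comparisons, which is why ten times the genomes costs roughly a hundred times the compute.

```python
from itertools import combinations
import random

def kmer_set(sequence, k=8):
    """Collect the set of k-mers in a sequence (a crude genome 'sketch')."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def all_pairs_distances(genomes, k=8):
    """Compare every genome to every other one: N*(N-1)/2 comparisons."""
    sketches = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    return {
        (x, y): jaccard_distance(sketches[x], sketches[y])
        for x, y in combinations(sketches, 2)
    }

if __name__ == "__main__":
    random.seed(1)
    # Hypothetical toy 'genomes': random sequences standing in for real data.
    genomes = {f"genome_{i}": "".join(random.choices("ACGT", k=5000))
               for i in range(100)}
    distances = all_pairs_distances(genomes)
    # 100 genomes -> 4,950 comparisons; 1,000 genomes -> 499,500.
    print(len(distances))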
I think somebody should figure this out…