The Genomics Bottleneck: It's the Computation, Stupid

The exciting thing about the recent technological advances in genomics is that we have a massive amount of data. The terrifying thing about the recent technological advances in genomics is that we have a massive amount of data. A while ago, I brought this up in the context of bacterial genomics:

Most of the time, when you read articles about sequencing, they focus on the actual production of raw sequence data (i.e., 'reads'). But that's not the rate-limiting step. That is, we have now reached the point where working with the data we generate is far more time-consuming...

So, from a bacterial perspective, genome sequencing is really cheap and fast--in about a year, I conservatively estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000-$6,000). We are entering an era where the time and money costs won't be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.

Well, we've now reached the point where human genomes--which are about 1,000 times larger than bacterial genomes--are hitting the same wall:

"There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data," says Canadian cancer research John McPherson, feeling the pain of NGS [next generation sequencing] neophytes left to negotiate "a bewildering maze of base calling, alignment, assembly, and analysis tools with often incomplete documentation and no idea how to compare and validate their outputs. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 analysis price tag."

"The cost of DNA sequencing might not matter in a few years," says the Broad Institute's Chad Nusbaum. "People are saying they'll be able to sequence the human genome for $100 or less. That's lovely, but it still could cost you $2,500 to store the data, so the cost of storage ultimately becomes the limiting factor, not the cost of sequencing. We can quibble about the dollars and cents, but you can't argue about the trends at all."

There are a couple of issues wrapped up here:

1) Data storage. It's not just holding onto the finished data; it also includes the 'working memory' needed when processing and manipulating the data.

2) Analysis needs. You have eleventy gajillion genomes. Now what? Many of the analytical methods use 'N-squared' algorithms: that is, a ten-fold increase in data requires a 100-fold increase in computation. And that's optimistic. Since I don't see Moore's law catching up to genomics, well, ever, barring a revolutionary breakthrough, we need to simplify and strip down a lot of analysis methods.
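To make the scaling point concrete, here is a toy sketch (mine, not code from any real pipeline) of why 'N-squared' methods hurt: an all-vs-all comparison of n genomes requires n(n-1)/2 pairwise jobs, so ten times the genomes means roughly a hundred times the compute.

```python
# Toy illustration of quadratic scaling in all-vs-all analyses.
# No real genomes involved: we just count how many pairwise comparison
# jobs an all-vs-all method would have to run as the dataset grows.

def pairwise_jobs(n_genomes: int) -> int:
    """Number of all-vs-all comparisons for n genomes: n*(n-1)/2."""
    return n_genomes * (n_genomes - 1) // 2

for n in (100, 1_000, 10_000):
    print(f"{n:>6,} genomes -> {pairwise_jobs(n):>12,} pairwise comparisons")

# Output (roughly):
#    100 genomes ->        4,950 pairwise comparisons
#  1,000 genomes ->      499,500 pairwise comparisons  (10x data, ~100x work)
# 10,000 genomes ->   49,995,000 pairwise comparisons  (10x data, ~100x work again)
```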

I think somebody should figure this out...


Mike,
I think you are 100% on point. I was at the NGS Data Mgmt Conference last week and this is starting to rear its (ugly?) head. We're helping MS research at SUNY Buffalo deal with the combinatorial explosion that arises when you mathematically analyze multiple genotype expressions against phenotype to start to identify causality. Fortunately, you can get supercomputer processing with cheaper storage today, but this is a completely new market.

Thanks
Shawn Dolley
VP & GM, Health & Life Science
Netezza
Twitter: NZHealthLifeSci

Hello Mike,

Everyone I spoke with at the Next Gen Seq & Data Mgmt Conference last week is challenged by the data explosion, or sees the writing on a VERY LARGE wall - they have a 3rd gen seq system(s) on order. My colleague, Murali Ramanathan of the University at Buffalo, SUNY, presented at the conference his and his colleagues' work (the Ambience Suite of Algorithms and Methods) using novel, massively parallel FPGA-enabled Data Intensive Supercomputers (DISC) for Gene/Gene, Environment/Environment and Gene/Environment interaction analysis. The algorithms address the combinatorial explosion well and provide a means, through information-theoretic methods, to come to rational conclusions. Using FPGA-enabled DISCs, the Ambience Suite's execution and evaluation process is reduced from weeks to a few days. Actual compute processing is reduced from many, many hours to 11 minutes.
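To make the information-theoretic idea concrete for readers following along: the sketch below is emphatically not the Ambience Suite, just a toy illustration, with made-up genotype and phenotype vectors, of the basic scoring quantity such methods build on, the mutual information between a marker and a phenotype. Scanning every pair or triple of markers with a score like this is exactly where the combinatorial explosion comes from, and why the hardware matters.

```python
# Toy sketch (NOT the Ambience Suite): mutual information between a single
# genotype column and a phenotype, the basic information-theoretic building
# block behind gene/gene and gene/environment interaction scores.
# All data below are made up for illustration.

import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

genotype  = [0, 1, 2, 2, 1, 0, 2, 2, 0, 1, 2, 2]  # SNP dosage per sample (made up)
phenotype = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]  # case/control status (made up)

print(f"I(genotype; phenotype) = {mutual_information(genotype, phenotype):.3f} bits")
```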

As the Co-Founder and Co-Director of the Data Intensive Discovery Initiative at the University at Buffalo, SUNY, I have worked extremely hard to get the scientific and engineering community to think, for lack of a better phrase, outside the known technology box. There are only a few massively parallel, FPGA-enabled DISCs deployed in science today. At Buffalo we have two of the largest in science, coupled with a top-50 HPC supercomputer (a bit bigger than Broad's HPC cluster that Martin told us about at the breakfast breakout table) and a GPU cluster (100 Tflops). In general, the FPGA DISC platforms perform 10X-300X or more faster than traditional HPC architectures of similar configurations (CPUs) for data-intensive applications.

Our work benchmarking one of these FPGA devices, versus HPC devices using SAN, IBRIX and Hadoop, is readily available in the August 2010 issue of IEEE's Computing in Science & Engineering. Previous work on an earlier version of another of the FPGA DISC architectures we have installed was presented at the HPC conference in India in 2009. Updated but unpublished follow-on benchmarks of the 2009 paper delivered similar results to the 2010 IEEE paper.

Unfortunately, I can count on a few fingers the places where these FPGA DISC devices are deployed in the scientific community for genomic analytics. I do know that the really big search engine providers have racks and racks of these FPGA devices to optimize the generation of many nickels - really fast. The largest clinical healthcare database in the world (Premier) resides on one of the platforms too. The vendors we are using at the Di2 at Buffalo are Netezza and XtremeData.

Great post. Really challenging problem.

Regards,

Todd Scofield

By Todd Scofield (not verified) on 07 Oct 2010 #permalink

Unless there is a well-funded parallel program of biomedical research that can make sense of the genomics data from a proteomics perspective, the genome sequencing efforts will yield primarily correlative data that will provide limited risk assessment at best. In view of the complexities of cellular regulation and metabolism, it will not offer conclusive data about the actual cause and progression of an individual's disease and how best to treat it. Unfortunately, much of the current effort to understand the roles and regulation of proteins is undertaken in simple animal models that are attractive primarily because of their ease of genetic manipulation. However, such studies have little relevance to the human condition. Without a better understanding of how mutations in genes affect protein function and protein interactions in a human context, genome-based diagnostics will in most situations probably not be much more beneficial than phrenology.

Phrenology is a practice that was extremely popular about 200 years ago. It was based on the idea that the shape of an individual's skull and the bumps on their head could reveal information about their conduct and intellectual capacities. Phrenological thinking was influential in 19th-century psychiatry and modern neuroscience. While this practice is pretty much completely ridiculed now, it is amazing how many people still use astrology, the I Ching, Tarot cards, biorhythms and other questionable practices to guide their lives, including medical decisions. I fear that an even wider portion of the general population will put their faith in whole genome-based analyses, especially with the strong encouragement of companies that could realize huge profits from offering such services. The most likely consequence, apart from yet another way for the sick to be parted from their money, is a lot more anxiety in the healthy population as well.

While I am sure that many of my colleagues will view my comparison of gene sequencing with obvious pseudo-sciences as inappropriate, the pace at which such genomics services are being offered to the general population warrants such consideration. We know much too little about the consequences of some 15 million mutations and other polymorphisms in the human genome to make sensible predictions about health risks. For only a few dozen human genes, primarily those affected in cancer, do we have sufficient data to make reasonable pronouncements about the cause of a disease and the means to do something effective about it in the way of targeted therapy.

While it is easy to become exuberant about the power and potential of genomic analyses, the limitations of this type of technology alone for improving human health will soon become painfully obvious. Ultimately, economics will be the main driver of whether it is truly worthwhile to pursue whole genome sequencing. This will not be dictated simply by the cost of whole genome sequencing but, as pointed out by others, by the costs of storing and analyzing the data, and by whether significant improvements in health care outcomes actually materialize.

I am much less optimistic about the prospects of this. When I grew up in the 1960s, there was excitement about human colonies on the moon and manned missions to Mars before the end of the 20th century. Nuclear power, including fusion, was going to solve our energy problems by now. I believe that in 30 years, when we look back at current plans to sequence tens to hundreds of thousands of human genomes, we will be amazed at the naivety of the proponents of this undertaking.

Dear Steven (and others participating in the dialogue):

Your points are very valid (the phrenology comparison was a bit extreme, :-), but representative of the moment). The human body is a system. Many believe, and it is well documented, that understanding the pathways and message delivery mechanisms between core control centers is the key to creating and delivering effective therapeutics, and hopefully real solutions. Proteins are mechanisms for information (chemical) delivery in the human body.

My point is simple. If we do not have the appropriate computational architectures in place in the near future, a large portion of the essentially fixed-size pool of capable computational biologists and related data management personnel will spend way too much time configuring racks of storage or modifying database structures to support current analytic capabilities. The biological discovery community will become consumed by processing the many fire hoses of data versus analyzing them. Only a very few, who have access to the largest computational assets of many thousands of cores (and corresponding internal memory) will be able to solve complex biological networking problems. A combination of architectures is needed to solve these problems effectively and efficiently at a reasonable cost.

There is a significant amount of genomic and related data that has been curated (to different levels) to date (for example NCBI/EBI data sets). Those data sets need to be profiled, cleansed and put into a single database, on the appropriate architectures so that some young guns (without institutional agendas) can find the interacting needles of knowledge in the hayfields of data.
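As a rough sketch of what "profile, cleanse, and load" could look like (the file name, column names, and SQLite target here are hypothetical, not any existing NCBI/EBI tooling):

```python
# Hedged sketch of a profile-cleanse-load step for curated records.
# The TSV layout (accession, organism, length) and file names are made up;
# real NCBI/EBI exports would need their own parsers and schemas.

import csv
import sqlite3

def profile_and_load(tsv_path, db_path):
    seen, missing, duplicates = set(), 0, 0
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(accession TEXT PRIMARY KEY, organism TEXT, sequence_length INTEGER)"
    )
    with open(tsv_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            acc = (row.get("accession") or "").strip()
            if not acc or not row.get("organism"):
                missing += 1        # profile: count incomplete records
                continue
            if acc in seen:
                duplicates += 1     # cleanse: drop duplicate accessions
                continue
            seen.add(acc)
            conn.execute(
                "INSERT OR IGNORE INTO records VALUES (?, ?, ?)",
                (acc, row["organism"], int(row.get("length") or 0)),
            )
    conn.commit()
    conn.close()
    print(f"loaded {len(seen)}; skipped {missing} incomplete, {duplicates} duplicate")

# profile_and_load("curated_records.tsv", "curated.db")  # hypothetical paths
```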

Regards,

Todd

By Todd Scofield (not verified) on 07 Oct 2010 #permalink

@Todd: totally agree on Phrenology.... And also on:

The biological discovery community will become consumed by processing the many fire hoses of data versus analyzing them. Only a very few, who have access to the largest computational assets of many thousands of cores (and corresponding internal memory) will be able to solve complex biological networking problems.

The other problem inherent in this is that a processing group will be relegated to being the new "stamp collectors," much as natural history museums were in their day. I love those museums, and think they are adding some amazing insights to biological problems today. But they don't get the respect they deserve.

And those special few will be resented for elitism/exclusivity. That will be fractious as the fight for resources ensues.

Another problem I haven't seen touched is that some of the data analysis strategies underway plan to move more processing off the host server and onto the client side. This is happening at the same time that local IT support wants to move people to easier-to-maintain, thin-client desktop setups that are hard to customize for special purposes. If more of that processing moves to the end user, end users are in no way prepared for it.

There are multiple layers to this, none of which is a clear path. And something that always bothers me is that no one ever speaks for the end user. Some people think they do, but they are usually offering end-user musings as providers rather than as actual end users.