Will the Cloud Save Genomics?

I've blogged before about some of the technical issues surrounding how we can handle the massive increase in the size of genomics datasets. There's also a need to grapple with the analytical aspects of all of these data:

So, from a bacterial perspective, genome sequencing is really cheap and fast: within about a year, I estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000-$6,000). We are entering an era where the time and money costs won't be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.

Titus Brown does the math and then puts the issue very succinctly:

The bottom line is this: when your data cost is decreasing faster than your hardware cost, the long-term solution cannot be to buy, rent, borrow, beg, or steal more hardware. The solution must lie in software and algorithms.

He argues that proponents of cloud computing as our salvation must be relying on something else:

People who claim that cloud computing is going to provide an answer to the scaling issue with sequence, then, must be operating with some additional assumptions. Maybe they think the curves are shifted relative to one another, so that even 1000x costs are not a big deal - although figure 1 sort of argues against that. Like me, maybe they've heard that hard disks are about to start scaling way, way better -- if so, awesome! That might change the curves for data storage, if not analysis. Perhaps their research depends on using only a bounded amount of sequence -- e.g. single-genome sequencing, for which you can stop generating data at a certain point. Or perhaps they're proposing to use algorithms that scale sub-linearly with the amount of data they're applied to (although I don't know of any). Or perhaps they're planning for the shift in Moore's Law behavior that will come when Amazon and other cloud computing providers build self-replicating compute clusters on the moon (hello, exo-atmospheric computing!) Whatever the plan, it would be interesting to hear their assumptions explained.

That could be, but I think it has more to do with the basic reality that very few groups are currently analyzing massive datasets. Simply put, we haven't realized the problems because we're just now blundering into them.

Regardless, Brown is right: we will be spending a lot more money on software and people than sequencing, cloud or not.
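Brown's bottom line can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: the function name and both halving times are hypothetical placeholders, not figures from Brown's post, and analysis cost is assumed to scale linearly with the amount of data.

```python
def relative_analysis_cost(years, data_cost_halving_years, hardware_cost_halving_years):
    """Cost of analyzing what a fixed sequencing budget buys, relative to year 0.

    Assumption: analysis cost scales linearly with data volume, and both
    sequencing cost and hardware cost fall exponentially.
    """
    # Data obtained per sequencing dollar, relative to year 0.
    data_per_dollar = 2 ** (years / data_cost_halving_years)
    # Compute obtained per hardware dollar, relative to year 0.
    compute_per_dollar = 2 ** (years / hardware_cost_halving_years)
    return data_per_dollar / compute_per_dollar


if __name__ == "__main__":
    # Hypothetical halving times: sequencing cost halves every 6 months,
    # hardware cost every 18 months. Swap in real figures to test the claim.
    for year in range(6):
        print(year, relative_analysis_cost(year, 0.5, 1.5))
```

Whenever the data halving time is shorter than the hardware halving time, the ratio grows without bound, which is the sense in which renting more hardware only postpones the problem. If the two halving times are equal, the ratio stays at 1.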

I stated as early as 2008, in peer-reviewed science papers (The Principle of Recursive Genome Function, and a Cold Spring Harbor Lab presentation) and in the popularized Google Tech Talk on YouTube, "Is IT ready for the Dreaded DNA Data Deluge" (http://www.youtube.com/watch?v=WJMFuc75V_w), the same point that Mike underscores: the bottleneck is NOT Information TECHNOLOGY but the Information Theory of fractal iterative recursion of genome regulation. The prediction at the 30:00 mark of the video, that unregulated cancerous fractal growth is caused by aberrant methylation of intergenic (formerly "junk") supplementary information, lends itself to software-enabling algorithms based on a crisp mathematical understanding of recursive genome function. With the availability (soon a veritable avalanche) of both cancerous and intact DNA datasets (showing methylation status, e.g. by PacBio sequencing technology), HolGenTech deploys defense-validated High-Performance-Computing platforms in private clouds to run software based on advanced algorithms.

Perhaps we'll see a shift in the way sequencing-based science is done. Rather than a "more is better," sequence-everything-you-can approach, we'll see people sequencing only what they can actually analyze. Which is something I'd like to see anyway. I'm a bit tired of "Well… there's the sequence"-based publications.

Thanks, "jigolo". "Confounding", you are right, and more. Not only can one get "a bit tired" of sequences without analytics, but the entire (mega-billion-dollar) sequencing industry might simply become unsustainable without matching investments in analytics. We are talking about the supply-demand balance of the Industrialization of Genomics. My favorite metaphor is "imagine the nonsense of Ford's assembly line of automobiles - but only a few dirt roads here and there with zero gas stations". The cloud is only a band-aid, a place where the oozing sequences can be stored; computing ability in itself will never solve the problem of the intrinsic (fractal) mathematics of genome regulation.