I’ve blogged before about the technical issues involved in handling the massive increase in the size of genomics datasets. We also need to grapple with the analytical side of all these data:
So, from a bacterial perspective, genome sequencing is really cheap and fast – in about a year, I conservatively estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000-$6,000). We are entering an era where the time and money costs won’t be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.
Titus Brown does the math and then puts the issue very succinctly:
The bottom line is this: when your data cost is decreasing faster than your hardware cost, the long-term solution cannot be to buy, rent, borrow, beg, or steal more hardware. The solution must lie in software and algorithms.
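To put rough numbers on that argument, here is a minimal sketch of how two exponential cost curves pull apart. The halving times below are purely hypothetical placeholders, not figures from Brown’s post or measured values, and the function names are my own; swap in whatever rates you believe.

```python
def growth_factor(years: float, halving_months: float) -> float:
    """How many times more you get per dollar after `years`,
    if the unit cost halves every `halving_months` months."""
    return 2.0 ** (years * 12.0 / halving_months)


def compute_gap(years: float, seq_halving_months: float, hw_halving_months: float) -> float:
    """Ratio of data-per-dollar growth to compute-per-dollar growth.

    A value above 1 means a fixed budget buys more data than it can analyze,
    assuming analysis cost scales linearly with the amount of data.
    """
    return growth_factor(years, seq_halving_months) / growth_factor(years, hw_halving_months)


if __name__ == "__main__":
    # Hypothetical halving times, chosen only to illustrate the shape of the problem.
    SEQ_HALVING_MONTHS = 6.0   # assumed: sequencing cost halves every 6 months
    HW_HALVING_MONTHS = 18.0   # assumed: hardware cost halves every 18 months

    for years in (1, 3, 5):
        gap = compute_gap(years, SEQ_HALVING_MONTHS, HW_HALVING_MONTHS)
        print(f"after {years} yr: data per dollar outgrows compute per dollar by {gap:.1f}x")
```

Whatever rates you plug in, the gap itself grows exponentially as long as the sequencing curve falls faster than the hardware curve, which is why buying more machines only postpones the problem.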
He argues that proponents of cloud computing as our salvation must be relying on something else:
People who claim that cloud computing is going to provide an answer to the scaling issue with sequence, then, must be operating with some additional assumptions. Maybe they think the curves are shifted relative to one another, so that even 1000x costs are not a big deal – although figure 1 sort of argues against that. Like me, maybe they’ve heard that hard disks are about to start scaling way, way better – if so, awesome! That might change the curves for data storage, if not analysis. Perhaps their research depends on using only a bounded amount of sequence – e.g. single-genome sequencing, for which you can stop generating data at a certain point. Or perhaps they’re proposing to use algorithms that scale sub-linearly with the amount of data they’re applied to (although I don’t know of any). Or perhaps they’re planning for the shift in Moore’s Law behavior that will come when Amazon and other cloud computing providers build self-replicating compute clusters on the moon (hello, exo-atmospheric computing!) Whatever the plan, it would be interesting to hear their assumptions explained.
That could be, but I think the optimism about cloud computing has more to do with a basic reality: very few groups are currently analyzing massive datasets. Simply put, we haven’t recognized the problems because we’re only now blundering into them.
Regardless, Brown is right: we will be spending a lot more money on software and people than on sequencing, cloud or not.