Will Cloud Computing Help Genomics Handle Post-Moore's Law Data Loads?

By mikethemadbiologist on May 25, 2010.

Genome Biology recently published a review, "The Case for Cloud Computing in Genome Informatics." What is cloud computing? Well:

This is a general term for computation-as-a-service. There are various different types of cloud computing, but the one that is closest to the way that computational biologists currently work depends on the concept of a 'virtual machine'. In the traditional economic model of computation, customers purchase server, storage and networking hardware, configure it the way they need, and run software on it. In computation-as-a-service, customers essentially rent the hardware and storage for as long or as short a time as they need to achieve their goals. Customers pay only for the time the rented systems are running and only for the storage they actually use.

This model would be lunatic if the rented machines were physical ones. However, in cloud computing, the rentals are virtual: without ever touching a power cable, customers can power up a fully functional 10-computer server farm with a terabyte of shared storage, upgrade the cluster in minutes to 100 servers when needed for some heavy duty calculations, and then return to the baseline 10-server system when the extra virtual machines are no longer needed.

The way it works is that a service provider puts up the capital expenditure of creating an extremely large compute and storage farm (tens of thousands of nodes and petabytes of storage) with all the frills needed to maintain an operation of this size, including a dedicated system administration staff, storage redundancy, data centers distributed to strategically placed parts of the world, and broadband network connectivity. The service provider then implements the infrastructure to give users the ability to create, upload and launch virtual machines on this compute farm. Because of economies of scale, the service provider can obtain highly discounted rates on hardware, electricity and network connectivity, and can pass these savings on to the end users to make virtual machine rental economically competitive with purchasing the real thing.

So why would genomics need cloud computing? The answer is simple: our ability to generate data has already outstripped Moore's Law. In other words, the amount of data that needs to be manipulated, as well as transferred from server to server, is so vast that we need a new model. It's just not efficient to duplicate massive computing cores at multiple academic centers. Having talked with one NIH informaticist, I can tell you that NIH definitely wants to move to this model--they do not want to keep building new cores every time they fund a moderately sized genomics project.

But before we get lost in the clouds (so to speak), there are a couple of problems. The first is that many genome projects have metadata that can't be released to the public. Any cloud computing system, whether privately owned or public, will have to grapple with this. But the really challenging problem is a very straightforward technical one--uploading and downloading data to and from the cloud:

For genomics, the biggest obstacle to moving to the cloud may well be network bandwidth. A typical research institution will have network bandwidth of about a gigabit/second (roughly 125 megabytes/second). On a good day this will support sustained transfer rates of 5 to 10 megabytes/second across the internet. Transferring a 100 gigabyte next-generation sequencing data file across such a link will take about a week in the best case. A 10 gigabit/second connection (1.25 gigabytes/second), which is typical for major universities and some of the larger research institutions, reduces the transfer time to under a day, but only at the cost of hogging much of the institution's bandwidth. Clearly cloud services will not be used for production sequencing any time soon.
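
As a rough check on the figures quoted above, here is a small Python sketch that converts a file size and a sustained throughput into a transfer time. The function name and the decimal units (1 gigabyte = 10^9 bytes) are assumptions made here, not something taken from the review. At the quoted 5 to 10 megabytes/second, the arithmetic for a single 100 gigabyte file comes out in hours rather than days, so the week-long estimate presumably reflects a larger dataset or a lower effective rate; a week at those rates corresponds to roughly 3 to 6 terabytes.

```python
# Back-of-envelope transfer times. Units are decimal (an assumption):
# 1 MB = 10**6 bytes, 1 GB = 10**9 bytes.
MB = 1e6
GB = 1e9


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Return the time in seconds to move size_bytes at a sustained rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


if __name__ == "__main__":
    file_size = 100 * GB  # the 100 gigabyte file from the quoted passage
    # Rates taken from the quoted passage; labels are illustrative.
    rates = [
        ("5 MB/s sustained", 5 * MB),
        ("10 MB/s sustained", 10 * MB),
        ("1.25 GB/s (10 gigabit line rate)", 1.25 * GB),
    ]
    for label, rate in rates:
        hours = transfer_seconds(file_size, rate) / 3600
        print(f"{label}: {hours:.2f} hours")
```

The same function can be turned around to ask how much data fits through the link in a week: multiply the sustained rate by 7 * 24 * 3600 seconds.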

As I was reading this, I remembered that when several centers were collaborating to test new sequencing technologies, the datasets were so large that they actually shipped hard drives to each other to compare results. Well, that's what might have to happen to upload data:

If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail the way that the Protein Database once accepted atomic structure submissions on tape and floppy disk. In fact, a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping around 3-terabyte disk arrays.
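
To see why mailing disks can compete with the network, a back-of-envelope Python sketch follows. The 3-terabyte capacity comes from the passage above; the two-day delivery time and the function name are hypothetical values chosen only for illustration, and the calculation ignores the time needed to copy data onto and off the disks.

```python
# Effective throughput of shipping a disk, in decimal units (an assumption):
# 1 MB = 10**6 bytes, 1 TB = 10**12 bytes.
MB = 1e6
TB = 1e12


def effective_rate_mb_per_s(capacity_bytes: float, transit_days: float) -> float:
    """Return the average megabytes/second achieved by shipping a disk."""
    if transit_days <= 0:
        raise ValueError("transit time must be positive")
    transit_seconds = transit_days * 24 * 3600
    return capacity_bytes / transit_seconds / MB


if __name__ == "__main__":
    # 3 TB is from the quoted passage; 2 days in transit is hypothetical.
    rate = effective_rate_mb_per_s(3 * TB, transit_days=2)
    print(f"Shipping 3 TB in 2 days averages {rate:.1f} MB/s")
```

Under these assumptions the courier averages about 17 megabytes/second, which is above the 5 to 10 megabytes/second sustained internet rate quoted earlier.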

The other possibility is that the raw data, or even 'first-step' processed data, might not be made publicly available anymore--think of this as the physics model:

At some future point it will become simply unfeasible to store all raw sequencing reads in a central archive or even in local storage. Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest.

As genomics and other data-intensive disciplines of biology move towards cloud computing (and I think it will definitely happen), it will be interesting to see how NIH funding shifts. Hopefully, this means more resources will be shifted to people who know how to use the data.

Yeah, I know: Magic Pony Time....

Cited article: Stein, L.D. 2010. The case for cloud computing in genome informatics. Genome Biology 11:207. doi:10.1186/gb-2010-11-5-207

The short answer is yes. People are already using other distributed systems for doing gene mapping, studying protein folding, and other things that are interesting. Cloud computing is just one other method of running the computations and analysis in parallel or with load balancing.

I am actually trying to set up a distributed system for some of the comp sci classes that I am teaching in Nigeria. Liked your post.
