Mike the Mad Biologist

Genome Biology recently published a review, “The Case for Cloud Computing in Genome Informatics.” What is cloud computing? Well:

This is a general term for computation-as-a-service. There are various different types of cloud computing, but the one that is closest to the way that computational biologists currently work depends on the concept of a ‘virtual machine’. In the traditional economic model of computation, customers purchase server, storage and networking hardware, configure it the way they need, and run software on it. In computation-as-a-service, customers essentially rent the hardware and storage for as long or as short a time as they need to achieve their goals. Customers pay only for the time the rented systems are running and only for the storage they actually use.

This model would be lunatic if the rented machines were physical ones. However, in cloud computing, the rentals are virtual: without ever touching a power cable, customers can power up a fully functional 10-computer server farm with a terabyte of shared storage, upgrade the cluster in minutes to 100 servers when needed for some heavy duty calculations, and then return to the baseline 10-server system when the extra virtual machines are no longer needed.

The way it works is that a service provider puts up the capital expenditure of creating an extremely large compute and storage farm (tens of thousands of nodes and petabytes of storage) with all the frills needed to maintain an operation of this size, including a dedicated system administration staff, storage redundancy, data centers distributed to strategically placed parts of the world, and broadband network connectivity. The service provider then implements the infrastructure to give users the ability to create, upload and launch virtual machines on this compute farm. Because of economies of scale, the service provider can obtain highly discounted rates on hardware, electricity and network connectivity, and can pass these savings on to the end users to make virtual machine rental economically competitive with purchasing the real thing.

So why would genomics need cloud computing? The answer is simple: our ability to generate data is already growing faster than Moore’s Law. In other words, the amount of data that needs to be manipulated, as well as transferred from server to server, is so vast that we need a new model. It’s just not efficient to duplicate massive computing cores at multiple academic centers. Having talked with one NIH informaticist, I can tell you that NIH definitely wants to move to this model: they do not want to keep building new cores every time they fund a moderately sized genomics project.

But before we get lost in the clouds (so to speak), there are a couple of problems. The first is that many genome projects have metadata that can’t be released to the public. Any cloud computing system, whether privately owned or public, will have to grapple with this. But the really challenging problem is a straightforward technical one, namely moving data into and out of the cloud:

For genomics, the biggest obstacle to moving to the cloud may well be network bandwidth. A typical research institution will have network bandwidth of about a gigabit/second (roughly 125 megabytes/second). On a good day this will support sustained transfer rates of 5 to 10 megabytes/second across the internet. Transferring a 100 gigabyte next-generation sequencing data file across such a link will take about a week in the best case. A 10 gigabit/second connection (1.25 gigabytes/second), which is typical for major universities and some of the larger research institutions, reduces the transfer time to under a day, but only at the cost of hogging much of the institution’s bandwidth. Clearly cloud services will not be used for production sequencing any time soon.
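For anyone who wants to plug in their own file sizes and link speeds, here is a minimal sketch of the arithmetic behind estimates like these. The function name is mine, not the review's, and it assumes decimal units (1 gigabyte = 1000 megabytes) and a perfectly sustained rate with no protocol overhead, retries, or competing traffic.

```python
def transfer_time_seconds(size_gigabytes: float, rate_megabytes_per_s: float) -> float:
    """Idealized transfer time: size divided by a constant sustained rate.

    Assumes decimal units (1 GB = 1000 MB) and ignores overhead and contention.
    """
    if rate_megabytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gigabytes * 1000 / rate_megabytes_per_s


FILE_GB = 100  # the 100 gigabyte sequencing file from the quoted passage

# Rates taken from the quoted passage: sustained internet rates and raw line rates.
scenarios = [
    ("5 MB/s sustained", 5),
    ("10 MB/s sustained", 10),
    ("1 gigabit/s line rate (125 MB/s)", 125),
    ("10 gigabit/s line rate (1250 MB/s)", 1250),
]

for label, rate in scenarios:
    hours = transfer_time_seconds(FILE_GB, rate) / 3600
    print(f"{label}: {hours:.2f} hours")
```

Under these idealized assumptions, a single 100 gigabyte file at 5 to 10 megabytes/second comes out in hours rather than days, so the week-long figure in the quote presumably reflects real-world conditions, or a larger total data volume, that this simple division does not capture.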

As I was reading this, I remembered that, when several centers were collaborating to test new sequencing technologies, the datasets were so large that the centers actually shipped hard drives to each other to compare results. Well, that’s what might have to happen to upload data:

If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail the way that the Protein Database once accepted atomic structure submissions on tape and floppy disk. In fact, a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping around 3-terabyte disk arrays.
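To see why mailing disks is not as silly as it sounds, here is a rough sketch comparing the effective throughput of a shipped disk array with a network link. The 3-terabyte capacity and the 5 to 10 megabytes/second network rates come from the quoted passages; the transit times are hypothetical values I picked for illustration, and the sketch ignores the time needed to copy data onto and off the disks.

```python
SECONDS_PER_DAY = 24 * 3600


def shipping_throughput_mb_per_s(capacity_terabytes: float, transit_days: float) -> float:
    """Effective megabytes/second of a disk in transit (decimal units, 1 TB = 1,000,000 MB)."""
    return capacity_terabytes * 1_000_000 / (transit_days * SECONDS_PER_DAY)


def break_even_days(capacity_terabytes: float, network_mb_per_s: float) -> float:
    """Transit time at which shipping the disk merely matches the network rate."""
    return capacity_terabytes * 1_000_000 / (network_mb_per_s * SECONDS_PER_DAY)


ARRAY_TB = 3  # disk array size mentioned in the quoted passage

# Hypothetical courier transit times, in days (assumed, not from the source).
for days in (1, 2, 5):
    rate = shipping_throughput_mb_per_s(ARRAY_TB, days)
    print(f"{days} day(s) in transit: {rate:.1f} MB/s effective")

# How slow could the courier be before a 10 MB/s link wins?
print(f"break-even vs 10 MB/s: {break_even_days(ARRAY_TB, 10):.2f} days")
```

The point of the break-even calculation is that, for a fixed disk size, shipping beats the network whenever the package arrives faster than the link could have moved the same bytes.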

The other possibility is that the raw data, or even ‘first-step’ processed data, might not be made publicly available anymore. Think of this as the physics model:

At some future point it will become simply unfeasible to store all raw sequencing reads in a central archive or even in local storage. Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest.

As genomics and other data-intensive disciplines of biology move towards cloud computing (and I think it will definitely happen), it will be interesting to see how NIH funding shifts. Hopefully, this means more resources will be shifted to people who know how to use the data.

Yeah, I know: Magic Pony Time….

Cited article: Stein, L.D. 2010. The case for cloud computing in genome informatics. Genome Biology 11:207. doi:10.1186/gb-2010-11-5-207.


  1. #1 tenacitus
    May 26, 2010

    The short answer is yes. People are already using other distributed systems for doing gene mapping, studying protein folding, and other things that are interesting. Cloud computing is just one other method of running the computations and analysis in parallel or with load balancing.

    I am actually trying to set up a distributed system for some of the comp sci classes that I am teaching in Nigeria. Liked your post.