Will Cloud Computing Help Genomics Handle Post-Moore's Law Data Loads?

By mikethemadbiologist on May 25, 2010.

Genome Biology recently published a review, "The Case for Cloud Computing in Genome Informatics." What is cloud computing? Well:

This is a general term for computation-as-a-service. There are various different types of cloud computing, but the one that is closest to the way that computational biologists currently work depends on the concept of a 'virtual machine'. In the traditional economic model of computation, customers purchase server, storage and networking hardware, configure it the way they need, and run software on it. In computation-as-a-service, customers essentially rent the hardware and storage for as long or as short a time as they need to achieve their goals. Customers pay only for the time the rented systems are running and only for the storage they actually use.

This model would be lunatic if the rented machines were physical ones. However, in cloud computing, the rentals are virtual: without ever touching a power cable, customers can power up a fully functional 10-computer server farm with a terabyte of shared storage, upgrade the cluster in minutes to 100 servers when needed for some heavy duty calculations, and then return to the baseline 10-server system when the extra virtual machines are no longer needed.
The way it works is that a service provider puts up the capital expenditure of creating an extremely large compute and storage farm (tens of thousands of nodes and petabytes of storage) with all the frills needed to maintain an operation of this size, including a dedicated system administration staff, storage redundancy, data centers distributed to strategically placed parts of the world, and broadband network connectivity. The service provider then implements the infrastructure to give users the ability to create, upload and launch virtual machines on this compute farm. Because of economies of scale, the service provider can obtain highly discounted rates on hardware, electricity and network connectivity, and can pass these savings on to the end users to make virtual machine rental economically competitive with purchasing the real thing.
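To make the pay-for-what-you-run idea concrete, here is a minimal sketch of the billing arithmetic for the 10-server cluster that briefly grows to 100 servers. The 30-day month, the 48-hour burst, and the price of $0.10 per server-hour are all assumptions chosen for illustration; they are not taken from the review or from any real provider's price list.

```python
# Rough comparison of pay-per-use rental against keeping peak capacity running.
# Every number used here is an illustrative assumption, not a real price.

HOURS_PER_MONTH = 30 * 24  # 720 hours in a 30-day billing period


def server_hours(baseline_servers: int, peak_servers: int, burst_hours: int,
                 total_hours: int = HOURS_PER_MONTH) -> int:
    """Server-hours billed when a baseline cluster scales up to a peak size
    for burst_hours and runs at the baseline size the rest of the time."""
    if peak_servers < baseline_servers:
        raise ValueError("peak size must be at least the baseline size")
    if not 0 <= burst_hours <= total_hours:
        raise ValueError("burst must fit inside the billing period")
    extra_servers = peak_servers - baseline_servers
    return baseline_servers * total_hours + extra_servers * burst_hours


def cost(hours: int, price_per_server_hour: float) -> float:
    """Total charge for a number of server-hours at a flat hourly price."""
    return hours * price_per_server_hour


if __name__ == "__main__":
    price = 0.10  # hypothetical dollars per server-hour
    elastic = server_hours(10, 100, burst_hours=48)
    always_peak = server_hours(100, 100, burst_hours=0)
    print(f"elastic:     {elastic} server-hours, ${cost(elastic, price):,.2f}")
    print(f"always peak: {always_peak} server-hours, ${cost(always_peak, price):,.2f}")
```

Under these assumptions the elastic cluster is billed for 11,520 server-hours against 72,000 for a cluster held at peak size all month. The gap depends entirely on how short the bursts are, so the sketch says nothing about whether renting beats buying for a steadily busy cluster.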

So why would genomics need cloud computing? The answer is simple: our ability to generate sequence data is already growing faster than computing power, which roughly follows Moore's Law. In other words, the amount of data that needs to be manipulated, as well as transferred from server to server, is so vast that we need a new model. It's just not efficient to duplicate massive computing cores at multiple academic centers. Having talked with one NIH informaticist, I can tell you that NIH definitely wants to move to this model--they do not want to keep building new cores every time they fund a moderately sized genomics project.

But before we get lost in the clouds (so to speak), there are a couple of problems. The first is that many genome projects have metadata that can't be released to the public. Any cloud computing system, whether privately owned or public, will have to grapple with this. But the really challenging problem is a very straightforward technical one--uploading and downloading data to and from the cloud:

For genomics, the biggest obstacle to moving to the cloud may well be network bandwidth. A typical research institution will have network bandwidth of about a gigabit/second (roughly 125 megabytes/second). On a good day this will support sustained transfer rates of 5 to 10 megabytes/second across the internet. Transferring a 100 gigabyte next-generation sequencing data file across such a link will take about a week in the best case. A 10 gigabit/second connection (1.25 gigabytes/second), which is typical for major universities and some of the larger research institutions, reduces the transfer time to under a day, but only at the cost of hogging much of the institution's bandwidth. Clearly cloud services will not be used for production sequencing any time soon.
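The transfer times in the passage above are easy to check with a few lines of arithmetic. The sketch below assumes decimal units (1 gigabyte = 10^9 bytes, 1 megabyte = 10^6 bytes) and a constant sustained rate with no protocol overhead or retries; the function names are mine, not the review's.

```python
# Back-of-the-envelope transfer times for a file at a constant sustained rate.
# Assumes decimal units and ignores protocol overhead, retries, and contention.

GB = 10**9  # bytes per gigabyte (decimal)
MB = 10**6  # bytes per megabyte (decimal)
SECONDS_PER_HOUR = 3600
SECONDS_PER_WEEK = 7 * 24 * SECONDS_PER_HOUR


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a constant rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s / SECONDS_PER_HOUR


def rate_needed(size_bytes: float, seconds: float) -> float:
    """Sustained rate in bytes per second needed to finish within `seconds`."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return size_bytes / seconds


if __name__ == "__main__":
    size = 100 * GB
    for rate_mb in (5, 10, 125, 1250):
        hours = transfer_hours(size, rate_mb * MB)
        print(f"{rate_mb:>5} MB/s -> {hours:8.3f} hours")
    week_rate = rate_needed(size, SECONDS_PER_WEEK) / MB
    print(f"rate for a one-week transfer: {week_rate:.3f} MB/s")
```

With these assumptions, 100 gigabytes at a sustained 5 to 10 megabytes/second works out to roughly 3 to 6 hours, while a transfer that takes a full week corresponds to about 0.17 megabytes/second. That is well below the quoted range, so it is worth rerunning the numbers with your own measured throughput before drawing conclusions.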

As I was reading this, I remembered that, when several centers were collaborating to test new sequencing technologies, the datasets were so large that they actually shipped hard drives to each other to compare results. Well, that's what might have to happen to upload data:

If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail the way that the Protein Database once accepted atomic structure submissions on tape and floppy disk. In fact, a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping around 3-terabyte disk arrays.
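Shipping disks sounds primitive, but the effective bandwidth is respectable. Here is a minimal sketch that treats a shipped disk array as a network link; the 24-hour and 72-hour transit times are assumptions for illustration, and the calculation ignores the time spent copying data onto and off the disks.

```python
# Effective throughput of mailing a disk, treated as if it were a network link.
# Transit times are hypothetical; copy time at either end is ignored.

TB = 10**12  # bytes per terabyte (decimal)
MB = 10**6   # bytes per megabyte (decimal)


def shipping_rate_mb_per_s(capacity_bytes: float, transit_hours: float) -> float:
    """Average megabytes per second delivered by a disk that is in transit
    for transit_hours."""
    if transit_hours <= 0:
        raise ValueError("transit time must be positive")
    return capacity_bytes / (transit_hours * 3600) / MB


if __name__ == "__main__":
    for hours in (24, 72):
        rate = shipping_rate_mb_per_s(3 * TB, hours)
        print(f"3 TB in {hours} h -> {rate:.1f} MB/s")
```

On these assumptions, a 3-terabyte array delivered overnight averages about 35 megabytes/second, and even a three-day delivery averages about 12 megabytes/second, which is in the same range as or better than the 5 to 10 megabytes/second sustained internet rate quoted earlier.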

The other possibility is that the raw data, or even 'first-step' processed data, might not be made publicly available anymore--think of this as the physics model:

At some future point it will become simply unfeasible to store all raw sequencing reads in a central archive or even in local storage. Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest.

As genomics and other data-intensive disciplines of biology move towards cloud computing (and I think it will definitely happen), it will be interesting to see how NIH funding shifts. Hopefully, this means more resources will be shifted to people who know how to use the data.

Yeah, I know: Magic Pony Time....

Cited article: Stein, L.D. 2010. The case for cloud computing in genome informatics. Genome Biology 11:207. doi:10.1186/gb-2010-11-5-207.

The short answer is yes. People are already using other distributed systems for doing gene mapping, studying protein folding, and other things that are interesting. Cloud computing is just one other method of running the computations and analysis in parallel or with load balancing.

I am actually trying to set up a distributed system for some of the comp sci classes that I am teaching in Nigeria. Liked your post.
