scio10: Science in the Cloud

John Hogenesch, Assistant Professor of Pharmacology - Penn School of Med

gene-at-a-time is giving way to genome-wide - larger datasets, collaborative research

last year more was added to GenBank than in all previous years combined (wow!) - exceeds Moore's law.

Academia responds by buying storage and clusters - but that requires great IT staff (really hard to get and keep; they go to industry), plus heating & cooling, depreciation, and usage/provisioning problems (under- or over-utilized). Larger inter-institutional grids - access is tightly regulated, and they are very complex to program for

Cloud computing: software as a service, infrastructure as a service, platform as a service

They use SaaS for collaboration - Basecamp from 37signals. Collaborating with multiple labs, multiple people. Compare $50/month with no IT support costs to SharePoint: $1k server, $500 license, admin at 5% effort $2k.
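A quick back-of-the-envelope sketch of that comparison, using only the figures quoted above. The notes do not say over what period the SharePoint admin cost applies or whether the server and license are one-time purchases, so the first-year framing here is an assumption.

```python
# Figures quoted in the talk notes; the time basis for each is an assumption.
BASECAMP_PER_MONTH = 50      # USD, hosted SaaS, no IT support costs
SHAREPOINT_SERVER = 1_000    # USD, assumed one-time purchase
SHAREPOINT_LICENSE = 500     # USD, assumed one-time purchase
SHAREPOINT_ADMIN = 2_000     # USD, 5% of an admin's effort (assumed per year)


def first_year_costs(months=12):
    """Return (saas_cost, self_hosted_cost) in USD for the first `months`."""
    saas = BASECAMP_PER_MONTH * months
    self_hosted = SHAREPOINT_SERVER + SHAREPOINT_LICENSE + SHAREPOINT_ADMIN
    return saas, self_hosted


saas, self_hosted = first_year_costs()
print(f"Basecamp: ${saas}, SharePoint: ${self_hosted}")
```

Under those assumptions the hosted option comes to $600 for the year against $3,500 for the self-hosted one.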

IaaS for proteomics - example - search complex samples against a six-frame translated genome. They provisioned 20 AWS nodes running Windows; the search ran over 7 days at a cost of $1400.

In genomics - lots of recent publications using CloudBurst, Crossbow (?), and Hadoop for BLAST/BLAT/R scripts....

BLAT on AWS - using CloudCrowd (NY Times alternative to Hadoop), provisioned 20 large memory instances of Ubuntu, 85% of sequences were mapped, ~72 hours/$424 (experiments cost $30k with machine and reagents and all - so over the course of the 30 you can do in a year, $600k savings)
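To put the two AWS runs on a common footing, here is a rough sketch that turns the quoted totals into a cost per instance-hour. It assumes all 20 instances ran for the full duration in each case, which the notes do not state.

```python
def cost_per_instance_hour(total_usd, instances, hours):
    """Average cost of one instance for one hour, in USD."""
    return total_usd / (instances * hours)


# Proteomics search: 20 Windows nodes, 7 days, $1400 (figures from the talk).
proteomics = cost_per_instance_hour(1400, instances=20, hours=7 * 24)

# BLAT run: 20 large-memory Ubuntu instances, ~72 hours, $424.
blat = cost_per_instance_hour(424, instances=20, hours=72)

print(f"proteomics: ${proteomics:.2f}/instance-hour")
print(f"BLAT:       ${blat:.2f}/instance-hour")
```

On those assumptions the runs work out to roughly $0.42 and $0.29 per instance-hour respectively.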

q: how much programming to get it ready to go on AWS?

a: about 8 hours with a somewhat experienced programmer - a very experienced one could do it in 1-2 hours - programming is done in Ruby

PaaS - aggregating clouds - genome-wide screen for modifiers of the circadian clock, 300 found (Zhang et al., Cell, 2009). Gene-centric data integration - go to each data site, search for your gene, and then compile. ID/synonym resolution is hard. BioGPS - federated search of these gene sources - URL-based scheme, extensible. Puts results from different sources in boxes on BioGPS. Has a catalog search so you can see if you can buy from Invitrogen (sponsor, thank you!) and others. (http://biogps.gnf.org/circadian)
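The general idea of a URL-based, extensible federated lookup can be sketched in a few lines. This is only an illustration of the technique, not BioGPS's actual implementation: the source names, the `example.org` URL templates, and the small synonym table are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical URL templates; real data sources and their URLs will differ.
SOURCES = {
    "source_a": "https://example.org/a/gene?symbol={symbol}",
    "source_b": "https://example.org/b/search/{symbol}",
}

# Illustrative synonym table mapping aliases to one canonical symbol.
# A real resolver would need a curated ID-mapping resource.
SYNONYMS = {"ARNTL": "BMAL1", "MOP3": "BMAL1"}


def resolve(symbol):
    """Map an alias to its canonical symbol (case-insensitive)."""
    key = symbol.upper()
    return SYNONYMS.get(key, key)


def build_urls(symbol, sources=SOURCES):
    """Return one lookup URL per source for the resolved gene symbol."""
    canonical = quote(resolve(symbol))
    return {name: tpl.format(symbol=canonical) for name, tpl in sources.items()}


for name, url in build_urls("mop3").items():
    print(name, url)
```

Adding a new source in this sketch only means adding another template to the dictionary, which is what makes a URL-based scheme easy to extend.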

PaaS use case - publishing in the cloud - PLoS Currents: Influenza. PMIDs used for references, Google Knol to write, moderators decide suitable/unsuitable - not review. PLoS will consider expanded versions in their journals. ~52 publications so far. Example has been viewed 7k times.

q: biobase - only mammalian?

a: yes, but the code is available (.NET) so you could customize it

q: small vs. large institutions - does this help people who are under-resourced for equipment?

a: with this we can give you the algorithm and then you can run it on the same service - so this is different from just sharing algorithms

q: writing grants etc. how does that go with cloud services?

a: capital costs (buying servers) typically come out of a different bucket, so this might complicate things. Some in the room have had success, no problems. Some have met skepticism. In the UK they're very concerned about the PATRIOT Act provisions.

q: do you need an AWS specialist?

a: they had someone with an MS in bioinformatics and a BS in bio - picked up how to do the first project in a week, the second was done in 8 hours. Could probably replace that person fairly easily

q: concern with using a free service online - stability/preservation of data

a: after you set up an account, test whether you can get your data back out; if it's super important, host it on your own site

q: using these in teaching?

a: using Google Wave, using PBwiki, using Blackboard, using the OpenWetWare wiki, (I use OneNote), also Google Docs (they tried wikis first, didn't fly, Google Docs works well for them)

q: proportion of work done in cloud vs. local computing resources

q: boundaries of the institution

a: now you're either academic or industrial - so this will probably allow independent investigators again: rent some lab time, rent some computing time, and then prototype something. Can also use publicly available data - always lots more things to find/use it for than just what the originators foresaw

This is an important development. I see the power of information in the cloud extending to all areas of research, and even medical diagnosis. Thanks for the post.