John Hogenesch, Assistant Professor of Pharmacology – Penn School of Med
gene-at-a-time is giving way to genome-wide – larger datasets, collaborative research
last year more was added to GenBank than in all previous years combined (wow!) – growth exceeds Moore’s law.
Academia responds by buying storage and clusters – but you need great IT staff, and it’s really hard to get and keep them (they go to industry). Other costs: heating & cooling, depreciation, usage/provisioning (under/over utilized). Larger inter-institutional grids – access is tightly regulated, and they are very complex to program for
Cloud computing: software as a service, infrastructure as a service, platform as a service
They use SAAS for collaboration – Basecamp from 37signals. Collaborating with multiple labs, multiple people. Compare $50/month with no IT support costs to SharePoint: $1k server, $500 license, admin at 5% effort $2k.
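The comparison can be tallied in a few lines. This is only a sketch of the arithmetic: the notes do not say what period the SharePoint figures cover, so treating them as a first-year total is an assumption, and all variable names are made up for illustration.

```python
# Figures as quoted in the talk notes.
basecamp_per_month = 50          # hosted service, no IT support costs

sharepoint_server = 1_000        # hardware
sharepoint_license = 500         # software license
sharepoint_admin = 2_000         # 5% of an administrator's effort

# Assumption: the SharePoint figures are a first-year total.
basecamp_annual = basecamp_per_month * 12
sharepoint_first_year = sharepoint_server + sharepoint_license + sharepoint_admin

print(f"Basecamp, one year:     ${basecamp_annual}")
print(f"SharePoint, first year: ${sharepoint_first_year}")
print(f"Difference:             ${sharepoint_first_year - basecamp_annual}")
```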
IAAS for proteomics – example: searching complex samples against a six-frame translated genome. They provisioned 20 AWS nodes running Windows; the search ran over 7 days at a cost of $1400.
In genomics – lots of recent publications using CloudBurst, Crossbow (?), and Hadoop for BLAST/BLAT/R scripts…
BLAT on AWS – using CloudCrowd (NY Times alternative to Hadoop), provisioned 20 large-memory instances of Ubuntu, 85% of sequences were mapped, ~72 hours/$424 (experiments cost $30k with machine and reagents and all – so over the course of the 30 you can do in a year, $600k savings)
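For readers unfamiliar with the term, a six-frame translation reads the DNA in three offsets on the forward strand and three on the reverse complement, which is why the search space gets so large. A minimal sketch using the standard genetic code follows; it is not the speaker's pipeline, and the function names are invented here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops are '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Return the six translations: forward frames 0-2, then reverse frames 0-2."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, rev) for offset in range(3)]


print(six_frame("ATGGCC"))
```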
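A rough way to compare the two AWS runs is cost per instance-hour. The sketch below assumes every instance ran for the full wall-clock time, which the notes do not state, so treat the results as ballpark figures; the helper name is hypothetical.

```python
def cost_per_instance_hour(total_cost, instances, hours):
    """Average cost of one instance for one hour, assuming all ran the whole time."""
    return total_cost / (instances * hours)


# Proteomics search: 20 Windows nodes, 7 days, $1400.
proteomics_rate = cost_per_instance_hour(1400, instances=20, hours=7 * 24)

# BLAT run: 20 large-memory Ubuntu instances, ~72 hours, $424.
blat_rate = cost_per_instance_hour(424, instances=20, hours=72)

# Compute cost relative to the quoted $30k cost of one experiment.
compute_share = 424 / 30_000

print(f"proteomics: ${proteomics_rate:.2f} per instance-hour")
print(f"BLAT:       ${blat_rate:.2f} per instance-hour")
print(f"BLAT compute is {compute_share:.1%} of a $30k experiment")
```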
q: how much programming to get it ready to go on AWS?
a: about 8 hours with a somewhat experienced programmer – a very experienced one could do it in 1-2 hours – programming is done in Ruby
PAAS – aggregating clouds – genome-wide screen for modifiers of the circadian clock, 300 found (Zhang et al., Cell, 2009). Gene-centric data integration – go to each data site, search for your gene, and then compile. ID/synonym resolution is hard. BioGPS – federated search of these gene sources – URL-based scheme, extensible. Puts results from different sources in boxes on BioGPS. Has a catalog search so you can see if you can buy from Invitrogen (sponsor, thank you!) and others. (http://biogps.gnf.org/circadian)
PAAS use case – publishing in the cloud – PLoS Currents: Influenza. PMIDs used for references, Google Knol to write, moderators decide suitable/unsuitable – not review. PLoS will consider expanded versions in their pubs. ~52 publications so far. Example has been viewed 7k times.
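To make the "URL-based scheme" idea concrete, here is a minimal sketch of a federated lookup: resolve a gene name to one canonical ID, then fill that ID into a URL template per source. Everything in it is hypothetical (the synonym table, the gene ID, the example.org URLs); it is not the actual BioGPS plugin format.

```python
from urllib.parse import quote

# Hypothetical synonym table: lowercase alias -> canonical gene ID.
# Real ID/synonym resolution is far messier than a dictionary lookup.
SYNONYMS = {
    "clock": "GENE0001",
    "clk": "GENE0001",
}

# Hypothetical data sources, each described only by a URL template.
# Adding a source means adding one template, which is what makes it extensible.
SOURCES = {
    "source_a": "https://example.org/a/gene/{gene_id}",
    "source_b": "https://example.org/b/search?id={gene_id}",
}


def resolve(symbol):
    """Map a gene symbol or alias to its canonical ID, or None if unknown."""
    return SYNONYMS.get(symbol.strip().lower())


def federated_urls(symbol):
    """Build one URL per source for the gene named by `symbol`."""
    gene_id = resolve(symbol)
    if gene_id is None:
        raise KeyError(f"unknown gene symbol: {symbol!r}")
    return {
        name: template.format(gene_id=quote(gene_id, safe=""))
        for name, template in SOURCES.items()
    }


for name, url in federated_urls("Clk").items():
    print(name, url)
```

Each URL could then be loaded into its own box on a results page, with no per-source code beyond the template.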
q: biobase – only mammalian?
a: yes, but code is available (.NET) so you could customize
q: small vs. large institutions – does this help people who are under resourced for equipment
a: with this we can give you the algorithm and then you could run it on the same service – so this is different from just sharing algorithms
q: writing grants etc. how does that go with cloud services?
a: capital costs (buying servers) typically come out of a different bucket, so this might complicate things. Some in the room have had success, no problems. Some have met skepticism. In the UK they’re very concerned about the PATRIOT Act provisions.
q: do you need an AWS specialist
a: they had someone with an MS in bioinformatics and a BS in bio – picked up how to do the first in a week, second done in 8 hours. Could probably replace that person fairly easily
q: concern with using a free service online – stability/preservation of data
a: after you set up an account, test whether you can get your data back out; if it’s super important, host on your own site
q: using these in teaching?
a: using Wave, using PBwiki, using Blackboard, using the OpenWetWare wiki, (I use OneNote), also Google Docs (they tried wikis first, didn’t fly, Google Docs works well for them)
q: proportion of work done in cloud vs. local computing resources
q: boundaries of the institution
a: now either academic or industrial – so this will probably allow independent investigators again: rent some lab time, rent some computing time, and then prototype something. Can also use publicly available data – always lots more things to find/use it for than just what the originators foresaw