
These days, DNA sequencing happens in one of three ways.

In the early days of DNA sequencing (the 1980s), labs prepared their own samples, sequenced those samples, and analyzed their own results. Some labs still do this.

Then, in the 1990s, genome centers came along. Genome centers are like giant factories that manufacture sequence data. They have buildings, dedicated staff, and professional bioinformaticians who write programs and work with other factory members to get the data entered, analyzed, and shipped out to the databases. (You can learn more about this and go on a virtual tour in this nice video from Washington University.)

At the same time, universities expanded their core service laboratories, and these labs began offering a greater number of sequencing services. Today, much of the sequencing done outside genome centers happens in core labs. Scientists obtain samples and send them to the core labs. The core lab staff prepare the samples, carry out the sequencing reactions, and deliver data to their customers.

This system worked fine until the Next Generation DNA sequencing (NGS) instruments came along.

The January 2010 issue of Nature Biotechnology has two articles (1, 2) that address the role cloud computing can play in helping smaller laboratories cope with the large volumes of data produced by NGS.

As noted in the editorial (2):

Next-generation sequencers produce a prodigious stream of data. A single Illumina instrument, for example, can generate up to 90 billion bases per run. This represents terabytes of raw image data that require at a minimum 4 GB of RAM and 750 GB of local storage capacity to carry out the data handling and analysis.


Whereas genome centers are set up to deal with such gargantuan files, most academic laboratories are in a completely different situation. They have no large central computing pool and data storage capacity. They are more likely to generate data in an ad hoc manner, rather than in a steady stream amenable to an automated data management pipeline. And they often lack sequencing specialists and support staff working under the same roof who can create software tailored to their needs and solve computational problems.
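To put the numbers in that quote in perspective, here is a rough back-of-the-envelope sketch in Python. Only the 90 billion bases figure comes from the editorial. The two bytes per base (one for the base call, one for its quality score) and the 100 Mbit/s upload speed are my own assumptions for illustration, and the function names are made up for this example.

```python
# Back-of-envelope estimate. Only BASES_PER_RUN comes from the editorial;
# the bytes-per-base and link-speed figures are illustrative assumptions.
BASES_PER_RUN = 90e9  # "up to 90 billion bases per run"


def fastq_size_gb(bases, bytes_per_base=2.0):
    """Approximate size of the base calls plus quality scores, in GB.

    Assumes about one byte per base call and one byte per quality score.
    Read headers and line breaks are ignored, and 1 GB = 10**9 bytes.
    """
    return bases * bytes_per_base / 1e9


def transfer_hours(size_gb, megabits_per_second):
    """Hours needed to move size_gb over a link of the given speed."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600


size = fastq_size_gb(BASES_PER_RUN)
print(f"~{size:.0f} GB of sequence and quality data per run")
print(f"~{transfer_hours(size, 100):.1f} hours to upload at 100 Mbit/s")
```

Under those assumptions, a single run works out to roughly 180 GB of sequence and quality data, and about four hours of uploading on a dedicated 100 Mbit/s link. That is before counting the raw image data, which is far larger.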

Cloud computing is one answer to that problem.

Although the article (1) places a strong emphasis on security concerns related to cloud computing, it does a good job of describing how Amazon has worked to make its cloud comply with the Health Insurance Portability and Accountability Act (HIPAA). The article also presents a table of cloud service providers.

Interestingly, Geospiza is the only company in the table that offers a software system for handling laboratory information management (LIMS) needs and analyzing Next Generation DNA sequencing data. All the others provide parts of the cloud infrastructure.

I can’t offer an unbiased opinion because I’ve worked at Geospiza, but I can attest that the cloud works well. I used an early version of the system last spring when I was writing an article for Current Protocols in Bioinformatics (3). You can even take a look at some of the results in Geospiza’s data center. The information for logging in is on that page.

I did the analysis by getting both Illumina and ABI SOLiD data sets from the NCBI. I used Geospiza’s web interface to upload the data somewhere in the cloud, selected an alignment algorithm and a reference data set, and waited a few hours for the analysis to complete. It was pretty straightforward. I didn’t need to get a new electrical system or new hardware or even write any programs.

I can hardly wait to try out the new stuff for looking at allele-specific expression, SNPs, and mapping splice junctions (see a picture).

1. Sansom, C. 2010. Up in a cloud. Nature Biotechnology 28, 13-15.

2. Gathering clouds and a sequencing storm. 2010. Nature Biotechnology 28, 1.

3. Porter, S., Olsen, N., and T. Smith. 2009. Analyzing Gene Expression Data from Microarray and Next‐Generation DNA Sequencing Transcriptome Profiling Assays Using GeneSifter Analysis Edition. Current Protocols in Bioinformatics. DOI: 10.1002/0471250953.bi0714s27