David Dooling from PolITiGenomics has put together a handy little table for genomics nerds like me: statistics on the output of the various iterations of the three major competing second-generation DNA sequencing platforms (Roche's 454, Illumina's Solexa/Genome Analyzer and ABI's SOLiD).
It's a little inscrutable for non-genomicists, but it helps to provide some insight into the sheer scale of the DNA sequence data currently being produced by large-scale sequencing facilities. A single Illumina GA II machine, for instance, churns out at least 8 gigabases of sequence (that's almost three human genome equivalents) every week. Now consider that the Sanger Institute as of this week has 37 Illumina machines, all running hot pretty much 24/7 (well, most of the time), and you have some sense of the quantity of sequence currently being generated - and of the informatics infrastructure required to store, manage and process that volume of data.
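For anyone who wants to check the back-of-envelope numbers, here's a quick sketch in Python (the 3 Gb human genome size is an assumed round figure; the per-machine output and machine count come from the figures above):

```python
# Back-of-envelope: weekly sequence output at a large genome centre.
# The 3 Gb haploid human genome size is a rounded assumption.
GIGABASES_PER_MACHINE_PER_WEEK = 8      # Illumina GA II, lower bound
HUMAN_GENOME_GB = 3.0                   # haploid human genome, gigabases
N_MACHINES = 37                         # Sanger Institute's current fleet

genome_equivalents = GIGABASES_PER_MACHINE_PER_WEEK / HUMAN_GENOME_GB
total_gb_per_week = GIGABASES_PER_MACHINE_PER_WEEK * N_MACHINES

print(f"One machine: ~{genome_equivalents:.1f} human genome equivalents/week")
print(f"Whole fleet: ~{total_gb_per_week} gigabases/week "
      f"(~{total_gb_per_week / HUMAN_GENOME_GB:.0f} genome equivalents)")
```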
So that's 8 billion 2-bit numbers (4 bases, right?) per week. That is 16 billion bits, or 2 gigabytes of data.
Quadruple that for mirroring and overhead, then add 2 more copies for backup and analysis, and you get roughly 2 GB per day of data per machine. Multiply by 40 machines and you've got 80 GB a day, or about two and a half terabytes per month. Getting somewhat serious, but not that expensive. With appropriate compression I'm sure you could get 4:1 on this or better, so call it about 10 TB a year. Not much nowadays. $20K should cover the up-front costs, including a tape backup. Admin costs should dominate your budget; the storage would be a pittance compared to the cost of the analyzers.
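Spelling that out in Python (the copy count and the 4:1 compression ratio are just my guesses, and I've rounded generously above):

```python
# Rough storage estimate, following the 2-bits-per-base reasoning above.
# Copy counts and the 4:1 compression ratio are guesses, not measured figures.
BASES_PER_WEEK = 8e9                      # 8 gigabases per machine per week
BITS_PER_BASE = 2                         # 4 possible bases -> 2 bits each
COPIES = 6                                # mirrored (x4) + backup + analysis
N_MACHINES = 40
COMPRESSION_RATIO = 4                     # assumed 4:1

raw_gb_per_week = BASES_PER_WEEK * BITS_PER_BASE / 8 / 1e9   # ~2 GB/machine
stored_gb_per_day = raw_gb_per_week * COPIES / 7             # ~1.7 GB/day/machine
fleet_gb_per_day = stored_gb_per_day * N_MACHINES            # ~70 GB/day
fleet_tb_per_year = fleet_gb_per_day * 365 / 1000            # ~25 TB/year
compressed_tb_per_year = fleet_tb_per_year / COMPRESSION_RATIO

print(f"Per machine, raw sequence: ~{raw_gb_per_week:.1f} GB/week")
print(f"Fleet, all copies: ~{fleet_gb_per_day:.0f} GB/day, "
      f"~{fleet_tb_per_year:.0f} TB/year")
print(f"With {COMPRESSION_RATIO}:1 compression: ~{compressed_tb_per_year:.0f} TB/year")
```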
Dooling got me drooling!
Hi Mark,
Well, it's not quite that simple.
The initial data generated by the machines come in the form of enormous image files, taking up on the order of 1-1.5 TB per machine per week. Processing those data into useful sequence files takes some fairly hefty computing; and even then you don't just store the raw sequence (as in your calculations) but also a whole bunch of metadata, like quality scores, which are important for downstream analysis. According to David's figures the final volume of data submitted for storage is on the order of 100 GB per machine per week (albeit presumably without compression).
Multiply that by 40 machines over a year and you're looking at around 200 TB of data for storage, plus the computational power to chew through the primary analysis of ~50 TB of image data every week, and then to handle downstream processing (assembly, identification of genetic variants, and so on).
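For comparison with your calculation, here's the same sort of sketch using those figures (the 1.25 TB midpoint for the image data is an assumption; the rest follows David's numbers):

```python
# The same estimate using the figures above: raw image data to process,
# plus sequence + metadata for long-term storage, rather than packed bases alone.
IMAGE_TB_PER_MACHINE_PER_WEEK = 1.25      # assumed midpoint of the 1-1.5 TB range
STORED_GB_PER_MACHINE_PER_WEEK = 100      # sequence plus quality scores etc.
N_MACHINES = 40
WEEKS_PER_YEAR = 52

image_tb_per_week = IMAGE_TB_PER_MACHINE_PER_WEEK * N_MACHINES       # ~50 TB
stored_tb_per_year = (STORED_GB_PER_MACHINE_PER_WEEK * N_MACHINES
                      * WEEKS_PER_YEAR / 1000)                        # ~208 TB

print(f"Image data to crunch: ~{image_tb_per_week:.0f} TB/week")
print(f"Long-term storage: ~{stored_tb_per_year:.0f} TB/year")
```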
And of course this is just the beginning - in a year or so the current platforms will be churning out substantially more data (with tweaking and upgrades), and there will be new platforms with even more laughably high output.
I'm still new to next-gen so I can't really talk much about the details, but I gather from the people I'm working with that handling the informatics side is now posing a far greater challenge than actually generating the data (which is itself non-trivial).
Is there an FTP site where one can download sample next-gen sequencing data? Or is it too big? Maybe you could send somebody a 1 TB hard disk and they send it back.
dogface - you can play around with the data in the Short Read Archive, but the downloads do tend to be pretty large.