David Dooling from PolITiGenomics has put together a handy little table for genomics nerds like me: statistics on the output of the various iterations of the three major competing second-generation DNA sequencing platforms (Roche's 454, Illumina's Solexa/Genome Analyzer and ABI's SOLiD).
It's a little inscrutable for non-genomicists, but it helps to provide some insight into the sheer scale of the DNA sequence data currently being produced by large-scale sequencing facilities. A single Illumina GA II machine, for instance, churns out at least 8 gigabases of sequence (that's almost three human genome equivalents) every week. Now consider that the Sanger Institute as of this week has 37 Illumina machines, all running hot pretty much 24/7 (well, most of the time), and you have some sense of the quantity of sequence currently being generated - and of the informatics infrastructure required to store, manage and process that volume of data.
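For anyone who wants to check the back-of-envelope numbers, here's a quick sketch in Python (the 3 Gb human genome size is an assumed round figure; the per-machine output and machine count come from the figures above):

```python
# Back-of-envelope: weekly sequence output at a large genome centre.
# The 3 Gb haploid human genome size is a rounded assumption.
GIGABASES_PER_MACHINE_PER_WEEK = 8      # Illumina GA II, lower bound
HUMAN_GENOME_GB = 3.0                   # haploid human genome, gigabases
N_MACHINES = 37                         # Sanger Institute's current fleet

genome_equivalents = GIGABASES_PER_MACHINE_PER_WEEK / HUMAN_GENOME_GB
total_gb_per_week = GIGABASES_PER_MACHINE_PER_WEEK * N_MACHINES

print(f"One machine: ~{genome_equivalents:.1f} human genome equivalents/week")
print(f"Whole fleet: ~{total_gb_per_week} gigabases/week "
      f"(~{total_gb_per_week / HUMAN_GENOME_GB:.0f} genome equivalents)")
```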
So that's 8 billion 2-bit numbers (4 bases, right?) per week. That is 16 billion bits, or 2 gigabytes of data.
Quadruple that for mirroring and overhead, then add 2 more copies for backup and analysis, and you get roughly 2 GB per day of data per machine. Multiply by 40 machines and you've got 80 GB a day, or about two and a half terabytes per month. Getting somewhat serious, but not that expensive. With appropriate compression I'm sure you could get 4:1 on this or better, so call it about 10 TB a year. Not much nowadays. $20K should cover the up-front costs, including a tape backup. Admin costs should dominate your budget; the storage would be a pittance compared to the cost of the analyzers.
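Spelling that out in Python (the copy count and the 4:1 compression ratio are just my guesses, and I've rounded generously above):

```python
# Rough storage estimate, following the 2-bits-per-base reasoning above.
# Copy counts and the 4:1 compression ratio are guesses, not measured figures.
BASES_PER_WEEK = 8e9                      # 8 gigabases per machine per week
BITS_PER_BASE = 2                         # 4 possible bases -> 2 bits each
COPIES = 6                                # mirrored (x4) + backup + analysis
N_MACHINES = 40
COMPRESSION_RATIO = 4                     # assumed 4:1

raw_gb_per_week = BASES_PER_WEEK * BITS_PER_BASE / 8 / 1e9   # ~2 GB/machine
stored_gb_per_day = raw_gb_per_week * COPIES / 7             # ~1.7 GB/day/machine
fleet_gb_per_day = stored_gb_per_day * N_MACHINES            # ~70 GB/day
fleet_tb_per_year = fleet_gb_per_day * 365 / 1000            # ~25 TB/year
compressed_tb_per_year = fleet_tb_per_year / COMPRESSION_RATIO

print(f"Per machine, raw sequence: ~{raw_gb_per_week:.1f} GB/week")
print(f"Fleet, all copies: ~{fleet_gb_per_day:.0f} GB/day, "
      f"~{fleet_tb_per_year:.0f} TB/year")
print(f"With {COMPRESSION_RATIO}:1 compression: ~{compressed_tb_per_year:.0f} TB/year")
```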
Dooling got me drooling!
Hi Mark,
Well, it's not quite that simple.
The initial data generated by the machines come in the form of enormous image files, taking up on the order of 1-1.5 TB per machine per week. Processing those data into useful sequence files takes some fairly hefty computing; and even then you don't just store the raw sequence (as in your calculations) but also a whole bunch of metadata, like quality scores, which are important for downstream analysis. According to David's figures the final volume of data submitted for storage is on the order of 100 GB per machine per week (albeit presumably without compression).
Multiply that by 40 machines over a year and you're looking at around 200 TB of data for storage, plus the computational power to chew through the primary analysis of ~50 TB of image data every week, and then to handle downstream processing (assembly, identification of genetic variants, and so on).
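For comparison with your calculation, here's the same sort of sketch using those figures (the 1.25 TB midpoint for the image data is an assumption; the rest follows David's numbers):

```python
# The same estimate using the figures above: raw image data to process,
# plus sequence + metadata for long-term storage, rather than packed bases alone.
IMAGE_TB_PER_MACHINE_PER_WEEK = 1.25      # assumed midpoint of the 1-1.5 TB range
STORED_GB_PER_MACHINE_PER_WEEK = 100      # sequence plus quality scores etc.
N_MACHINES = 40
WEEKS_PER_YEAR = 52

image_tb_per_week = IMAGE_TB_PER_MACHINE_PER_WEEK * N_MACHINES       # ~50 TB
stored_tb_per_year = (STORED_GB_PER_MACHINE_PER_WEEK * N_MACHINES
                      * WEEKS_PER_YEAR / 1000)                        # ~208 TB

print(f"Image data to crunch: ~{image_tb_per_week:.0f} TB/week")
print(f"Long-term storage: ~{stored_tb_per_year:.0f} TB/year")
```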
And of course this is just the beginning - in a year or so the current platforms will be churning out substantially more data (with tweaking and upgrades), and there will be new platforms with even more laughably high output.
I'm still new to next-gen so I can't really talk much about the details, but I gather from the people I'm working with that handling the informatics side is now posing a far greater challenge than actually generating the data (which is itself non-trivial).
Is there an FTP site where one can download sample next-gen sequencing data? Or is it too big? Maybe you could send somebody a 1 TB hard disk and they send it back.
dogface - you can play around with the data in the Short Read Archive, but the downloads do tend to be pretty large.