My Genius Prediction About the Short Read Archive: TOO MANY DATAZ!

The National Center for Biotechnology Information (NCBI) recently announced that it will shut down the Short Read Archive (SRA). The SRA stored the raw and semi-processed sequence data from genomics projects, so researchers could go back and examine the underlying reads for themselves. The reason given by NCBI is "budget constraints." While I'm saddened by this, I'm not surprised, since the volume of data produced by a single genome center is tremendous, to the point where both the storage and the data upload are prohibitive:

When several centers were collaborating to test new sequencing technologies, the datasets were so large that they actually shipped hard drives to each other to compare results. Well, that's what might have to happen just to upload the data:

If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail, the way that the Protein Data Bank once accepted atomic structure submissions on tape and floppy disk. In fact, a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping around 3-terabyte disk arrays.
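To put a rough number on why "just upload it" doesn't work, here's a back-of-envelope sketch; the dataset size and the uplink speed are made-up illustrative figures, not numbers from NCBI or any genome center:

```python
# Rough back-of-envelope comparison: uploading a sequencing run vs. shipping a disk.
# The dataset size and uplink speed below are illustrative assumptions only.

def upload_days(dataset_tb: float, uplink_mbps: float) -> float:
    """Days needed to push `dataset_tb` terabytes over a sustained `uplink_mbps` link."""
    bits = dataset_tb * 1e12 * 8          # terabytes -> bits (decimal TB)
    seconds = bits / (uplink_mbps * 1e6)  # bits / (bits per second)
    return seconds / 86400                # seconds -> days

if __name__ == "__main__":
    dataset_tb = 10      # hypothetical output from one center's batch of runs
    uplink_mbps = 100    # hypothetical sustained upload speed to a remote archive
    print(f"Upload time: {upload_days(dataset_tb, uplink_mbps):.1f} days")
    print("Overnight courier with a disk array: ~1 day, regardless of size")
```

At those assumed numbers the upload takes more than a week of sustained transfer, while the courier takes a day no matter how big the box is; hence the hard drives in the mail.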

The other possibility is that the raw data, or even the 'first-step' processed data, might not be made publicly available anymore--think of this as the physics model:

At some future point it will become simply unfeasible to store all raw sequencing reads in a central archive or even in local storage. Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest.
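As a toy sketch of that filter-and-discard model (just an illustration, not anything NCBI or a genome center actually runs), here's what it might look like to stream a FASTQ file, keep only the reads whose average quality clears a threshold, and throw away everything else; the file names and the cutoff are hypothetical:

```python
# Toy illustration of the "filter and discard" model: stream a FASTQ file,
# keep only reads whose mean Phred quality clears a threshold, drop the rest.
# File names and the quality cutoff are hypothetical, for illustration only.

def mean_phred(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Sanger/Illumina 1.8+, offset 33)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores) if scores else 0.0

def filter_fastq(in_path: str, out_path: str, min_mean_q: float = 30.0) -> tuple[int, int]:
    """Write reads with mean quality >= min_mean_q to out_path; return (kept, seen)."""
    kept = seen = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # header, sequence, '+', quality
            if not record[0]:
                break  # end of file
            seen += 1
            if mean_phred(record[3].rstrip("\n")) >= min_mean_q:
                fout.write("".join(record))
                kept += 1
    return kept, seen

if __name__ == "__main__":
    kept, seen = filter_fastq("run.fastq", "run.filtered.fastq")
    print(f"Kept {kept} of {seen} reads; the rest would simply be discarded")
```

The point isn't this particular filter; it's that only the small 'informative' slice survives, and the bulk of the raw reads never makes it into an archive at all.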

As genomics and other data-intensive disciplines of biology move towards cloud computing (and I think it will definitely happen), it will be interesting to see how NIH funding shifts.

Well, now we know how one part of that funding will shift.


While commonly called the "Short Read Archive," it is, in fact, the "Sequence Read Archive." I continually fixed that on several of my sequencing white papers, only to have other people change it back on me :p

What's interesting to me is that the sequencing center contract included (strict?) language about deposition of sequence to the SRA and Trace archives. So I guess now all that sequence data just gets deleted or whatever when the assemblies are submitted?

And I wonder how long GEO will be able to take RNA-seq data sets before they have data issues. I mean, I know it's not a big issue now compared to the Illumina and SOLiD sequencing that places are pumping out, but I imagine the rate of experiments being done will keep increasing.

This is such a pity. Guess future bioinformatics students are screwed, never mind those who want to check up on the data themselves.

Presumably those running the sequencing experiments will keep their data for a while... but it's no help if, two years down the line, someone wants to improve the assembly using better software or new partial sequences. Guess they'll just have to resequence from scratch.

Perhaps google could be convinced to take over the archive?

By BiochemStudent (not verified) on 15 Mar 2011