My Genius Prediction About the Short Read Archive: TOO MANY DATAZ!

The National Center for Biotechnology Information (NCBI) recently announced that it will shut down the Short Read Archive (SRA). The SRA stores the semi-processed data from genomics projects, so researchers can examine the underlying reads for themselves. The reason NCBI gives is "budget constraints." While I'm saddened by this, I'm not surprised, since the volume of data produced by a single genome center is tremendous, to the point where storage and data upload are prohibitive:

When several centers were collaborating to test new sequencing technologies, the data were so large that they actually shipped hard drives to each other to compare results. Well, that's what might have to happen to upload data:

If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail, the way the Protein Data Bank once accepted atomic structure submissions on tape and floppy disk. In fact, a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping around 3-terabyte disk arrays.
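To put some numbers on "prohibitive," here's a back-of-envelope sketch in Python. The figures (a 3-terabyte disk array like the ones mentioned above, and a 100 Mbit/s sustained upload link) are illustrative assumptions, not measurements:

```python
# Back-of-envelope comparison: uploading a large genomics dataset vs. shipping disks.
# All numbers below are illustrative assumptions, not measured values.

dataset_tb = 3.0    # one 3-terabyte disk array, as in the Google example above
link_mbps = 100.0   # assumed sustained upload bandwidth, in megabits per second

dataset_bits = dataset_tb * 1e12 * 8                # terabytes -> bits
upload_seconds = dataset_bits / (link_mbps * 1e6)   # bits / (bits per second)
upload_days = upload_seconds / 86400

print(f"Uploading {dataset_tb} TB at {link_mbps} Mbit/s takes ~{upload_days:.1f} days")
```

At those assumed rates, a single array ties up the link for nearly three days, and a genome center produces many such arrays. Overnight shipping starts to look like the higher-bandwidth channel.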

The other possibility is that the raw data, or even 'first-step' processed data, might not be made publicly available anymore. Think of this as the physics model:

At some future point, it will become simply infeasible to store all raw sequencing reads in a central archive, or even in local storage. Genome biologists will have to start acting like the high-energy physicists, who filter the huge datasets coming out of their detectors for a tiny number of informative events and then discard the rest.
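For a concrete sense of what that filter-and-discard model could look like, here's a minimal Python sketch: stream a FASTQ file, keep only reads whose mean base quality clears a cutoff, and throw everything else away. The file names and the cutoff are hypothetical, chosen purely for illustration:

```python
# Minimal sketch of the physics-style "keep the informative events, discard the
# rest" model applied to sequencing reads. Assumes 4-line FASTQ records with
# Phred+33 quality encoding; the threshold and file names are hypothetical.

MIN_MEAN_QUALITY = 30  # hypothetical cutoff for calling a read "informative"

def mean_quality(qual_line: str) -> float:
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    scores = [ord(c) - 33 for c in qual_line.strip()]
    return sum(scores) / len(scores)

def filter_reads(in_path: str, out_path: str) -> None:
    """Stream a FASTQ file, write reads above the cutoff, discard the rest."""
    kept = total = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # @id, sequence, +, quality
            if not record[0]:
                break  # end of file
            total += 1
            if mean_quality(record[3]) >= MIN_MEAN_QUALITY:
                fout.write("".join(record))
                kept += 1
    print(f"kept {kept} of {total} reads; the rest are gone for good")

filter_reads("run.fastq", "run.filtered.fastq")  # hypothetical input/output paths
```

The uncomfortable part, as the quote notes, is the last step: once the raw reads are discarded, no one can rerun the filter with better criteria later.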

As genomics and other data-intensive disciplines of biology move toward cloud computing (and I think that move will definitely happen), it will be interesting to see how NIH funding shifts.

Well, now we know how one part of that funding will shift.


Comments

While it's commonly called the "short read archive," it is, in fact, the "Sequence Read Archive." I continually fixed that in several of my sequencing white papers, only to have other people change it back on me :p

What's interesting to me is that the sequencing center contract included (strict?) language about depositing sequence data in the SRA and Trace archives. So I guess now all that sequence data just gets deleted or whatever once the assemblies are submitted?

And I wonder how long GEO will be able to take RNA-seq datasets before it runs into data problems of its own. I mean, I know it's not a big issue now compared to the Illumina and SOLiD sequencing that places are pumping out, but I imagine the rate of experiments being done will keep increasing.

This is such a pity. Guess future bioinformatics students are screwed, never mind those who want to check up on the data themselves.

Presumably those running the sequencing experiments will keep their data for a while... but it's no help if, two years down the line, someone wants to improve the assembly using better software or new partial sequences. Guess they'll just have to resequence from scratch.

Perhaps Google could be convinced to take over the archive?

By BiochemStudent (not verified) on 15 Mar 2011