The Genomics Revolution Will Be Standardized...

...or it won't be much of a revolution. Yesterday, I discussed the difference between a DNA sequencing revolution and a genomics revolution, and how we have a long way to go before there's a genome sequencer in every pot (or something). But let's say, for argument's sake, these problems are overcome--and I think they will be.

Then the real trouble begins.

The big issue is standardization--without it, we will have a genomic Tower of Babel:

"There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data," says Canadian cancer research John McPherson, feeling the pain of NGS [next generation sequencing] neophytes left to negotiate "a bewildering maze of base calling, alignment, assembly, and analysis tools with often incomplete documentation and no idea how to compare and validate their outputs. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 analysis price tag."

Without some sort of standardization of genome assembly and annotation (gene identification) methods, we're going to have real problems. In human genomes, how will we know whether a SNP (a change at a single nucleotide, the smallest subunit of DNA) is a real variant or just an assembly artifact? It's even worse for microbial genomes.

Within a bacterial species, there is a 'core' genome--a set of genes that all strains share. But a lot of the interesting biology happens in the auxiliary genome--genes that are found only in some strains. If a gene appears to be absent, is it really absent? Or is that just an artifact of 'bad' assembly or gene calling?
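To make that concrete, here's a minimal sketch in Python--the strain and gene names are made up, and real gene lists would come from an assembly and annotation pipeline--of how the core and auxiliary genomes fall out of simple presence/absence comparisons, and why an 'absent' gene is ambiguous when those pipelines aren't standardized:

# Hypothetical gene content for three strains of one species.
# In practice these sets come from assembly plus gene calling, so a gene
# "missing" below might be real biology or just a pipeline artifact.
strains = {
    "strain_A": {"geneA", "geneB", "geneC", "toxinX"},
    "strain_B": {"geneA", "geneB", "geneC"},
    "strain_C": {"geneA", "geneB", "geneC", "toxinX", "effluxPump1"},
}

core = set.intersection(*strains.values())       # genes shared by every strain
accessory = set.union(*strains.values()) - core  # the auxiliary genome

print("core genome:", sorted(core))
print("auxiliary genome:", sorted(accessory))
for name, genes in strains.items():
    # Is a gene listed here truly absent from this strain, or did the
    # assembler or gene caller simply fail to find it? Without standardized
    # methods, you can't tell.
    print(name, "appears to lack:", sorted(accessory - genes))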

We could get around this by going to the raw (or semi-processed) data. Currently, all NIH-funded projects are required to upload raw data to NCBI. However, there will soon be far too much raw data for NCBI to store. Then what?

While this might not appear to be a problem if genomics is simply used as a diagnostic, I would argue it is. In the field of antibiotic resistance, when a hospital lab determines what drugs can kill an infecting bacterium, that information typically is not shared--it's just diagnostic. However, states are increasingly requiring hospitals to report and share this information for surveillance purposes (which is an excellent thing to be doing*). If we generate a lot of genomic information (human or microbial) and it just sits in a file somewhere, it's not exactly fomenting revolution, is it? The data have to be standardized to be broadly used.

A related issue is metadata--the clinical and other non-genomic data attached to a sequence. Just telling me that a genome came from a human isn't very useful: I want to know something about that human. Was she sick or healthy? And so on. These metadata, too, will have to be standardized: I can't say one genome came from someone who was "sick," while you provide another genome from someone who had "inflammatory bowel disease." Worse, I can't say my patient had IBD, while yours had Crohn's disease. The data fields have to be standardized, so we're not comparing apples and oranges.
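As a rough sketch of what that means in practice--the field names and terms below are invented for illustration, not drawn from any real metadata standard--standardization comes down to agreeing on the fields and on a controlled vocabulary at a single level of granularity, so free text like "sick" gets rejected and "Crohn's disease" and "IBD" end up in the same comparable bucket:

# Toy metadata checker; fields and vocabulary are hypothetical examples only.
HEALTH_STATUS_TERMS = {"healthy", "diseased", "unknown"}

# Specific diagnoses roll up to one agreed-upon level of granularity, so my
# "inflammatory bowel disease" record and your "Crohn's disease" record
# land in the same comparable bucket.
DIAGNOSIS_ROLLUP = {
    "inflammatory bowel disease": "inflammatory bowel disease",
    "Crohn's disease": "inflammatory bowel disease",
    "ulcerative colitis": "inflammatory bowel disease",
}

def normalize(record):
    """Return (cleaned_record, errors) for one sample's metadata."""
    clean, errors = dict(record), []
    status = record.get("health_status")
    if status not in HEALTH_STATUS_TERMS:
        errors.append("health_status %r is not a controlled term" % status)
    diagnosis = record.get("diagnosis")
    if diagnosis is not None:
        if diagnosis in DIAGNOSIS_ROLLUP:
            clean["diagnosis"] = DIAGNOSIS_ROLLUP[diagnosis]
        else:
            errors.append("diagnosis %r is not in the vocabulary" % diagnosis)
    return clean, errors

# Free text fails; controlled terms pass and are mapped to comparable values.
print(normalize({"host": "human", "health_status": "sick"}))
print(normalize({"host": "human", "health_status": "diseased", "diagnosis": "Crohn's disease"}))

The point isn't these particular fields; it's that every submitter fills in the same fields, with the same allowed values.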

Without these two types of standardization, we won't have a genomic revolution, but genomic anarchy.

*Of course, states aren't willing to pay for it....

Update: Keith Robison has a very good post about the Ion Torrent technology.

"Standards are wonderful! There are so many to choose from!"

Although I guess this is a bit of a tangent, standardisation might be viewed as one solution to the reproducible research issue (excuse my pimping my article). The reference to "often incomplete documentation and no idea how to compare and validate their outputs" rings a bell!

Standardisation and reproducibility are two different issues, but are certainly related, and I think it may help to think of the two together. Just my idle 2c before getting a coffee on :-)

Good post - thanks for the thoughts.

Perhaps you might be interested in the work of the Genomic Standards Consortium (http://www.gensc.org) and their open access publication Standards in Genomic Sciences (soon to appear in PubMed Central).

I think you will find that there is already a large and growing community with similar concerns and interests. The GSC has been working on this topic since 2005 and has published minimal standards for genome sequences, metagenome sequences and environmental sequences. There are also well established standards for describing draft and finished genome sequences.

There is a natural standard for genomic data: the evolutionary history of the biosphere. The MasterCatalog, which we released a decade ago as a commercial product and which was purchased by a number of companies, adopted it:

Benner, S. A., Chamberlin, S. G., Liberles, D. A., Govindarajan, S., Knecht, L. (2000) Functional inferences from reconstructed evolutionary biology involving rectified databases. An evolutionarily grounded approach to functional genomics. Res. Microbiol. 151, 97-106.

LOL, I know from my own life and observations of others' that 'standardization' and 'reproducibility' are just a nice way of saying that most of the time we really haven't a clue as to what we're doing. Instead, we are relying on mountains of previous 'standardizations' and 'reproducibilities' with the expectation that these will allow our own assumptions to validate what we've been told to expect.

And when these assumptions are not validated...

Quick, get some Content Managers on this to help you define the relevant parameters and standardized terms. They are used to dealing with large volumes of information from multiple sources.