…or it won’t be much of a revolution. Yesterday, I discussed the difference between a DNA sequencing revolution and a genomics revolution, and how we have a long way to go before there’s a genome sequencer in every pot (or something). But let’s say, for argument’s sake, these problems are overcome–and I think they will be.
Then the real trouble begins.
The big issue is standardization–without it we will have a genomic Tower of Babel:
“There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data,” says Canadian cancer researcher John McPherson, feeling the pain of NGS [next generation sequencing] neophytes left to negotiate “a bewildering maze of base calling, alignment, assembly, and analysis tools with often incomplete documentation and no idea how to compare and validate their outputs. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 analysis price tag.”
Without some sort of standardization of genome assembly and annotation (gene identification) methods, we’re going to have real problems. In human genomes, is an apparent SNP (a change in a single nucleotide, the smallest subunit of DNA) real, or is it an artifact of assembly? It’s worse for microbial genomes.
Within a bacterial species, there is a ‘core’ genome–a set of genes that all strains share. But a lot of the interesting biology happens in the auxiliary genome–genes that are found only in some strains. If a gene is absent, is it really absent? Or is it just a result of either ‘bad’ assembly or gene calling?
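To make the core versus auxiliary distinction concrete, here is a minimal sketch in Python. It assumes each strain has already been reduced to a set of called gene names; the strain names, gene lists, and the `split_pan_genome` helper are all made up for illustration and do not come from any particular pipeline.

```python
# Hypothetical gene calls for three strains of one species.
# Strain names and gene lists are invented for illustration only.
strain_genes = {
    "strain_A": {"gyrA", "recA", "rpoB", "blaTEM"},
    "strain_B": {"gyrA", "recA", "rpoB"},
    "strain_C": {"gyrA", "recA", "rpoB", "blaTEM", "tetA"},
}


def split_pan_genome(strain_genes):
    """Return (core, auxiliary) gene sets from per-strain gene calls.

    core      = genes called in every strain
    auxiliary = genes called in at least one strain, but not all
    """
    gene_sets = list(strain_genes.values())
    if not gene_sets:
        return set(), set()
    pan = set().union(*gene_sets)         # every gene seen anywhere
    core = set.intersection(*gene_sets)   # genes seen in all strains
    return core, pan - core


core, auxiliary = split_pan_genome(strain_genes)
print("core:", sorted(core))
print("auxiliary:", sorted(auxiliary))
```

Note how fragile the split is: if a different assembler or gene caller misses `recA` in just one strain, that gene silently moves from the core set to the auxiliary set, which is exactly the "is it really absent?" problem.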
We could get around this by going to the raw (or semi-processed) data. Currently, all NIH-funded projects are required to upload raw data to NCBI. However, there will soon be far too much raw data for NCBI to store. Then what?
While this might not appear to be a problem if genomics is simply used as a diagnostic, I would argue it is. In the field of antibiotic resistance, when a hospital lab determines what drugs can kill an infecting bacterium, that information typically is not shared–it’s just diagnostic. However, states are increasingly requiring hospitals to report and share this information for surveillance purposes (which is an excellent thing to be doing*). If we generate a lot of genomic information (human or microbial) and it just sits in a file somewhere, it’s not exactly fomenting revolution, is it? The data have to be standardized to be broadly used.
A related issue is metadata–the clinical and other non-genomic data attached to a sequence. Just telling me that a genome came from a human isn’t very useful: I want to know something about that human. Was she sick or healthy, and so on. These metadata, too, will have to be standardized: I can’t say one genome came from someone who was “sick”, while you provide another genome from someone who had “inflammatory bowel disease.” Worse, I can’t say my patient had IBD, while yours had Crohn’s disease. The data fields have to be standardized, so we’re not comparing apples and oranges.
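As a rough sketch of what standardized metadata buys you, the Python below maps free-text diagnosis labels onto a tiny controlled vocabulary. The `SYNONYMS` and `PARENT` tables and both function names are hypothetical, invented here for illustration; a real project would presumably draw on an established disease ontology rather than a hand-written dictionary.

```python
# Hypothetical controlled vocabulary: free-text label -> standard term.
SYNONYMS = {
    "ibd": "inflammatory bowel disease",
    "inflammatory bowel disease": "inflammatory bowel disease",
    "crohn's disease": "Crohn's disease",
    "crohns disease": "Crohn's disease",
}

# Hypothetical hierarchy: standard term -> broader parent term.
PARENT = {"Crohn's disease": "inflammatory bowel disease"}


def normalize(label):
    """Map a free-text label to a standard term, or None if unrecognized."""
    key = label.strip().lower().replace("\u2019", "'")  # straighten curly apostrophes
    return SYNONYMS.get(key)


def same_or_broader(label_a, label_b):
    """True if both labels map to the same term, or one is the other's parent."""
    a, b = normalize(label_a), normalize(label_b)
    if a is None or b is None:
        return False  # a vague label like "sick" cannot be compared at all
    return a == b or PARENT.get(a) == b or PARENT.get(b) == a


print(normalize("IBD"))
print(normalize("sick"))
print(same_or_broader("IBD", "Crohn\u2019s disease"))
```

With a shared vocabulary, "IBD" and "inflammatory bowel disease" become the same record, Crohn’s disease can be recognized as a kind of IBD, and a label like "sick" is flagged as unusable instead of quietly ending up in the comparison.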
Without these two types of standardization, we won’t have a genomic revolution, but genomic anarchy.
*Of course, states aren’t willing to pay for it….
Update: Keith Robison has a very good post about the Ion Torrent technology.