I realize that the typical format for blogging is to find something that pisses you off and then rant about it, but I actually like the recent workshop report by NHGRI, "The Future of DNA Sequencing at the National Human Genome Research Institute." (pdf file) While I'll have more to say about the report overall, I liked the section about the Human Microbiome Project (the goal of the HMP is to use sequencing technologies to understand how the microbes that live on us and in us affect health and disease).
I was happy to see that NHGRI still thinks that it has a role in funding the HMP. It's never been clear to what extent various NIH institutes would continue to support microbiome research after the Jumpstart funding runs out in 2013.
I also like the statement of what NHGRI's microbiome activities should be:
- Analyze microbiomes of many more normal subjects than is now being considered for the Human Microbiome Project (HMP), in order to obtain a fuller appreciation for the range of microbial communities in the human, and how their composition relates to environmental and other factors.
- Include the sequencing of host genomes.
- Integrate microbiome information with other projects, for example 1000 Genomes or GTEx.
- Analyze the microbiome of model organisms, for example to enable experimental analysis.
- Use sequencing to attain a more fundamental understanding of microbe biology, for example microbial communities, gene transfer, and other fundamental biology.
The only thing I would add (and I'll discuss it in the future) is that there needs to be much more support for developing the bioinformatic and analytical tools needed to analyze these massive amounts of data. Most of the methods in use today were designed in an era when microbiome datasets were orders of magnitude smaller. They often don't scale up because they are 'N-squared' problems: as N, the number of sequences, increases, the number of calculations required grows as N x N (or faster). We simply don't have really good methods to handle and then analyze hundreds of millions of sequences, and without them, we won't be able to use the data as well as we could.
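To make the scaling concrete, here is a toy sketch (not code from any real pipeline) of a naive all-vs-all sequence comparison. The number of pairs it examines is N*(N-1)/2, so doubling the number of sequences roughly quadruples the work:

```python
from itertools import combinations

def hamming(a, b):
    """Naive per-pair distance: count of mismatched positions."""
    return sum(x != y for x, y in zip(a, b))

def all_vs_all(seqs):
    """All-vs-all comparison: examines N*(N-1)/2 pairs, i.e. O(N^2)
    in the number of sequences -- the scaling bottleneck at issue."""
    return {(i, j): hamming(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

seqs = ["ACGT", "ACGA", "TCGA", "ACGG"]
dists = all_vs_all(seqs)
print(len(dists))  # 4 sequences -> 6 pairs
```

At 4 sequences that is only 6 comparisons, but at a million sequences it is roughly 5 x 10^11, which is why methods designed for small datasets fall over on modern microbiome data.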
But, overall, this is very encouraging. I hope NHGRI listens.
I've been working on the N**2 problem for a couple of years (e.g., FastBLAST and FastTree), and I'm surprised that more people aren't worrying about it...
Something that I don't understand: for these very earliest tests, shouldn't the emphasis be on trying to discover the range of what is or might be present? Shouldn't the emphasis be on trying to find maximum diversity?
Instead of looking only at a large number of "normal" individuals, shouldn't abnormal individuals also be included?
An analogy: if you were trying to measure the biodiversity of land, you wouldn't want to look only at plots being farmed with herbicides (herbicide-treated farmland being analogous to the washed skin of humans in good health).
We know the diversity of "wild" land that has never been used for agriculture is much greater than land that has been clear-cut and planted in monoculture.