Earlier this week, I attended the International Human Microbiome Consortium Meeting (the human microbiome consists of the organisms that live on and in us). I’m not sure what to make of the whole microbiome initiative, but one thing is clear to me: this is being driven by the wrong group of scientists.
Instead of being directed by biologists (primarily medical biologists) who have devised a set of important questions and want to use the power of high-throughput genomics–including metagenomics, which sequences all of the DNA in a specimen: bacteria, viruses, fungi, protozoa, and, yes, human (which raises all sorts of bioethics questions)–the human microbiome effort is being driven in large part by the major genome centers.
Of course, the major centers need to be involved: they’re the only ones who have the sequencing capacity, along with the sequence data assembly and annotation (figuring out what the sequence encodes) capabilities. Likewise, these centers do have plenty of biologists who are quite competent at designing experiments. What centers often lack, depending on the area, are the experts who know precisely what questions need to be asked, and who have also figured out how to analyze the data.
This brings me to my first concern. Much of the initial focus is on really complex systems, such as the human gut, which contains hundreds of bacterial species (that’s before you get to the viruses and eukaryotes). Because there are so many different genomes, even with massive throughput, most genomes recovered will be fragmentary–very fragmentary. I’m not sure what that will tell us.
Second, we don’t have enough reference genomes–a recent estimate of the number of Streptococcus pneumoniae genomes needed to find ninety percent of the total ‘pan-genome’ of that single species was 142. There are going to be a lot of genes whose origins we won’t be able to figure out. I’m not really interested in the diversity of gyrase B protein–to a considerable extent, it’s the variable loci that will be interesting, and those will be the hardest to assign to the organisms they came from.
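To see why it takes so many genomes to capture a pan-genome, here’s a toy simulation–all numbers invented for illustration, not real S. pneumoniae figures. Each simulated genome carries a fixed core plus a random sample from a shared accessory-gene pool, and we track how many distinct genes the growing collection has seen:

```python
import random

# Toy pan-genome accumulation curve (hypothetical numbers):
# every genome shares CORE genes, plus PER_GENOME accessory genes
# drawn at random from a species-wide pool of ACCESSORY_POOL genes.
random.seed(42)

CORE = 1500            # genes present in every genome
ACCESSORY_POOL = 3000  # total accessory genes in the species
PER_GENOME = 300       # accessory genes carried by any one genome

seen = set(range(CORE))  # core genes, labelled 0..CORE-1
curve = []
for _ in range(142):  # the 142-genome estimate from the text
    accessory = random.sample(range(CORE, CORE + ACCESSORY_POOL), PER_GENOME)
    seen.update(accessory)
    curve.append(len(seen))

print(curve[0])   # 1800: one genome sees core + its own accessory genes
print(curve[-1])  # approaches, but never quite reaches, 4500
```

The curve rises quickly at first and then crawls: each new genome mostly re-samples accessory genes you’ve already seen, so the rare variable loci–the interesting ones–are exactly the ones that take the most sequencing to find.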
Third, these will be awful data to analyze. Here’s why. Ideally, you want as many replicates as possible (in this case, human volunteers), and only as much data as needed to answer the question. The last part doesn’t seem to make sense until you consider that when you conduct enough tests, some of them will yield false positives (one in twenty if you use a p = 0.05 cutoff) unless you correct for this*, and correcting means that the p-value required to call a result significant becomes really small (one in a million or worse). This will be the mother of all SNP hunts. Human genomics deals with this problem all the time; the latest technology can screen a single genome for 900,000 SNPs–and those studies have to enroll thousands of people (or combine smaller studies). With the human microbiome, we will have maybe 500 human volunteers, each of whom is associated with literally megabytes (if not gigabytes) of data.
The mother of all SNP hunts, indeed.
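The arithmetic behind that worry is simple enough to sketch. Using the 900,000-SNP figure from above (and the Bonferroni correction mentioned in the footnote):

```python
# Why many tests force a tiny significance threshold.
alpha = 0.05        # family-wise error rate we want
n_tests = 900_000   # SNPs screened, per the figure in the text

# With no correction, a p = 0.05 cutoff yields false positives
# at a rate of one in twenty:
expected_false_positives = alpha * n_tests
print(expected_false_positives)  # 45000.0 spurious "hits"

# The Bonferroni correction divides the cutoff by the number of tests:
bonferroni_threshold = alpha / n_tests
print(bonferroni_threshold)  # ~5.6e-08, far past "one in a million"
```

So with 900,000 tests and no correction you’d expect tens of thousands of spurious hits, and the corrected threshold is a p-value so small that only very large sample sizes–thousands of people, not 500–can reach it.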
So what to do? First, all of the genome centers, big and small, need to collaborate on a single system. Second, this system needs to be simple–not very many species–so we can begin to get some kind of replication in microbial communities that can be statistically assessed**. Third, specific questions need to be asked. We can’t just go searching for what is ‘out there.’ We need specific hypotheses so we don’t drown in all of the data. We’ll collect lots of ‘excess’ data whether we like it or not, so the signal-to-noise ratio needs to be maximized as much as possible.
OK, I’ll stop now, so I don’t lose my three remaining readers….
*The Bonferroni correction is one such method.
**As I’ve mentioned before, if we want to do high-throughput Latin binomials (species counting), the problem gets much simpler. It also doesn’t require metagenomics; I’m dealing with metagenomic approaches here.