Since I’m at a Human Microbiome Project meeting, and don’t have time to write, I thought this post from the archives of Mad Biologist was appropriate:
A while ago, I talked about some things biologists should learn, and the glaring omission was mathematical fluency. I bring this up because one of the things the Mad Biologist does is work on the Human Microbiome Project (between that, and fighting evil, we are very busy…). The part of the Human Microbiome Project (‘HMP’) that I’m involved with is a consortium of four sequencing centers and an informatics center, whose goal is to sequence the microbes associated with 18 different body sites from 250 people. And math is vital to what we do.
Before I get to the reason why math matters, there’s one more bit of information you need. One component of the project is to PCR amplify the 16S gene–a gene that is found in every bacterium–and use this gene as a ‘barcode’ to determine what organisms are there and at what frequencies. In other words, this is the molecular microbial ecology of the human body.
On to why math matters. A couple of weeks ago, I presented to the group some estimates of how many sequences we needed to observe every species in different body sites (Note: the technical term is ‘OTU’ or operational taxonomic unit, which is a set of sequences that are all similar to each other above some threshold; we use OTUs that are > 97% similar. For simplicity’s sake and reader familiarity, I will refer to OTUs as species). The four centers together have sequenced ~1000 sequences from every body site, so I could estimate how many species should appear in each body site based on the number (and distribution) of species we observed. I could also estimate how many new, unobserved species we should see as we add more sequences. Rubbing the previous two sentences together, it’s possible to figure out how many sequences we need to see each species once (yes, there are confidence intervals attached to these estimates…).
Since this part of the HMP is a collaborative effort, I sent around my figures and methods before our weekly phone conference. I had a sneaking suspicion that I would get a lot of questions, and I was right. Over and over, I was told that these estimates were ‘counter-intuitive.’ Why? Sites that had relatively few species required nearly as much sequencing as (and, in some cases, more than) those sites that had lots of species. Now, since I’m a probability theory dork (and trained as an ecologist), this wasn’t counterintuitive to me at all, but to my colleagues, the idea that a less complex–that is, less species-rich–community would require as much sequencing didn’t make sense.
But how deeply you need to sequence depends on the frequencies of the rarest species. To put this another way, if in a community of 20 species, the two rarest species each occur at a frequency of 1/10,000, to see them you will have to sequence much more than in a community of 100 species, all of which occur at equal frequencies (1/100). Or, seen from the other side, in the skewed twenty-species community, adding more sequence is much less likely to reveal new species than in the equally distributed community of 100 species.
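For readers who like to see the arithmetic, here is a rough Python sketch of the two communities described above. It is an illustration, not the method used for the actual HMP estimates: it assumes each sequence is an independent draw from fixed species frequencies, it assumes the 18 common species in the skewed community split the remaining frequency equally (the example doesn't specify this), and the function names `reads_to_detect` and `expected_unseen` are made up for this post.

```python
import math

def reads_to_detect(freq, confidence=0.95):
    """Smallest number of reads n such that a species at relative
    frequency `freq` shows up at least once with probability >= confidence,
    assuming independent draws: 1 - (1 - freq)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - freq))

def expected_unseen(freqs, n_reads):
    """Expected number of species with zero reads after n_reads draws."""
    return sum((1.0 - f) ** n_reads for f in freqs)

# Skewed community: 20 species, the two rarest at 1/10,000 each.
# Assumption: the other 18 species share the remaining frequency equally.
rare = 1 / 10_000
skewed = [rare, rare] + [(1 - 2 * rare) / 18] * 18

# Even community: 100 species, each at 1/100.
even = [1 / 100] * 100

print("reads to see one 1/10,000 species (95%):", reads_to_detect(rare))
print("reads to see one 1/100 species (95%):   ", reads_to_detect(1 / 100))

for n in (1_000, 10_000, 30_000):
    print(n, "reads -> expected unseen species:",
          "skewed =", round(expected_unseen(skewed, n), 3),
          "even =", round(expected_unseen(even, n), 3))
```

Under these assumptions, a single species at 1/10,000 needs on the order of 30,000 reads to be seen with 95% probability, against roughly 300 reads for a species at 1/100. At 1,000 reads, the twenty-species community is still expected to be missing most of its two rare species, while the hundred-species community is expected to be essentially complete.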
Now, my colleagues are very smart people, and they got it once I explained it to them. But this example demonstrates why mathematics as a way of thinking is vital for biologists.
Probability theory shouldn’t be counterintuitive.