'Counterintuition', the Human Microbiome, and Why Fluency in Math Matters

Since I'm at a Human Microbiome Project meeting, and don't have time to write, I thought this post from the archives of Mad Biologist was appropriate:

A while ago, I talked about some things biologists should learn, and the glaring omission was mathematical fluency. I bring this up because one of the things the Mad Biologist does is work on the Human Microbiome Project (between that, and fighting evil, we are very busy...). The part of the Human Microbiome Project ('HMP') that I'm involved with is a consortium of four sequencing centers and an informatics center, whose goal is to sequence the microbes associated with 18 different body sites from 250 people. And math is vital to what we do.

Before I get to the reason why math matters, there's one more bit of information you need. One component of the project is to PCR amplify the 16S gene--a gene that is found in every bacterium--and use this gene as a 'barcode' to determine what organisms are there and at what frequencies. In other words, this is the molecular microbial ecology of the human body.

On to why math matters. A couple of weeks ago, I presented to the group some estimates of how many sequences we needed to observe every species in different body sites (Note: the technical term is 'OTU' or operational taxonomic unit, which is a set of sequences that are all similar to each other above some threshold; we use OTUs whose sequences are > 97% similar. For simplicity's sake and reader familiarity, I will refer to OTUs as species). The four centers together have sequenced ~1000 sequences from every body site, so I could estimate how many species should appear in each body site based on the number (and distribution) of species we observed. I could also estimate how many new, unobserved species we should see as we add more sequences. Rubbing the previous two sentences together, it's possible to figure out how many sequences we need to see each species once (yes, there are confidence intervals attached to these estimates...).
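For readers who want to see what such an estimate can look like, here is a minimal sketch. The post does not say which estimator was used, so this swaps in the bias-corrected Chao1 richness estimator, a common choice in microbial ecology, purely as an illustration. The function name `chao1` and the toy counts are hypothetical and are not HMP data.

```python
def chao1(counts):
    """Bias-corrected Chao1 estimate of total species richness.

    counts: number of reads observed for each species (OTU).
    ASSUMPTION: this estimator is an illustrative stand-in; the post
    does not name the method actually used.
    """
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                        # species seen at least once
    f1 = sum(1 for c in observed if c == 1)      # singletons
    f2 = sum(1 for c in observed if c == 2)      # doubletons
    # Many singletons relative to doubletons suggest unseen species remain.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical read counts for one body site (made up for illustration).
toy_counts = [500, 300, 150, 40, 5, 2, 2, 1, 1, 1]
estimate = chao1(toy_counts)
print(f"observed species: {len(toy_counts)}, Chao1 estimate: {estimate:.1f}")
```

The gap between the observed count and the estimate is the number of species the sample suggests are still unseen, which is the quantity that drives how much more sequencing is needed.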

Since this part of the HMP is a collaborative effort, I sent around my figures and methods before our weekly phone conference. I had a sneaking suspicion that I would get a lot of questions, and I was right. Over and over, I was told that these estimates were 'counter-intuitive.' Why? Sites that had relatively few species required nearly as much sequencing as (and, in some cases, more than) those sites that had lots of species. Now, being a probability theory dork (and trained as an ecologist), I didn't find this counterintuitive at all, but to my colleagues, the idea that a less complex (that is, less species-rich) community would require as much sequencing didn't make sense.

But how deep you need to sequence depends on the frequencies of the rarest species, not on the number of species. For example, if in a community of 20 species the two rarest species each occur at a frequency of 1/10,000, you will have to sequence much more to see them than you would to see every species in a community of 100 species, all of which occur at equal frequencies (1/100). Or, put another way, in the skewed twenty-species community, each additional sequence is far less likely to reveal a new species than in the evenly distributed community of 100 species.
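The comparison above can be checked with a few lines of arithmetic. This is a rough sketch, assuming each read is an independent random draw from the community (no PCR or sequencing bias). The post only gives the frequency of the two rarest species in the skewed community, so the sketch assumes the other 18 species split the remainder equally. The function names are made up for this example.

```python
import math


def reads_to_detect(freq, confidence=0.95):
    """Smallest number of reads n such that a species at relative
    frequency `freq` shows up at least once with probability >= confidence,
    i.e. 1 - (1 - freq)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - freq))


def expected_species_seen(freqs, n_reads):
    """Expected number of distinct species observed in n_reads reads."""
    return sum(1.0 - (1.0 - p) ** n_reads for p in freqs)


# Even community from the text: 100 species, each at 1/100.
even = [1 / 100] * 100

# Skewed community from the text: 20 species, the two rarest at 1/10,000.
# ASSUMPTION: the remaining 18 species share the rest equally.
rare = 1 / 10_000
skewed = [rare] * 2 + [(1 - 2 * rare) / 18] * 18

for name, community in (("even, 100 species", even),
                        ("skewed, 20 species", skewed)):
    depth = reads_to_detect(min(community))
    seen = expected_species_seen(community, 1000)
    print(f"{name}: ~{depth} reads to catch a rarest species (95%), "
          f"{seen:.2f} species expected in 1000 reads")
```

Under these assumptions, catching one of the rarest species with 95% probability takes about 300 reads in the even community and about 30,000 in the skewed one. At 1,000 reads the even community is essentially fully sampled, while the skewed community will usually still be missing its two rare members.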

Now, my colleagues are very smart people, and they got it once I explained it to them. But this example demonstrates why mathematics as a way of thinking is vital for biologists.

Probability theory shouldn't be counterintuitive.


If probability theory were intuitive, Jason Rosenhouse wouldn't have needed to write his book (and most people wouldn't still fall for it). All mathematics above a certain level becomes counterintuitive. If it didn't, we wouldn't need to teach it the way we have to, and there would be far fewer "math is hard" articles coming out.

On the other hand, for anybody working as a scientist, with all of the education in probability they were supposed to have received in order to understand standard deviations, such math, though not "intuitive," should still become ingrained enough to be second nature.

By Joe Shelby (not verified) on 21 Jul 2010

Slightly OT, but while reading that Nature paper from last week about the viruses in the microbiome, I was wondering what the resolution is for sequencing the bacteria based on 16S rRNA. When they sequenced the viruses, they were using random genomic primers and sequencing whole virus genomes, but the bacterial sequencing seems less specific.

Can you actually say that the viruses are more diverse between people than the bacteria if the sequencing strategies are so different?

Considering the common usage of micrometers in biology and microbiology...I think we should, more specifically, learn micromath!

I only hate micromath a little bit? :p

I think that your explanation is great! It leads towards the information theoretic explanation of this, which says that as the sizes of your populations diverge, each individual data point that you gather is less informative.

You might also try explaining this in terms of chemical separations: many people who are familiar with mass spectrometry know that it's hard to get information about low-abundance species of proteins in heterogeneous samples. Techniques that separate low-abundance bacteria from high-abundance bacteria would be very useful in this case, but I can't imagine how you'd do that.

It was my understanding that there would be no math.

James, you might not be able to separate the bacteria themselves easily. Maybe you could have antibodies against the known, most populous members and use them to "pull out" their targets from a fixed sample. (For brevity's sake I left out a lot of details there :P)

But I dunno if that will work well on a sample like poop or dirt.

I think if you knew which species were the most overrepresented beforehand, you might try making very specific biotinylated "capture sequences" for their 16S rRNAs. The technical limitation here is that, due to the conserved nature of 16S rRNA, it might be harder than it sounds. But if that was doable, you could reduce the amount of the most common species in your sample and hopefully get away with less sequencing while still being able to identify the rare members of the population.

I'm assuming people have either tried this before or are working on it or something as it's not a particularly novel idea. And if no one has, then damn :p