Reporting on the human microbiome–the microorganisms that live on and in us–is quite the rage these days. As someone who is involved in NIH’s Human Microbiome Project, I find it a pretty exciting time, because the size and scale of the data we’re able to generate are unprecedented.

This also means we have to figure out how to not only generate, but also *analyze* these data. One kind of data we generate is 16S rRNA sequences, which are found in all bacteria and can be used as a ‘barcode’ to identify and quantify the bacteria in a community without having to culture each species. A recent paper that has received some coverage compared the gut microbiomes (actually, the bacteria found in stool) of European children who eat a Western diet with those of children raised in Burkina Faso who eat a fiber-rich diet. The authors found that the two populations had very different compositions of bacteria based on the frequencies–and that last word will become very important–of different bacterial phyla.

While it’s an interesting paper, I didn’t like how the authors assessed statistically significant differences at all.

What the authors did (and in fairness, many others do too) was compare the frequencies of different bacterial taxa between the two populations. They did this using two different methods:

1) Analysis of Variance (ANOVA). For any taxon, this test estimates a normal distribution (‘bell curve’) of frequencies for the Burkina Faso population and for the European population, and then determines how much these two distributions overlap. If there isn’t enough overlap, then we conclude that the frequencies of a particular bacterium in the two populations are different. In other words, given that people vary in the frequency of bacterium X living in their guts, is this variability associated with geography? If we do find a significant association, we can determine how much of the variation can be accounted for by geography (i.e., living in either Burkina Faso or the EU). The downside of ANOVA is that it assumes the data are normally distributed. If they are not, this can lead to problems, although usually what happens is that we miss significant effects (false negatives), as opposed to thinking something matters when it actually doesn’t (false positives). More about this in a bit.

2) Non-parametric tests (Kruskal-Wallis rank sum test*). Without getting into details, all of the frequencies in our sample are rank ordered from highest to lowest, and the sums of the ranks of the two groups (Burkina Faso and EU) are compared (I’m leaving out a lot of math here). Unlike ANOVA, this test doesn’t assume a normal distribution of data; it does, however, assume that the data in the two groups are distributed in roughly the same way (something people seem to forget–they think this test is magic. It’s not). In my experience with microbiome data, this assumption is also violated.
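For the curious, here is a minimal Python sketch of the rank sum idea behind the Kruskal-Wallis test. The frequencies are made up for illustration (they are not from the paper), the function names are my own, and the statistic is computed without the usual correction for ties, so treat it as a sketch rather than a replacement for a statistics package.

```python
def average_ranks(values):
    """Rank values from 1..N (smallest = 1); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)


# Hypothetical frequencies of one taxon in five children per population.
burkina_faso = [0.58, 0.61, 0.49, 0.73, 0.66]
europe = [0.22, 0.31, 0.18, 0.40, 0.27]

h_stat = kruskal_wallis_h([burkina_faso, europe])
print(f"H = {h_stat:.3f}")
```

Ranking from lowest to highest (as here) or highest to lowest gives the same H. A larger H means the rank sums of the groups are further apart than chance would suggest; turning H into a p-value requires a chi-square approximation or an exact table, which this sketch leaves out.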

I mentioned earlier that these are frequency data. This is critical to understanding what I think is the preferable approach. The reason normality is violated (a problem for ANOVA) and subgroups will often differ in their distributions (a problem for Kruskal-Wallis) is *because* these are frequency data.

To understand what this means, let’s delve further into what exactly these data are. We start with a biological sample, in this case a lump of shit that has been, erm, ‘donated’**. The cells are killed, and the 16S rRNA gene (found in all bacteria) is amplified using PCR. We now have a tube full of billions of 16S rRNA copies that we can sequence. But what we have also done is decouple absolute abundance from relative proportions: we no longer know how many cells of each taxon were in the sample, only what fraction of the sequences each taxon contributes. Suppose I decide to sequence 5,000 molecules per sample. If a taxon, for instance Bacteroides, has a higher frequency in one sample, then by definition all the other species must have lower frequencies. This is essentially a zero sum game: if one organism goes up, the others must go down.
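Here is a small Python sketch of that zero sum game. The taxa and cell counts are hypothetical, chosen only to show the arithmetic: in the second sample only Bacteroides changes in absolute terms, yet every other frequency drops.

```python
def relative_frequencies(absolute_counts):
    """Convert absolute counts per taxon into frequencies that sum to 1."""
    total = sum(absolute_counts.values())
    return {taxon: count / total for taxon, count in absolute_counts.items()}


# Hypothetical absolute cell counts in two samples.
sample_a = {"Bacteroides": 4_000, "Prevotella": 3_000, "Faecalibacterium": 3_000}
# Only Bacteroides increases; the other taxa are unchanged in absolute terms.
sample_b = dict(sample_a, Bacteroides=12_000)

freq_a = relative_frequencies(sample_a)
freq_b = relative_frequencies(sample_b)

reads_per_sample = 5_000  # sequencing depth used in the text
for taxon in sample_a:
    print(
        f"{taxon:>16}: {freq_a[taxon]:.3f} -> {freq_b[taxon]:.3f} "
        f"(about {freq_b[taxon] * reads_per_sample:.0f} of {reads_per_sample} reads)"
    )
```

Prevotella and Faecalibacterium did nothing, but their frequencies fall because the reads have to add up to the same total.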

Regarding ANOVA, you won’t typically get normal distributions with frequency data (whereas we often observe normality with absolute data, such as height), as frequency data are binomially distributed, so the variance depends on the mean. Likewise, if subsamples differ in their mean frequencies, their distributions will differ, potentially violating the Kruskal-Wallis test assumptions.

So what do we do? Fortunately, there are these people called ecologists who have had to deal with frequency data. Because a sample from a pool of 16S molecules really isn’t any different than looking at how much of a crowded square meter of rocky shore is covered by algae:

**Algae: not as stinky as your butt**

If one species of algae increases, that means the other species have to decrease, just like with the 16S sequences. So what have ecologists done? It’s pretty simple: they transform the frequencies by taking the arcsine of the square root*** of each frequency (thus, the transformed values will range between 0 and π/2, or about 1.57). In statistical terms, this decouples the variance from the mean. It also makes the data ‘normal-esque’ or even normally distributed.
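To see the variance stabilization at work, here is a Python sketch that computes the exact variance of a raw frequency and of its arcsine square root when the read count for a taxon is binomial. It assumes 5,000 reads per sample, as in the example above; the three true frequencies and the function names are mine, picked for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def variance_of(transform, n, p):
    """Exact variance of transform(k / n) when k ~ Binomial(n, p), 0 < p < 1."""
    weights = [binomial_pmf(k, n, p) for k in range(n + 1)]
    values = [transform(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(weights, values))
    return sum(w * (v - mean) ** 2 for w, v in zip(weights, values))


def identity(freq):
    return freq


def arcsine_sqrt(freq):
    return math.asin(math.sqrt(freq))


n_reads = 5_000
for p in (0.01, 0.1, 0.5):  # hypothetical true frequencies
    raw = variance_of(identity, n_reads, p)
    transformed = variance_of(arcsine_sqrt, n_reads, p)
    print(f"p = {p:<4}  raw variance = {raw:.2e}  transformed variance = {transformed:.2e}")
```

The raw variance is p(1 − p)/n, so it changes with the mean frequency. The transformed variance stays close to 1/(4n) whatever the mean is, which is the whole point of the transformation.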

This data transformation then allows us to use methods like ANOVA. Not only can we see how strong a given effect (e.g., Burkina Faso versus the EU) is, but we can even compare the relative importance of different effects (e.g., geography versus gender).
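As a rough Python sketch of what that looks like: transform the frequencies, then run a one-way ANOVA and report how much of the variation the grouping accounts for (eta squared). The frequencies are invented for illustration, the function names are mine, and the sketch stops at the F statistic rather than computing a p-value.

```python
import math


def arcsine_sqrt(freq):
    return math.asin(math.sqrt(freq))


def one_way_anova(groups):
    """Return (F statistic, eta squared) for a one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation between group means versus variation within each group.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Hypothetical frequencies of one taxon in five children per population.
burkina_faso = [0.58, 0.61, 0.49, 0.73, 0.66]
europe = [0.22, 0.31, 0.18, 0.40, 0.27]

transformed_groups = [
    [arcsine_sqrt(f) for f in burkina_faso],
    [arcsine_sqrt(f) for f in europe],
]
f_stat, eta_squared = one_way_anova(transformed_groups)
print(f"F = {f_stat:.2f}, share of variation explained by geography = {eta_squared:.2f}")
```

Eta squared is the ‘how much of the variation can be accounted for’ number: running the same calculation with a different grouping (gender, say) gives a figure you can set beside the one for geography.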

I won’t claim that this is ‘the’ answer (it’s too early for that), and, as I’ve discussed before, other people like a phylogenetic approach to these data, but this subdiscipline really needs to start thinking rigorously about how to approach these data. We also shouldn’t reinvent the wheel: big critter ecologists have thought about this for decades, and we should learn from them.

As more data come out, and more studies are released, it will be interesting to see how people analyze these data.

It might make me very Mad….

*Technically, a comparison of two and only two samples uses the Mann-Whitney U test, but the underlying principle is the same.

**Most of microbiology deals with slime, ooze, pus, or shit. It requires a strong stomach.

***Some suggest using the logit transformation instead. One issue with the logit, however, is how one handles zero values: the logit of zero is undefined, and any frequency below 50% turns into a negative number. One could claim that a zero value is actually equivalent to a very small number, and then add that value to all of the logit transforms (thus bringing the very rare frequencies to a value greater than zero). The problem here is that, depending on the value added, significance and strength of effect can vary: there is a difference between comparing 3 to 5 versus 5 to 7. Thus, I like the arcsine square root transformation, especially with datasets where frequencies range widely, with a fair amount of zero values.
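A quick Python sketch of why the choice of ‘very small number’ matters. Here I assume the small value is added to each frequency before taking the logit, which is one common way of handling zeros and may not be exactly the procedure others use; the two frequencies and both small values are hypothetical.

```python
import math


def logit(freq, pseudocount):
    """Logit of a frequency after adding a small value (assumed zero handling)."""
    shifted = freq + pseudocount
    return math.log(shifted / (1 - shifted))


def arcsine_sqrt(freq):
    return math.asin(math.sqrt(freq))


absent, rare = 0.0, 0.001  # hypothetical frequencies of one taxon in two samples

# The gap between 'absent' and 'rare' on the logit scale depends on the value added.
gap_tiny = logit(rare, 1e-6) - logit(absent, 1e-6)
gap_larger = logit(rare, 1e-3) - logit(absent, 1e-3)

# The arcsine square root is defined at zero, so nothing needs to be added.
gap_arcsine = arcsine_sqrt(rare) - arcsine_sqrt(absent)

print(f"logit gap with 1e-6 added: {gap_tiny:.2f}")
print(f"logit gap with 1e-3 added: {gap_larger:.2f}")
print(f"arcsine square root gap:   {gap_arcsine:.4f}")
```

The same two samples end up far apart or close together on the logit scale depending on an arbitrary choice, while the arcsine square root gives one answer.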

**Cited article:** De Filippo C, Cavalieri D, Di Paola M, Ramazzotti M, Poullet JB, Massart S, Collini S, Pieraccini G, & Lionetti P (2010). Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proceedings of the National Academy of Sciences of the United States of America PMID: 20679230