A Handy Transformation for Microbiome Data

When I worked on the human microbiome, I regularly confronted two problems with the data. First, species frequencies are almost never normally distributed ('the bell curve'), and many standard statistical techniques assume normally distributed data. Second, the data often have a lot of zero values. That is, if I look at a bunch of gut samples from people (actually the data--the samples are VERY STINKY!), in many samples a bacterial species* will be quite frequent (2-20%), but in other samples it will be very rare (0.01%) or completely absent (i.e., 0%).

Often, people will use a log transformation of the data, but that presents problems if you have zeros (the log of 0 is undefined). One transformation that can handle zeros and frequency data** is the arcsine square root transformation. It turns out that economists have to deal with the same issue: some people earn millions, while others earn nothing. So what do they use?

They use the inverse hyperbolic sine transformation. Here's how Frances Woolley describes it:

Happily, there's an easy solution to this problem: the inverse hyperbolic sine transformation. It sounds intimidating and impressive; it isn't.

The inverse hyperbolic sine transformation is defined as:

log(y_i + (y_i^2 + 1)^{1/2})

Except for very small values of y, the inverse sine is approximately equal to log(2y_i) or log(2)+log(y_i), and so it can be interpreted in exactly the same way as a standard logarithmic dependent variable. For example, if the coefficient on "urban" is 0.1, that tells us that urbanites have approximately 10 percent higher wealth than non-urban people.

But unlike a log variable, the inverse hyperbolic sine is defined at zero.

So why don't people use it? Why did I find myself this morning, once again, writing a revise-and-resubmit letter along the lines of "and re-do the estimation using an inverse hyperbolic sine transformation"?

It's not that the inverse hyperbolic sine is fancy and new - John Burbidge, Lonnie Magee and Les Robb wrote a nice paper on it back in 1988, and that paper cites a 1949 piece by Johnson.

I think it's just a matter of ignorance. Most of the time, a log transformation will do the job, so that's what most people are familiar with. Plus now there are newer and sexier alternatives to the IHS, like quantile regression.

Seems like it could be useful.
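
To make the quoted formula concrete, here is a minimal sketch using only the Python standard library. The function name `ihs` and the sample abundances are made up for illustration; they are not from Woolley's post or from any real dataset.

```python
import math

def ihs(y):
    """Inverse hyperbolic sine, written out as log(y + sqrt(y^2 + 1))."""
    return math.log(y + math.sqrt(y * y + 1.0))

# Hypothetical relative abundances in percent, including a true zero.
abundances = [0.0, 0.01, 2.0, 20.0]

for y in abundances:
    # math.asinh is the built-in version of the same function.
    print(y, ihs(y), math.asinh(y))

# For large y the transform tracks log(2y), as the quote says.
big = 1000.0
print(ihs(big), math.log(2 * big))
```

Unlike log(y), the transform is defined at y = 0, where it returns 0. One caveat: the log-like behaviour only kicks in once y is well above 1. For values much smaller than 1 the transform is close to y itself, so the units matter (percentages versus proportions, for instance).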

*Actually, we typically use either genera, or operational taxonomic units ('OTUs'), which are a set of closely related bacteria.

**The issue with frequencies is that it's impossible to have a negative frequency. The arcsine square root transformation gets around this.

How often are non-parametric or other robust methods used in your, or similar, analyses?

Mike, thanks for the tip. Another alternative to those inappropriate zeros is to use microbiome hybridization intensity data instead of sequence counts. Microarray users who in the past profiled various mRNA levels from one species are now profiling various rRNA levels from entire communities using high density microarrays. One of the many advantages is that intensity data, even if it's below an assay's threshold, is still non-zero.

what dean asked. i (a soil ecologist) tend to just go straight to non-parametric methods. there are lots of them available for free for use with R.

Interesting - I'll have to dig into that a bit more. I usually add a pseudo count of 1/2 or 1 to make the zero values visible on the log scale. Usually, the very low values are too noisy to be statistically significant anyway.

There are some cases where the low values do carry some very useful information though, but I'd have to check whether the inverse hyperbolic sine transformation introduces any more distortions than using a pseudocount (which is more of a max likelihood approach anyway).

If all you're using it for is to fit some curve to the log-abundance data, chances are it doesn't matter much which approach you use. Note that the inverse hyperbolic sine also approaches log(y+1) for small values of y.
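
A quick way to see how the two approaches compare is to tabulate them side by side. This is only a sketch: the helper name `compare` and the read counts are invented for illustration, and a pseudocount of 1 is assumed.

```python
import math

def compare(count):
    """Return (log with pseudocount of 1, inverse hyperbolic sine) for one count."""
    return math.log(count + 1), math.asinh(count)

# Made-up read counts spanning zero to abundant.
counts = [0, 1, 2, 5, 10, 100, 1000]

for c in counts:
    pseudo, ihs = compare(c)
    # The last column is the gap between the two transforms.
    print(f"{c:>5} {pseudo:8.4f} {ihs:8.4f} {ihs - pseudo:8.4f}")
```

Both transforms map zero to zero. For large counts the gap settles near log(2), a constant offset, which is consistent with the point that the choice rarely matters for curve fitting.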

Thanks jdub. It's nice to know new spots where our stuff is used. Most of the students I teach are in business or health-related areas, and especially in the latter the classical stuff is firmly entrenched. And R is a very nice tool: it would have made grad/dissertation work much nicer had it been around in the 80s.

Typically, a transformation is used in order to make a non-normal distribution more so. I recognize that you are trying to do something about the zeroes in the data, but unless the net result of the transformation is to make the distribution look more normal, so that you can use Gaussian statistics, it hasn't accomplished the key objective.

Have you checked the transformed distribution, e.g. by plotting the data against normal scores, to see if you have "normalized" it?

If not, the better approach might be to use some of the so-called "distribution-free" statistics, as suggested by jdub. Here's a list of some of them: http://en.wikipedia.org/wiki/Non-parametric_statistics
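
For anyone who wants to try the normal-scores check, here is a rough sketch using only the Python standard library. The function name, the Blom plotting positions, and the sample values are all assumptions chosen for illustration; a proper analysis would use a real normality test or a Q-Q plot from a statistics package.

```python
import math
from statistics import NormalDist, mean

def normal_scores_correlation(values):
    """Correlation between the sorted values and their expected normal scores."""
    xs = sorted(values)
    n = len(xs)
    # Blom plotting positions: one common convention among several.
    scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
              for i in range(1, n + 1)]
    mx, ms = mean(xs), mean(scores)
    cov = sum((x - mx) * (s - ms) for x, s in zip(xs, scores))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_s = sum((s - ms) ** 2 for s in scores)
    return cov / math.sqrt(var_x * var_s)

# Made-up skewed abundances (percent) with a couple of zeros.
raw = [0.0, 0.0, 0.01, 0.02, 0.05, 0.3, 1.5, 2.0, 8.0, 20.0]
transformed = [math.asinh(y) for y in raw]

print("raw:        ", normal_scores_correlation(raw))
print("transformed:", normal_scores_correlation(transformed))
```

A correlation closer to 1 means the points fall closer to a straight line on a normal-scores plot. On this toy data the transformed values score higher than the raw ones, but that is not proof of normality: the zeros still pile up at one end whatever the transformation.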

Thanks for this, I've been struggling with data transformation lately.

I've got gas production and consumption data, the result of microbial activity in soils - actual molecular biology will commence as soon as I get all this other stuff cleared up and get off my butt and down into the lab. Anyway, I had to deal with non-normal distribution, lots of zeros, and a bunch of negative numbers; log-transformation took care of the normality problem (at least according to the normality-tests in R and Minitab), but at the cost of some additional transformations that still make me a little uncomfortable; we'll see what the reviewers think of what I've done.

I'll try out this inverse hyperbolic sine transformation, it seems like it could be a more straightforward (i.e. fewer lurking assumptions) approach. Any advice for dealing with the negatives? So far, I've tried separating them out, multiplying by -1 to clear them up, then log-transforming just like the rest of the data, though that makes it difficult to compare to zero (i.e. is the production or consumption of gas I observe significantly different from the null hypothesis of no production at all?) I've also tried adding a "floor" equal to slightly larger than the most negative value; this raises all values to greater-than-zero, and my comparison is then against the null of that floor, rather than zero. Seems to work, but I keep thinking that every manipulation or transformation of my data brings me a little further away from what's actually happening in my soils.

Thanks again, this is really interesting.

By TheBrummell (not verified) on 10 Aug 2011
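
On the question of negatives: the inverse hyperbolic sine is defined for negative numbers and is symmetric about zero, so it can be applied directly without splitting the data or adding a floor. A small sketch follows, with made-up flux values where negative means consumption. Nobody in this thread reports trying it on this kind of data, so treat it as an illustration only.

```python
import math

# Hypothetical gas flux measurements: negative = consumption, positive = production.
fluxes = [-3.2, -0.4, 0.0, 0.7, 12.5]

transformed = [math.asinh(f) for f in fluxes]

for f, t in zip(fluxes, transformed):
    print(f"{f:>6} -> {t:.4f}")
```

Zero stays at zero and the ordering of the values is preserved, so a comparison against "no production at all" is still a comparison against zero on the transformed scale.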

Minitab works for me, for normality tests. Re: negative numbers- I don't see what harm it does to add a constant value to all your values before transforming them. This just shifts the center of the distribution- no effect on variances, or on differences between means. I'd do it.