A Handy Transformation for Microbiome Data

By mikethemadbiologist on August 9, 2011.

When I worked on the human microbiome, I regularly confronted two problems with the data. First, species frequencies are almost never normally distributed ('the bell curve'), and if you want to use standard statistical techniques the data should be normally distributed. The second problem is that the data often have a lot of zero values. That is, if I look at a bunch of gut samples from people (actually the data--the samples are VERY STINKY!), in many samples, a bacterial species* will be quite frequent (2-20%), but in other samples, it will be very rare (0.01%) or completely absent (i.e., 0%).

Often, people will use a log transformation of the data, but that presents problems if you have zeros (the log of 0 is undefined). One transformation that can handle zeros and frequency data** is the arcsine square root transformation. It turns out that economists have to deal with the same issue: some people earn millions, while others earn nothing. So what do they use?

They use the inverse hyperbolic sine transformation. Here's how Frances Woolley describes it:

Happily, there's an easy solution to this problem: the inverse hyperbolic sine transformation. It sounds intimidating and impressive; it isn't.

The inverse hyperbolic sine transformation is defined as:

log(y_i + (y_i² + 1)^(1/2))

Except for very small values of y, the inverse [hyperbolic] sine is approximately equal to log(2y_i) or log(2)+log(y_i), and so it can be interpreted in exactly the same way as a standard logarithmic dependent variable. For example, if the coefficient on "urban" is 0.1, that tells us that urbanites have approximately 10 percent higher wealth than non-urban people.

But unlike a log variable, the inverse hyperbolic sine is defined at zero.

So why don't people use it? Why did I find myself this morning, once again, writing a revise-and-resubmit letter along the lines of "and re-do the estimation using an inverse hyperbolic sine transformation"?

It's not that the inverse hyperbolic sine is fancy and new - John Burbidge, Lonnie Magee and Les Robb wrote a nice paper on it back in 1988, and that paper cites a 1949 piece by Johnson.

I think it's just a matter of ignorance. Most of the time, a log transformation will do the job, so that's what most people are familiar with. Plus now there are newer and sexier alternatives to the IHS, like quantile regression.

Seems like it could be useful.
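To make the definition concrete, here is a minimal Python sketch (not taken from Woolley's post). It writes the inverse hyperbolic sine out from the formula above, checks it against the standard library's `math.asinh`, and compares it with log(2y). The abundance values are made up for illustration, and the helper name `ihs` is hypothetical.

```python
import math

def ihs(y):
    """Inverse hyperbolic sine written out from the formula: log(y + sqrt(y^2 + 1))."""
    return math.log(y + math.sqrt(y * y + 1.0))

# Hypothetical relative abundances in percent, including a zero.
abundances = [0.0, 0.01, 2.0, 20.0]

for y in abundances:
    # math.asinh is the standard-library version of the same function.
    assert math.isclose(ihs(y), math.asinh(y), rel_tol=1e-9, abs_tol=1e-12)
    log_2y = math.log(2.0 * y) if y > 0 else float("nan")  # log is undefined at zero
    print(f"y={y:>6}  ihs={ihs(y):.4f}  log(2y)={log_2y:.4f}")
```

One thing to keep in mind: the result depends on the units. For values well below 1 (proportions, say, rather than percentages), asinh(y) is close to y itself, so the log-like behavior only shows up once values are well above 1.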

*Actually, we typically use either genera, or operational taxonomic units ('OTUs'), which are a set of closely related bacteria.

**The issue with frequencies is that it's impossible to have a negative frequency. The arcsine square root transformation gets around this.
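For comparison, the arcsine square root transformation mentioned in this footnote can be sketched in a few lines of Python. This is only an illustration: the function name `arcsine_sqrt` and the example proportions are assumptions, and the input has to be a proportion between 0 and 1 (so 2% is entered as 0.02).

```python
import math

def arcsine_sqrt(p):
    """Arcsine square root of a proportion p, which must lie in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))

# Hypothetical relative abundances as proportions (0%, 0.01%, 2%, 20%, 100%).
proportions = [0.0, 0.0001, 0.02, 0.20, 1.0]
transformed = [arcsine_sqrt(p) for p in proportions]

for p, t in zip(proportions, transformed):
    print(f"p={p:<7} arcsine_sqrt={t:.4f}")
```

The output runs from 0 (for a proportion of 0) up to pi/2 (for a proportion of 1), so zeros are handled without any special treatment.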

How often are non-parametric or other robust methods used in your, or similar, analyses?

Mike, thanks for the tip. Another alternative to those inappropriate zeros is to use microbiome hybridization intensity data instead of sequence counts. Microarray users who in the past profiled various mRNA levels from one species are now profiling various rRNA levels from entire communities using high density microarrays. One of the many advantages is that intensity data, even if it's below an assay's threshold, is still non-zero.

what dean asked. i (a soil ecologist) tend to just go straight to non-parametric methods. there are lots of them available for free for use with R.

Interesting - I'll have to dig into that a bit more. I usually add a pseudo count of 1/2 or 1 to make the zero values visible on the log scale. Usually, the very low values are too noisy to be statistically significant anyway.

There are some cases where the low values do carry some very useful information though, but I'd have to check whether the inverse hyperbolic sine transformation introduces any more distortions than using a pseudocount (which is more of a max likelihood approach anyway).

If all you're using it for is to fit some curve to the log-abundance data, chances are it doesn't matter much which approach you use. Note that the inverse hyperbolic sine also approaches log(y+1) for small values of y.
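The comparison raised in this comment is easy to check numerically. The following is a rough Python sketch, not anyone's actual pipeline: the counts are invented, and `log_pseudocount` is a hypothetical helper that adds a pseudocount of 1 before taking the log.

```python
import math

# Hypothetical read counts, including a zero.
counts = [0, 1, 5, 50, 5000]

def log_pseudocount(y, pseudo=1.0):
    """Log transform after adding a pseudocount so that zeros are defined."""
    return math.log(y + pseudo)

for y in counts:
    a = math.asinh(y)
    l = log_pseudocount(y)
    print(f"count={y:>5}  asinh={a:.4f}  log(y+1)={l:.4f}  difference={a - l:.4f}")
```

Both transformations map zero to zero. For large counts the gap between them settles near log(2), about 0.693, which is a constant offset and so should not change differences between groups.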


Thanks jdub. It's nice to know new spots where our stuff is used. Most of the students I teach are in business or health-related areas, and especially in the latter the classical stuff is firmly entrenched. And R is a very nice tool: it would have made grad/dissertation work much nicer had it been around in the 80s.

Typically, a transformation is used in order to make a non-normal distribution more so. I recognize that you are trying to do something about the zeroes in the data, but unless the net result of the transformation is to make the distribution look more normal, so that you can use Gaussian statistics, it hasn't accomplished the key objective.

Have you checked the transformed distribution, e.g. by plotting the data against normal scores, to see if you have "normalized" it?

If not, the better approach might be to use some of the so-called "distribution-free" statistics, as suggested by jdub. Here's a list of some of them: http://en.wikipedia.org/wiki/Non-parametric_statistics
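One way to run the check suggested here without drawing a plot is to correlate the sorted data with their normal scores; a value near 1 means the points would fall close to a straight line. The sketch below is an assumption-laden illustration in Python: the ten abundance values are made up, the plotting positions (i + 0.5)/n are one common convention among several, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist, mean

def normal_scores(n):
    """Expected standard-normal quantiles for n ordered observations."""
    return [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

def qq_correlation(values):
    """Pearson correlation between the sorted values and their normal scores."""
    xs = sorted(values)
    zs = normal_scores(len(xs))
    mx, mz = mean(xs), mean(zs)
    cov = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_z = sum((z - mz) ** 2 for z in zs)
    return cov / math.sqrt(var_x * var_z)

# Hypothetical abundances (percent) for one taxon across ten samples.
raw = [0.0, 0.0, 0.01, 0.05, 0.3, 1.2, 2.0, 4.5, 9.0, 20.0]
transformed = [math.asinh(y) for y in raw]

print(f"raw data:    r = {qq_correlation(raw):.3f}")
print(f"after asinh: r = {qq_correlation(transformed):.3f}")
```

For these invented values the transformed data track the normal scores more closely than the raw data do. Note that no transformation can spread out tied values, so a pile of zeros stays a pile of zeros; that is one reason the non-parametric route may still be the better choice.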

Thanks for this, I've been struggling with data transformation lately.

I've got gas production and consumption data, the result of microbial activity in soils - actual molecular biology will commence as soon as I get all this other stuff cleared up and get off my butt and down into the lab. Anyway, I had to deal with non-normal distribution, lots of zeros, and a bunch of negative numbers; log-transformation took care of the normality problem (at least according to the normality-tests in R and Minitab), but at the cost of some additional transformations that still make me a little uncomfortable; we'll see what the reviewers think of what I've done.

I'll try out this inverse hyperbolic sine transformation, it seems like it could be a more straightforward (i.e. fewer lurking assumptions) approach. Any advice for dealing with the negatives? So far, I've tried separating them out, multiplying by -1 to clear them up, then log-transforming just like the rest of the data, though that makes it difficult to compare to zero (i.e. is the production or consumption of gas I observe significantly different from the null hypothesis of no production at all?) I've also tried adding a "floor" equal to slightly larger than the most negative value; this raises all values to greater-than-zero, and my comparison is then against the null of that floor, rather than zero. Seems to work, but I keep thinking that every manipulation or transformation of my data brings me a little further away from what's actually happening in my soils.

Thanks again, this is really interesting.

Minitab works for me, for normality tests. Re: negative numbers- I don't see what harm it does to add a constant value to all your values before transforming them. This just shifts the center of the distribution- no effect on variances, or on differences between means. I'd do it.
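On the question of negative numbers, it may help to see the two options side by side. The inverse hyperbolic sine is defined for negative values and is symmetric around zero (asinh(-y) = -asinh(y)), so it needs no shift. The Python sketch below uses invented flux values, and the offset in the shifted-log version is an arbitrary choice made only for illustration.

```python
import math

# Hypothetical net gas fluxes: negative = consumption, positive = production.
fluxes = [-12.0, -0.5, 0.0, 0.5, 12.0]

# Option 1: asinh takes negative values directly and maps zero to zero.
ihs_values = [math.asinh(f) for f in fluxes]

# Option 2: shift everything above zero, then take logs.
# The size of the offset (here 1.0 above the minimum) is an arbitrary choice.
shift = 1.0 - min(fluxes)
shifted_logs = [math.log(f + shift) for f in fluxes]

for f, a, s in zip(fluxes, ihs_values, shifted_logs):
    print(f"flux={f:>6}  asinh={a:>8.4f}  log(flux+{shift:g})={s:.4f}")
```

With asinh, "no production" stays at 0 after the transformation. With the shifted log, it lands at log(shift) instead, so the comparison is against that value rather than zero. Whether that matters depends on the analysis.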

