Picking the Hive Mind's Brain: Any Good, Fast Genetic Distance Calculators?

By mikethemadbiologist on November 25, 2008.

I suppose I'll have to start at the beginning. One of the things we do in the microbiome project is analyze microbial communities by assessing the diversity of 16S DNA sequences*. In other words, we want to know what species or 'operational taxonomic units'--OTUs--are present and how many of that OTU are there.

Operationally, we are taking a bunch of DNA sequences and placing them in categories. This, of course, is more difficult than it sounds.

One of the issues with 16S is that many (nine to be exact) regions of the gene are hypervariable--they evolve so quickly that these regions can't be compared between distantly related strains. Of course, you could compare them, but you couldn't make any evolutionary sense of them because sites will have changed multiple times and you can't see that (e.g., an A -> T might really be an A -> G -> C -> T)**.

To make sure that we don't compare apples and oranges in our sequence, when we align sequences, we place a lot of gaps in the post-alignment sequences (this is a cartoon example):

ACGT might become AC--GT
CGAT --CGAT

So, onto classification. One way that we can place sequences into groups is through distance-based methods. For example, we can create categories where there is less than three percent divergence among sequences (three percent divergence means that in a sequence of 100 nucleotides, three differ between the two sequences***)****.

The problem stems from the gaps--you don't want to look at the gaps when calculating genetic distance (again, this is to avoid the multiple hits and non-homology problems). The 'gold standard' program that is often used in microbiome studies is DNADIST (which doesn't look at gaps). However, DNADIST is slow. There is no way it can handle the amounts of sequence generated; it's too mathematically intensive (while it hasn't been decided yet, my guess is that the part of the Human Microbiome Project that looks at 250 people will generate at least 50,000,000 reads. Even if only a fraction of these are unique, it's still too damn many).

There are faster methods (k-mer searching approaches) but these have gap issues. My colleagues and I are kludging together some workarounds, but does anyone know of a fast distance calculator that doesn't have trouble with gaps?

*The 16S gene is found in all bacteria and can be used as a 'barcode' to identify bacterial species, without having to grow the bacteria.

**There's a second issue which is that 16S doesn't encode for a protein--what matters is the secondary structure of the RNA molecule. This means that changes in one part of the molecule can select for changes in another part to ensure binding or formation of various loop structures.

I hate 16S.

***Calculating sequence divergence can be done by a variety of methods, but now I'm getting depressed.

****Of course, this isn't trivial either. Do we mean that all sequences must be less than three percent different from each other, or some, or only one?

More like this

Greengenes does a really good job at processing large numbers of 16S sequences, and aligning them without an explosion of indels. But it still uses DNADIST (DNAML) to calculate distances.

It sounds like your main problem is essentially that you don't want to calculate 50,000,000^2 distances using DNADIST?

No matter how fast your code, calculating 2.5x10^15 individual distances will take a *lot* of time. A much faster approach might be to first cluster the sequences (there's some fast ways to do that without calculating all pairwise distances), calculate pairwise distances only among the closest pairs, then estimate the largest distances by triangle inequality.

Sure, you'll lose some resolution between the most distantly related strains, but those distances aren't that useful for tree building anyway.

Advertisment

Donate

ScienceBlogs is where scientists communicate directly with the public. We are part of Science 2.0, a science education nonprofit operating under Section 501(c)(3) of the Internal Revenue Code. Please make a tax-deductible donation if you value independent science communication, collaboration, participation, and open access.

You can also shop using Amazon Smile and though you pay nothing more we get a tiny something.

Science 2.0

Science Codex

Universities Can Agree On All Hate Speech Except Antisemitism

More by this author

Program Announcement: I'm Moving

September 1, 2011

I've dropped some hints in the past that my relationship with ScienceBlogs would be...altered. Well, I've decided to leave. Mostly, it had to do with the issue of pseudonymity, although I'm very excited to hang out my own shingle once again. I don't want to rehash the issue of pseudonymity,…

Note to Unions: This Is Not How You Build a Coalition

September 1, 2011

The old saw that 'we hang together or we get hung separately' is a perfect description of how the left has disintegrated into irrelevance. Too often, groups will focus on modest gains for their own narrow constituency, while selling out other allies. Over the long term, each component of the…

Links 8/31/11

August 31, 2011

Links for you. Science: Underground river 'Rio Hamza' discovered 4km beneath the Amazon What do accommodationists do about creationist politicians? I've Been Told You Can Get Flu From the Flu Shot: False! Federal Work Suspension of Leading Arctic Scientist Ended as Investigation of His…

Meet the New New Math, Same As the Old New Math? What We Can Learn from Finland

August 31, 2011

Recently, The New York Times published an op-ed calling for curricular changes in K-12 math education: Today, American high schools offer a sequence of algebra, geometry, more algebra, pre-calculus and calculus (or a "reform" version in which these topics are interwoven). This has been codified by…

Links 8/30/11

August 30, 2011

Links for you. Another Scientist Calls Out Sen. Coburn's Misleading, Juvenile "Report" XMRV: ITS EVERYWHERE! UUUUUGH! ITS IN MY RACCOON WOUNDS! AND MY QIAGEN COLUMNS! Coulter Goes All Science-y in Bid to Disprove Evolution Yet another bad day for the anti-vaccine movement 2011 Antibiotics: Killing…