Indels in Phylogenies: How Should We Treat Them?

By mikethemadbiologist on October 2, 2008.

This will be something of a technical post, but I've decided to pick the hivemind's brain. In some projects I'm involved with, we're generating lists of SNPs for a bunch of bacterial strains using Illumina.

For a given strain, anywhere from ten to forty percent of the SNPs are indels (actually 'SNP' is a bit of a misnomer because we can detect small multi-nucleotide insertions and deletions). Here's the question: is there any way to use maximum likelihood methods with indels? I could just use parsimony methods, and treat the indels as characters, but I don't want to lose information about the molecular evolutionary model. For the substitutions, it's pretty clear that they violate the de facto assumptions of parsimony (equal rates and transition frequencies across all sites), so I would like to use a method that incorporates a molecular model.

I would note that as it gets really cheap to scan entire microbial genomes for SNPs, this will be a problem we have to grapple with, so strap on yer thinkin' caps and come up with a solution!

Seriously, any ideas?

Update: While I was on vacation, I stumbled across this paper that uses DNAML to deal with gaps. While it's definitely an improvement, there are two problems: 1) it treats a gap that is longer than one character (e.g., "---") as multiple characters--each gapped site is treated as a separate character; 2) DNAML isn't very fast computationally (although maybe this modified version of DNAML could be implemented in fastDNAML?).
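To make problem 1 concrete, here is a minimal Python sketch that contrasts the two ways of counting gaps in one aligned sequence: per gapped site versus per contiguous run. The helper name `gap_runs` and the example sequence are made up for illustration; this is not how DNAML or the paper's modified version handles gaps internally.

```python
import re

def gap_runs(seq, gap="-"):
    """Return (start, end) half-open intervals of contiguous gap runs."""
    pattern = re.escape(gap) + "+"
    return [(m.start(), m.end()) for m in re.finditer(pattern, seq)]

# Hypothetical aligned sequence with one 3-bp gap and one 1-bp gap.
aligned = "ACG---TTA-GC"

runs = gap_runs(aligned)
per_site_characters = aligned.count("-")  # every gapped column counted separately
per_event_characters = len(runs)          # each contiguous run counted once

print(runs)                                        # [(3, 6), (9, 10)]
print(per_site_characters, per_event_characters)   # 4 2
```

Counting per site gives the 3-bp deletion three times the weight of the 1-bp deletion, even though each may well be a single mutational event.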

A program called Prankster was developed by Ari Loytynoja in Nick Goldman's group at EBI in the UK. It's designed to handle indels.

Articles describing it can be found here:

1. http://www.sciencemag.org/cgi/content/full/320/5883/1632?maxtoshow=&HIT…

2. http://www.pnas.org/content/102/30/10411.full?sid=0c046f3e-2c18-4311-ae…

3. http://www.pnas.org/content/102/30/10557.full

The program can be downloaded for Mac or Windows here: http://www.ebi.ac.uk/goldman-srv/prank/prankster/ (actual FTP site is http://www.ebi.ac.uk/goldman-srv/prank/src/prankster/windows/ ).

After installing, I used a 2-state model for my ITS-rDNA data set, built with the model-generating website: http://www.ebi.ac.uk/goldman-srv/prank/models/. It seems like several of these models might also be useful for aligning concatenated multi-locus data sets whose loci evolve at different rates.

Goldman's group and the whole EMBL-EBI website have a pretty impressive set of tools, which I've barely browsed.

This is a problem I've encountered many times -- and still haven't come up with a good idea. There's this paper (http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1…). I haven't read it yet but it looks like it may be of use to you.

I've used indels as characters in Bayesian analyses before. You have to partition the data but it works and I think it made the analysis much stronger.

I think Sarah Palin will totally know this one.

This doesn't answer your question but I doubt very much that the quality of your data justifies using maximum likelihood or any other sophisticated algorithm.

Many of your SNPs, including indels, are likely to be cloning and/or sequencing artifacts of one sort or another. You might as well stick with difference methods which are not only faster but have the advantage of allowing for various gap penalties to deal with indels. They also tend to swamp out errors.
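As a rough illustration of a difference method with gap penalties, here is a minimal Python sketch that scores two already-aligned sequences: each mismatch costs 1, opening a gap costs more than extending one, and columns where both sequences are gapped are skipped. The function name and the penalty values (2.0 to open, 0.5 to extend) are arbitrary assumptions, not values taken from any particular program.

```python
def pairwise_difference(a, b, gap_open=2.0, gap_extend=0.5):
    """Raw difference score between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    score = 0.0
    prev_gap = None  # which sequence ("a" or "b") was gapped in the last scored column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # shared gap: uninformative for this pair
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap on the same side is cheaper than opening a new one.
            score += gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            if x != y:
                score += 1.0
            prev_gap = None
    return score

def distance_matrix(alignment):
    """All pairwise scores for a dict of strain name -> aligned sequence."""
    names = sorted(alignment)
    return {(p, q): pairwise_difference(alignment[p], alignment[q])
            for i, p in enumerate(names) for q in names[i + 1:]}

# Hypothetical strains: one 2-bp gap (2.0 + 0.5) plus one mismatch (1.0).
example = {"s1": "ACGTTAGC", "s2": "ACG--AGT"}
print(distance_matrix(example))  # {('s1', 's2'): 3.5}
```

The resulting matrix could be fed to any distance-based tree builder; the scores are raw costs, not corrected evolutionary distances.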

I think David's right. Your best bet, based on what's out there, is to let the gaps be treated as missing, create a binary character state matrix coding for presence/absence of individual gaps, and run a partitioned analysis in MrBayes or wherever.
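The coding scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the strain names and sequences are invented, every distinct gap interval becomes one presence/absence character, and a strain whose longer gap fully spans a shorter one is scored "?" for that shorter gap (one possible convention, not necessarily the one you would want). The partition definition for MrBayes itself is left out.

```python
import re

def gap_runs(seq):
    """Set of (start, end) half-open intervals covered by runs of '-'."""
    return {(m.start(), m.end()) for m in re.finditer(r"-+", seq)}

def code_indels(alignment):
    """Score every distinct gap interval as one binary character.

    alignment: dict of strain name -> aligned sequence (all the same length).
    Returns (chars, matrix): chars is the sorted list of gap intervals and
    matrix maps each strain to a string of '1' (gap present), '0' (absent)
    or '?' (strain has a longer gap spanning this one; an assumed convention).
    """
    runs = {name: gap_runs(seq) for name, seq in alignment.items()}
    chars = sorted(set().union(*runs.values()))
    matrix = {}
    for name in alignment:
        row = []
        for start, end in chars:
            if (start, end) in runs[name]:
                row.append("1")
            elif any(a <= start and end <= b for a, b in runs[name]):
                row.append("?")
            else:
                row.append("0")
        matrix[name] = "".join(row)
    return chars, matrix

def combine(alignment, matrix):
    """Nucleotides with gaps recoded as missing, followed by the indel characters."""
    return {name: seq.replace("-", "?") + matrix[name]
            for name, seq in alignment.items()}

# Hypothetical four-strain alignment.
alignment = {
    "strainA": "ACGTTAGCATGC",
    "strainB": "ACG---GCATGC",
    "strainC": "ACG---GCA-GC",
    "strainD": "AC-----CATGC",
}
chars, matrix = code_indels(alignment)
combined = combine(alignment, matrix)
n_sites = len(alignment["strainA"])

for name, row in combined.items():
    print(name, row)
print(f"nucleotide partition: 1-{n_sites}; "
      f"indel partition: {n_sites + 1}-{n_sites + len(chars)}")
```

The two column ranges printed at the end are what a partitioned analysis would need: a nucleotide model on the first block and a binary model on the second.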

parsimony doesn't have to assume equal frequencies across all sites. you can use weighted parsimony, for example, assigning diffs at 3rd positions less weight than those at 2nd positions, transitions less weight than transversions, etc. this is not new. check out any of many papers from the 90s in MPE or MBE, or Evolution, etc., when parsimony was getting more attention before maximum likelihood hit the pavement. for gaps, the extension of an existing gap would get less weight than the opening of a new gap, etc. there are workable solutions for the use of parsimony.
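For anyone who hasn't seen weighted parsimony in practice, here is a minimal Python sketch of scoring one site on a fixed tree with a step matrix, using Sankoff's dynamic-programming algorithm. The weights (transitions cost 1, transversions cost 2), the tree, and the leaf names are assumptions chosen for illustration; gap opening and extension weights are not handled here.

```python
STATES = "ACGT"
PURINES = {"A", "G"}
INF = float("inf")

def step_cost(a, b, ts=1.0, tv=2.0):
    """Cost of changing state a into b: transitions are cheaper than transversions."""
    if a == b:
        return 0.0
    same_class = (a in PURINES) == (b in PURINES)
    return ts if same_class else tv

def sankoff(tree, leaf_states, ts=1.0, tv=2.0):
    """Map each state to the minimum cost of the subtree if its root has that state.

    tree is a leaf name (str) or a tuple of subtrees; leaf_states maps
    leaf name -> observed nucleotide at this site.
    """
    if isinstance(tree, str):
        return {s: (0.0 if s == leaf_states[tree] else INF) for s in STATES}
    child_costs = [sankoff(child, leaf_states, ts, tv) for child in tree]
    return {
        s: sum(min(step_cost(s, t, ts, tv) + costs[t] for t in STATES)
               for costs in child_costs)
        for s in STATES
    }

def site_score(tree, leaf_states, ts=1.0, tv=2.0):
    """Weighted parsimony length of one site on the given tree."""
    return min(sankoff(tree, leaf_states, ts, tv).values())

# Hypothetical four-taxon tree and two sites, each needing a single change.
tree = (("s1", "s2"), ("s3", "s4"))
transition_site = {"s1": "A", "s2": "G", "s3": "A", "s4": "A"}
transversion_site = {"s1": "A", "s2": "C", "s3": "A", "s4": "A"}

print(site_score(tree, transition_site))     # 1.0
print(site_score(tree, transversion_site))   # 2.0
print(site_score(tree, transversion_site, ts=1.0, tv=1.0))  # 1.0 (unweighted)
```

With equal weights both sites cost one step; with the step matrix the transversion counts double, which is the kind of weighting the comment describes.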

and, to state the obvious, also make sure you are stone confident in your sequence data, and that you have the best alignment possible. visually examine any computed alignment, and tweak by hand as necessary to be sure that any gaps are real and make sense.

Why not select those SNPs that are conserved (use UCSC's conservation db) and assign a weight to them in your analysis? Then a non-parametric method could be used for comparing ranks.

I have never read this blog before, so my answer might be quite insipid and out of place.....But I'm sure I read quite a lot about different methods for solving this problem in the book "Bioinformatics and Molecular Evolution" (Higgs/Attwood) some years ago. It is a reoccurring problem when calculating evolutionary distance between related sequences. I think it would at least be worth having a quick browse in the book at the library, or maybe through Amazon's online reader.

thanks
