Indels in Phylogenies: How Should We Treat Them?

This will be something of a technical post, but I've decided to pick the hivemind's brain. In some projects I'm involved with, we're generating lists of SNPs for a bunch of bacterial strains using Illumina sequencing.

For a given strain, anywhere from ten to forty percent of the SNPs are indels (actually 'SNP' is a bit of a misnomer because we can detect small multi-nucleotide insertions and deletions). Here's the question: is there any way to use maximum likelihood methods with indels? I could just use parsimony methods, and treat the indels as characters, but I don't want to lose information about the molecular evolutionary model. For the substitutions, it's pretty clear that they violate the de facto assumptions of parsimony (equal rates and transition frequencies across all sites), so I would like to use a method that incorporates a molecular model.

I would note that as it gets really cheap to scan entire microbial genomes for SNPs, this will be a problem we have to grapple with, so strap on yer thinkin' caps and come up with a solution!

Seriously, any ideas?

Update: While I was on vacation, I stumbled across this paper that uses DNAML to deal with gaps. While it's definitely an improvement, there are two problems: 1) it treats a gap that is larger than one character (e.g., "---") as multiple characters, with each gapped site counted as a separate character; 2) DNAML isn't very fast computationally (although maybe this modified version of DNAML could be implemented in fastDNAML?).
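
To make the first problem concrete, here is a minimal Python sketch. It is not taken from the paper or from DNAML; the function names are made up for illustration. It contrasts counting every gapped site with counting each contiguous run of gaps as a single indel event.

```python
import re

def gap_sites(seq):
    """Count gapped alignment columns in one aligned sequence."""
    return seq.count("-")

def gap_events(seq):
    """Return (start, end) for each contiguous run of '-' (end exclusive)."""
    return [(m.start(), m.end()) for m in re.finditer(r"-+", seq)]

aligned = "ACG---TTA-GC"
print(gap_sites(aligned))   # 4 gapped sites
print(gap_events(aligned))  # [(3, 6), (9, 10)]: two events
```

A per-site method sees four characters here, while a per-event view sees two deletions, which is arguably closer to what happened biologically.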

A program called Prankster was developed by Ari Loytynoja in Nick Goldman's group at EBI in the UK. It's designed to handle indels.

Articles describing it can be found here:

1. http://www.sciencemag.org/cgi/content/full/320/5883/1632?maxtoshow=&HIT…

2. http://www.pnas.org/content/102/30/10411.full?sid=0c046f3e-2c18-4311-ae…

3. http://www.pnas.org/content/102/30/10557.full

The program can be downloaded for Mac or Windows here: http://www.ebi.ac.uk/goldman-srv/prank/prankster/ (actual FTP site is http://www.ebi.ac.uk/goldman-srv/prank/src/prankster/windows/ ).

After installing, I used a 2-state model for my ITS-rDNA data set, built with the model-generating website: http://www.ebi.ac.uk/goldman-srv/prank/models/. It seems like several of these models might also be useful for aligning concatenated multi-locus data sets that evolve at different rates.

Goldman's group and the whole EMBL-EBI website have a pretty impressive set of tools, which I've barely browsed.

This doesn't answer your question but I doubt very much that the quality of your data justifies using maximum likelihood or any other sophisticated algorithm.

Many of your SNPs, including indels, are likely to be cloning and/or sequencing artifacts of one sort or another. You might as well stick with difference methods which are not only faster but have the advantage of allowing for various gap penalties to deal with indels. They also tend to swamp out errors.

I think David's right. Your best bet, based on what's out there, is to let the gaps be treated as missing, create a binary character state matrix coding for presence/absence of individual gaps, and run a partitioned analysis in MrBayes or wherever.
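
A rough sketch of that coding step in Python, for anyone who wants to try it. This is a simplified presence/absence coding, assuming the alignment is a dict of equal-length strings. The strain names and function names are hypothetical, and it does not handle overlapping gaps the way published indel-coding schemes do.

```python
import re

def gap_runs(seq):
    """Set of (start, end) coordinates for each run of '-' in one sequence."""
    return {(m.start(), m.end()) for m in re.finditer(r"-+", seq)}

def code_gaps(alignment):
    """Return (sorted list of gaps, {taxon: binary string}) for an alignment."""
    runs = {name: gap_runs(seq) for name, seq in alignment.items()}
    all_gaps = sorted(set().union(*runs.values()))
    matrix = {
        name: "".join("1" if gap in runs[name] else "0" for gap in all_gaps)
        for name in alignment
    }
    return all_gaps, matrix

# Toy alignment (made-up data)
alignment = {
    "strainA": "ACGTTTAGGC",
    "strainB": "ACG---AGGC",
    "strainC": "ACG---AG-C",
}
gaps, matrix = code_gaps(alignment)
for name, row in matrix.items():
    print(name, row)
```

Each column of the binary matrix is one whole gap, so a three-base deletion counts once. The matrix would then go in as its own partition next to the nucleotide data, with the gap positions in the sequences left as missing.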

parsimony doesn't have to assume equal frequencies across all sites. you can use weighted parsimony, for example, assigning diffs at 3rd positions less weight than those at 2nd positions, transitions less weight than transversions, etc. this is not new. check out any of many papers from the 90s in MPE or MBE, or Evolution, etc., when parsimony was getting more attention before maximum likelihood hit the pavement. for gaps, the extension of an existing gap would get less weight than the opening of a new gap, etc. there are workable solutions for the use of parsimony.
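
As an illustration of the weighting idea, here is a small Python sketch of Sankoff-style weighted parsimony for a single site. The costs (transitions 1, transversions 2), the tip names, and the nested-tuple tree format are assumptions chosen for the example, not values from any particular paper. It covers only transition/transversion weighting, not gap opening versus extension.

```python
PURINES = {"A", "G"}
STATES = "ACGT"

def step_cost(a, b, ts=1, tv=2):
    """Cost of changing a to b: 0 if equal, ts for transitions, tv for transversions."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return ts if same_class else tv

def sankoff(tree, tip_states):
    """Minimum weighted cost of one site on a rooted tree.

    tree: nested tuples with tip names as leaves, e.g. (("t1", "t2"), ("t3", "t4"))
    tip_states: {tip name: observed base}
    """
    def costs(node):
        if isinstance(node, str):  # a tip: only the observed state is allowed
            return {s: 0 if s == tip_states[node] else float("inf") for s in STATES}
        child_costs = [costs(child) for child in node]
        return {
            s: sum(min(step_cost(s, t) + c[t] for t in STATES) for c in child_costs)
            for s in STATES
        }
    return min(costs(tree).values())

tree = (("t1", "t2"), ("t3", "t4"))
print(sankoff(tree, {"t1": "A", "t2": "A", "t3": "C", "t4": "C"}))  # 2
print(sankoff(tree, {"t1": "A", "t2": "G", "t3": "A", "t4": "A"}))  # 1
```

Both sites need one change, but the transversion site costs twice as much as the transition site. Unweighted parsimony would score them the same.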

and, to state the obvious, also make sure you are stone confident in your sequence data, and that you have the best alignment possible. visually examine any computed alignment, and tweak by hand as necessary to be sure that any gaps are real and make sense.

Why not select those SNPs that are conserved (use UCSC's conservation db) and assign a weight to them in your analysis? Then a non-parametric method could be used for comparing ranks.

By Phil Stafford (not verified) on 03 Oct 2008

I have never read this blog before, so my answer might be quite insipid and out of place. But I'm sure I read quite a lot about different methods for solving this problem in the book "Bioinformatics and Molecular Evolution" (Higgs/Attwood) some years ago. It is a recurring problem when calculating evolutionary distance between related sequences. I think it would at least be worth having a quick browse in the book at the library, or maybe through Amazon's online reader.