Phylogeny Friday - 1 December 2006

By evolgen on December 1, 2006.

What if some phylogenies were simply irresolvable? That is, what if, no matter how much data we collected, it would be impossible to reconstruct, with a high level of certainty, an accurate representation of the tree of life? That would suck. A lot. I have mentioned how this can result from long branch attraction or lineage sorting. But are there any taxa where this appears to be a major problem?

Antonis Rokas and Sean Carroll have published an essay in PLoS Biology that addresses the issue of bushes (or irresolvable nodes) in the tree of life. They point out four clades in which no single tree has a high level of support:

The types of data considered are gene sequences, parsimony informative characters (PI-characters), and rare genomic changes (RGCs; insertions, deletions, and other events less common than nucleotide substitution). The four clades shown are (A) human/chimp/gorilla, (B) elephant/sirenian/hyrax, (C) tetrapod/coelacanth/lungfish (the vertebrate tree), and (D) chordate/arthropod/nematode (the metazoan tree). The number of data sets of each type that support each tree is shown, along with the percentage of data sets that support a particular tree. There is an obvious vertebrate and animal bias here, but that's also where most of the data are.

Rokas and Carroll show that, while some data highly support one phylogeny, none of the phylogenies are consistently supported by any of the larger data sets. Why is this? In all four of these examples, the external branches (those leading to the tips of the tree) are longer than the internal branches (those near the root). These trees are considered "bushy", and this type of topology may lead to incongruent results if not enough mutations accumulate on the short internal branch. It's the internal branch that tells you which two species should cluster together, and which one is the outgroup. Additionally, the same mutations can occur independently on different branches (homoplasy) which may mislead a tree reconstruction algorithm. This is shown in the following figure.

The trees on the left show the ideal scenario: long internal branches and lots of mutations on those branches. The trees on the right are a long branch attraction problem waiting to happen. The vertical hatches are mutations that support the correct tree, while the circles and x's are homoplasies that support an incorrect tree. The lengths of the branches represent the number of changes that have occurred along that particular lineage. When the internal branches get too short, there is more support for incorrect trees than for correct trees due to an excess of homoplasies.

How long must the internal branch be to overcome long branch attraction? That depends on the rate of homoplasy. If, for example, 5% of all changes are homoplasies, then the internal branch must be at least 5% as long as the external branches or you will recover the incorrect tree. If the external branches are too long, they can be broken up by adding more species to your sample (i.e., split one of the external branches into an internal branch with two shorter external branches). But that's only possible if there are more species to be sampled, and for some of the examples above there are not.
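The rule of thumb above is simple enough to write down as arithmetic. The following is only a back-of-the-envelope sketch of that rule as stated in this post, not a method from the Rokas and Carroll essay; the function names and the example branch lengths are made up for illustration.

```python
def min_internal_length(external_length, homoplasy_rate):
    """Shortest internal branch that still satisfies the rule of thumb.

    Assumption (from the post's example): the internal branch must be at
    least `homoplasy_rate` times as long as the external branches.
    Lengths are in expected numbers of changes.
    """
    return homoplasy_rate * external_length


def passes_rule_of_thumb(internal_length, external_length, homoplasy_rate):
    """True if the internal branch is long enough under the rule of thumb."""
    return internal_length >= min_internal_length(external_length, homoplasy_rate)


if __name__ == "__main__":
    # Hypothetical numbers: external branches of 100 changes, 5% homoplasy.
    print(min_internal_length(100, 0.05))
    print(passes_rule_of_thumb(10, 100, 0.05))  # internal branch of 10 changes
    print(passes_rule_of_thumb(2, 100, 0.05))   # internal branch of 2 changes
```

Halving the external branch length (for example, by sampling a species that splits it) halves the internal branch length needed, which is why denser taxon sampling helps when it is available.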

A common belief amongst systematists is that by adding more data, one can resolve a bushy tree. Rokas and Carroll point out, however, that extra data may only artificially increase the statistical confidence. Phylogeneticists measure the support for nodes by randomly resampling their data (with replacement) and reconstructing the tree from the resampled data. The process is repeated thousands of times for each data set, and the percentage of resampled data sets that support a particular node is reported. That value is known as a bootstrap value, and a higher value represents greater confidence in that node. It turns out that bootstrap values depend on the size of one's data set: the larger the data set, the greater the bootstrap value. So, even if a small, medium, and large data set all have 55% of the informative changes supporting one tree and 45% of the changes supporting another tree, the larger data set will give you a tree with higher bootstrap values. Rokas and Carroll argue that high bootstrap support for trees built using whole genome sequences (e.g., the metazoan tree) may lead to a false sense of confidence in that particular topology.
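The dependence of bootstrap support on data set size can be seen in a toy simulation. This is a deliberately simplified sketch, not real tree reconstruction: it assumes every informative site votes for either tree A or tree B, and that the "reconstructed" tree for each resample is whichever gets the majority of votes. The function name, the default of 1,000 replicates, and the data set sizes are all illustrative choices.

```python
import random


def bootstrap_support(n_sites, frac_a=0.55, n_reps=1000, seed=0):
    """Estimate bootstrap support for tree A in a two-tree toy model.

    n_sites: number of informative sites in the data set.
    frac_a:  fraction of sites that support tree A (the rest support B).
    Returns the fraction of bootstrap replicates in which A wins outright.
    """
    rng = random.Random(seed)
    n_a = round(n_sites * frac_a)
    sites = ["A"] * n_a + ["B"] * (n_sites - n_a)

    wins = 0
    for _ in range(n_reps):
        # Resample sites with replacement, same size as the original data.
        resample = rng.choices(sites, k=n_sites)
        count_a = resample.count("A")
        # "Reconstruct" the tree by majority vote; ties do not count for A.
        if count_a > n_sites - count_a:
            wins += 1
    return wins / n_reps


if __name__ == "__main__":
    # Same 55%/45% split of the signal, three different data set sizes.
    for n in (20, 200, 2000):
        print(n, bootstrap_support(n))
```

In this toy model the underlying conflict in the data is identical at every size, yet the support for tree A climbs toward 100% as sites are added, which is the pattern the paragraph above describes.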

Is all hope lost? Are we ever going to fully understand the tree of life? If the branching order of a clade is nearly impossible to resolve, it can be represented as a multifurcation (as opposed to the typical bifurcating tree). Also, the lengths of the branches provide a lot of information. For example, if there are a lot of short internal branches, we know that a clade experienced a rapid radiation. Rokas and Carroll encourage their readers to take this glass-half-full perspective rather than look at the bushiness of certain parts of the tree with chagrin.

Rokas A, Carroll SB. 2006. Bushes in the tree of life. PLoS Biol 4: e352. doi: 10.1371/journal.pbio.0040352


Discouraging, isn't it? Some people argue that we need to pay more attention to indels. Most of them are discarded by the standard programs but indels contain valuable information.

I note that the RGC's seem to be more discriminatory in the Rokas and Carroll paper. Now, if only we knew the true phylogeny so we could find out whether RGC's are better!
