What if some phylogenies were simply irresolvable? That is, what if, no matter how much data we collected, it would be impossible to reconstruct, with a high level of certainty, an accurate representation of the tree of life? That would suck. A lot. I have mentioned how this can result from long branch attraction or lineage sorting. But are there any taxa where this appears to be a major problem?
Antonis Rokas and Sean Carroll have published an essay in PLoS Biology that addresses the issue of bushes (or irresolvable nodes) in the tree of life. They point out four clades in which no single tree has a high level of support:
The types of data considered are gene sequences, parsimony informative characters (PI-characters), and rare genomic changes (RGCs; insertions, deletions, and other events less common than nucleotide substitution). The four clades shown are (A) human/chimp/gorilla, (B) elephant/sirenian/hyrax, (C) tetrapod/coelacanth/lungfish (the vertebrate tree), and (D) chordate/arthropod/nematode (the metazoan tree). The numbers of each type of data that support each tree are shown along with the percent of data sets that support a particular tree. There is an obvious vertebrate and animal bias here, but that's also where most of the data are.
Rokas and Carroll show that, while some data highly support one phylogeny, none of the phylogenies are consistently supported by any of the larger data sets. Why is this? In all four of these examples, the external branches (those leading to the tips of the tree) are longer than the internal branches (those near the root). These trees are considered "bushy", and this type of topology may lead to incongruent results if not enough mutations accumulate on the short internal branch. It's the internal branch that tells you which two species should cluster together, and which one is the outgroup. Additionally, the same mutations can occur independently on different branches (homoplasy) which may mislead a tree reconstruction algorithm. This is shown in the following figure.
The trees on the left show the ideal scenario: long internal branches and lots of mutations on those branches. The trees on the right are a long branch attraction problem waiting to happen. The vertical hatches are mutations that support the correct tree, while the circles and x's are homoplasies that support an incorrect tree. The lengths of the branches represent the number of changes that have occurred along that particular lineage. When the internal branches get too short, there is more support for incorrect trees than for correct trees due to an excess of homoplasies.
How long must the internal branch be, relative to the external branches, to overcome long branch attraction? That depends on the rate of homoplasy. If, for example, 5% of all changes are homoplasies, then the internal branch must be at least 5% as long as the external branches or you will recover the incorrect tree. If the external branches are too long, they can be broken up by adding more species to your sample (i.e., split one of the external branches into an internal branch with two shorter external branches). But that's only possible if there are more species to be sampled, and for some of the examples above there are not.
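To make the 5% rule of thumb concrete, here is a minimal Python sketch that takes it at face value: it compares the number of changes on the internal branch with the number of homoplasies expected along an external branch. The function names and the example branch lengths are invented for illustration, and real tree reconstruction is far less tidy than this.

```python
def misleading_changes(external_length, homoplasy_rate):
    """Expected number of homoplasies along one external branch."""
    return external_length * homoplasy_rate


def correct_tree_favored(internal_length, external_length, homoplasy_rate):
    """True if changes on the internal branch outnumber the expected homoplasies.

    This is only the back-of-the-envelope comparison described above,
    not a phylogenetic method.
    """
    return internal_length > misleading_changes(external_length, homoplasy_rate)


# Hypothetical branch lengths, measured in number of changes.
print(correct_tree_favored(10, 100, 0.05))  # internal branch is 10% of external
print(correct_tree_favored(3, 100, 0.05))   # internal branch is 3% of external
```

With a 5% homoplasy rate, an external branch of 100 changes is expected to carry about 5 misleading ones, so an internal branch of 10 changes clears the bar and one of 3 does not.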
A common belief amongst systematists is that by adding more data, one can resolve a bushy tree. Rokas and Carroll point out, however, that extra data may only artificially increase the statistical confidence. Phylogeneticists measure the significance of nodes by randomly resampling their data (with replacement) and reconstructing the tree using the resampled data. The process is repeated thousands of times for each data set, and the percentage of resampled data sets that support a particular node is reported. That value is known as a bootstrap value, and a higher value represents greater confidence in that node. It turns out that these bootstrap values are dependent on the size of one's data set -- the larger the data set, the greater the bootstrap value. So, even if a small, medium, and large data set all have 55% of the informative changes supporting one tree and 45% of the changes supporting another tree, the larger data set will give you a tree with higher bootstrap values. Rokas and Carroll argue that high bootstrap support for trees built using whole genome sequences (i.e., the metazoan tree) may lead to a false sense of confidence in that particular topology.
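The effect of data set size on bootstrap values can be seen in a toy simulation. The Python sketch below is not how phylogenetics software computes bootstrap support: it assumes each informative character simply votes for tree A or tree B, and that the tree with the majority of votes in a resampled data set wins. The function name, the data set sizes, and the number of replicates are all made up for the example.

```python
import random


def bootstrap_support(n_chars, frac_a=0.55, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which tree A wins a majority vote.

    Toy model (an assumption, not a real tree search): each character
    supports either tree A (1) or tree B (0).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_a = round(n_chars * frac_a)
    chars = [1] * n_a + [0] * (n_chars - n_a)
    wins = 0
    for _ in range(replicates):
        # Resample the characters with replacement, keeping the same size.
        resample = rng.choices(chars, k=n_chars)
        # Tree A wins only with a strict majority; ties do not count.
        if 2 * sum(resample) > n_chars:
            wins += 1
    return wins / replicates


# Same 55/45 split of informative characters, three data set sizes.
supports = {n: bootstrap_support(n) for n in (20, 200, 2000)}
for n, support in supports.items():
    print(f"{n:>5} characters: bootstrap support for tree A = {support:.2f}")
```

The proportion of characters favoring tree A is identical in all three cases; only the amount of data changes, yet the support for tree A climbs as the data set grows.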
Is all hope lost? Are we ever going to fully understand the tree of life? If the branching order of a clade is nearly impossible to resolve, it can be represented as a multifurcation (as opposed to the typical bifurcating tree). Also, the lengths of the branches provide a lot of information. For example, if there are a lot of short internal branches, we know that a clade experienced a rapid radiation. Rokas and Carroll encourage their readers to take this glass-half-full perspective rather than look at the bushiness of certain parts of the tree with chagrin.
Rokas A, Carroll SB. 2006. Bushes in the tree of life. PLoS Biol 4: e352. doi: 10.1371/journal.pbio.0040352
Discouraging, isn't it? Some people argue that we need to pay more attention to indels. Most of them are discarded by the standard programs but indels contain valuable information.
I note that the RGCs seem to be more discriminatory in the Rokas and Carroll paper. Now, if only we knew the true phylogeny so we could find out whether RGCs are better!