Sibley and Ahlquist's 'Tapestry'

Well, I really was very much enthused, inspired and uplifted by the many kind and supportive comments so many of you added to the previous article. Thank you all. So enthused, in fact, that I couldn't help myself, and took time out of lunch breaks and so on to produce 'ticking over' material for Tet Zoo.

[Image: cropped detail of the 'Tapestry' bird illustration]

As some of you know, I'm hard at work on a major project concerning birds at the moment (and, to the people who speculated that such a project might be lucrative... let me assure you that none of the science writing I do can be described as 'lucrative'; no money is involved in this project, for example. Do people realise how much work scientists do for free?). In view of this, I dug out the old image you see here (it was produced in 1997). What does it mean? Here's an excerpt from the text I'm currently working on...

[Image: cover of Sibley & Ahlquist (1990)]

What appeared to be an empirical approach to the study of the avian tree came to the fore during the 1980s and 90s when Charles Sibley and colleagues used DNA-DNA hybridization to analyse avian relationships. Sibley and Ahlquist (1990) produced a mostly resolved phylogeny, dubbed the 'Tapestry', for over 1100 species.

They supported a basal divergence between Eoaves (ratites and tinamous) and Neoaves, and broke the latter down into six major assemblages: Galloanserae (gamebirds and waterfowl), Turnicae (buttonquails), Picae (woodpeckers and kin), Coraciae (hornbills, trogons, rollers and kin), Coliae (colies) and Passerae (everything else, from cuckoos, parrots and pigeons to cranes, waders, raptors, herons, pelicans and passerines). While Sibley and Ahlquist's (1990) suggestions did much to inspire new work, their DNA-DNA hybridization technique was entirely phenetic and their conclusions were not always supported by their data (Harshman 1994, 2007).

There's a lot more that could be said, but this was mostly an excuse to re-use the old 'Tapestry' picture used at top. Feel free to discuss among yourselves. Lots more on bird phylogeny to appear here in time.

Refs

Harshman, J. 1994. Reweaving the Tapestry: what can we learn from Sibley and Ahlquist (1990)? Auk 111, 377-388.

Harshman, J. 2007. Classification and phylogeny of birds. In Jamieson, B. G. M. (ed) Reproductive Biology and Phylogeny of Birds. Science Publishers, Inc. (Enfield, NH), pp. 1-35.

Sibley, C. G. & Ahlquist, J. E. 1990. Phylogeny and Classification of Birds. New Haven: Yale University Press.


Yay! for "ticking over" material.

I hadn't commented yet on the "Death of..." post, but add me to the number of people who were alarmed at the headline and profoundly relieved that this only meant time out and not forever. Totally get needing the time...but it would be heartbreaking to lose this blog for good.

Re. the tapestry -- love the picture. Shame their conclusions weren't as good as they could be.

By Luna_the_cat (not verified) on 12 Feb 2010 #permalink

I couldn't help myself, and took time out of lunch breaks and so on to produce 'ticking over' material for Tet Zoo.

Yay! Many thanks for that!

I'm hard at work on a major project concerning birds

I don't know what it is but I have my suspicions...

Is that an ivory-billed woodpecker I see in the bird tree?

Yes, Sibley and Ahlquist's reach certainly exceeded their grasp, yet their tapestry still has a lot of support from less phylogenetic ornithologists and among amateur birders.

By the way, I also wanted to extend my compliments to you for all your hard work on this blog, Darren. It's one of several biology blogs I check on frequently. And for none of these favorite blogs do I feel cheated if there is no new material. I can wait. There is plenty for me to read and keep up with in science, but whenever my favorite blogger adds an interesting piece I will be there to devour it.

Do people realise how much work scientists do for free?

Only scientists and perhaps their spouses do, barely even their parents. Other people would never guess. One word: page charges.

Sibley and Ahlquist (1990) produced a mostly resolved phylogeny […] their DNA-DNA hybridization technique was entirely phenetic

Well, I wouldn't even say they produced a phylogeny. They produced a phenetic hypothesis and sold it as a phylogenetic hypothesis…

By David Marjanović (not verified) on 12 Feb 2010 #permalink

take as long of a break as you need whenever you need..you owe nothing..thanks for everything..we`ll be here..
sluggo

We are all glad to see a post again. I think you may just post less frequently as you finish your project.

"Do people realise how much work scientists do for free?"
That is a perennial issue in science.

Now identify all the others

I can't identify all, but here are a few (in no particular order) that I'm fairly confident I have down to correct species:

pintail duck
Lady Amherst's pheasant
saddle-billed stork
grey heron
northern bald ibis
whooping crane
great crested grebe
Atlantic puffin
great northern loon
bateleur
emperor penguin
greater flamingo
great white pelican

The bustard is an Afrotis species, and the mousebird is some Colius (not Urocolius) species. The frigatebird, the tropicbird and the pitta, at least, are probably identifiable to species level too (though not by me).

I like that drawing for the alertness in the birds. In illos like that, they usually look very static, but not in that one.

By Mike from Ottawa (not verified) on 12 Feb 2010 #permalink

I see a Saddle-billed Stork (Ephippiorhynchus senegalensis) towards the end.

I have to second Diego's comment (#4) above -- good job on managing such a great science blog!

I also wanted to mention a few other refs that may be of interest:

Shannon J. Hackett, et al. A Phylogenomic Study of Birds Reveals Their Evolutionary History Science 320, 1763 (2008); DOI: 10.1126/science.1157704

For some recent work on Parulidae (new world warblers): http://www.birds.cornell.edu/evb/Irby.htm

As mentioned in the page linked above, a number of groups have been coordinating efforts to refine relationships within Emberizidae. I'm not sure how much of that has been published yet -- it might still be "work in progress".

Dartian: I'm pretty confident that the parrot is a scarlet macaw. Although I guess it could be a green-winged macaw, depending if the clear face is an artifact of the resolution.

By Sebastian Marquez (not verified) on 12 Feb 2010 #permalink

I also hadn't commented on the previous thread but like so many others was momentarily aghast that the 'death' might be permanent. Glad to see that you can't stay away, Darren, and I hope that you are able to continue educating your readers as well as earning a crust and looking after your family!

I have always been a big fan of your artwork and get the feeling you probably have lots of these gems hanging around which would make excellent images for the blog.

Identifications I can try to make to add to Dartian's:

No 10: Burrowing Owl
No 11: Indian Pitta?
No 22: Not Red Billed Tropicbird (as no back barring) so must be Red-tailed or White-tailed
No 23: Masked Booby?

By RStretton (not verified) on 12 Feb 2010 #permalink

20 years since that book was published ... my, has time flown?

Darren,

Nice image sample of the "Tapestry". When I was a grad student, I had the full, species-level Tapestry taped around three walls of my office. And it was mighty impressive.

And I'd like to complain about you and David Marjanovic using "phenetic" as a swear word. There's nothing wrong with phenetic methods, per se. Like any methods, they produce good estimates of phylogeny to the extent that the data don't violate the assumptions of the method. UPGMA will give you a decent phylogeny if there is a very good molecular clock. Sibley & Ahlquist had some problems, not least of which was the failure of that molecular clock in quite a few spots. Then again, many of their most interesting results, particularly within passerines, have held up. And one other triumph selected at random: their charadriiform phylogeny was spot on.

One quibble: Eoaves was originally named by S&A to include paleognaths and Galloanserae, a novel -- and incorrect -- clade. And Neoaves was Eoaves' sister group, all other birds. When that didn't pan out, they tried to redefine the terms to include Galloanserae within Neoaves. Fortunately, everyone ignored them (since the names would have become junior synonyms of Palaeognathae and Neognathae). Eoaves died a deserved death, but Neoaves remains.

By John Harshman (not verified) on 12 Feb 2010 #permalink

There's nothing wrong with phenetic methods, per se. Like any methods, they produce good estimates of phylogeny to the extent that the data don't violate the assumptions of the method.

Well, yeah. The assumptions of the method of using phenetics to do phylogenetics are that there's little enough homoplasy in the data. And that's among the very possibilities that a phylogenetic analysis is supposed to test. Just assuming them is a spectacularly bad idea, as lots of empirical examples show, for instance every single case of long-branch attraction.

UPGMA will give you a decent phylogeny if there is a very good molecular clock.

Indeed, there just almost never is a halfway decent molecular clock.

Then again, many of their most interesting results, particularly within passerines, have held up. And one other triumph selected at random: their charadriiform phylogeny was spot on.

I think we can agree that precladistic morphological phylogenetics (an art rather than a science) was occasionally even worse than using molecular phenetics as phylogenetics. :-)

a novel -- and incorrect -- clade

Long-branch attraction of the outgroups to the long stem of Neoaves.

By David Marjanović (not verified) on 13 Feb 2010 #permalink

I still disagree. Long branch attraction is hardly a special problem of phenetic methods. Let's remember that it was first demonstrated using parsimony. Nor do all distance methods assume that homoplasy is rare; it depends on the distance measure -- everything from Jukes-Cantor distance on up attempts to correct for homoplasy.
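As a rough illustration of what "correcting for homoplasy" means for the simplest case, here is a minimal sketch of the standard Jukes-Cantor correction. The function names are made up for this example, and the formulas assume the Jukes-Cantor model (equal base frequencies, equal substitution rates); nothing here is taken from Sibley and Ahlquist's hybridization data.

```python
import math


def jukes_cantor(p):
    """Corrected distance (expected substitutions per site) from the
    observed proportion p of differing sites, under Jukes-Cantor."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


def expected_p(d):
    """Observed proportion of differing sites expected for a true
    distance d under the same model (the inverse of jukes_cantor)."""
    return 0.75 * (1 - math.exp(-4 * d / 3))


if __name__ == "__main__":
    for true_d in (0.1, 0.5, 1.0, 2.0):
        p = expected_p(true_d)
        print(f"true d = {true_d:.1f}  observed p = {p:.3f}  corrected = {jukes_cantor(p):.3f}")
```

The uncorrected proportion falls further and further below the true distance as divergence grows, because repeated changes at the same site are hidden; the correction adds them back, at the price of assuming the model is adequate.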

In some circles it's enough to sniff "feh, phenetic" in order to ignore an entire analysis. That's convenient, but it's lazy. Case in point: I tell you that S&A's phylogeny of charadriiforms was exactly right, and you sniff that precladistic morphological phylogenetics was even worse. Even worse than exactly right?

I will point out that there are phenetic (=distance?) methods that don't assume a clock, and they solve many of UPGMA's problems using the same data. Even so, within some clades, like passerines, there seems to be a good enough clock, and in others, like anseriforms, there's only one problematic node even using UPGMA. (The best published study of anseriform phylogeny is still Madsen et al. 1988, using UPGMA.)

And the problem with Eoaves wasn't exactly long-branch attraction. It was extrapolation of distance measures long past their useful range. UPGMA in S&A's results was actually more vulnerable to short-branch attraction.

By John Harshman (not verified) on 13 Feb 2010 #permalink

Cool old stuff...

... but what is this bird project?

Long branch attraction is hardly a special problem of phenetic methods.

No, but they're more susceptible to it than phylogenetic methods, even parsimony.

That's convenient, but it's lazy.

To the contrary. The only good reason there ever was for using phenetic instead of phylogenetic methods for doing phylogenetics is that the phenetic ones are a lot faster. But with today's computers, calculation time hardly ever is an issue anymore; parsimony is fast enough, and maximum likelihood and Bayesian inference are usually feasible, too.

phenetic (=distance?)

Yes. Phenetics measures the similarity ( = 1 - distance); phylogenetics counts the shared derived character states (whether transformed into likelihoods or not).

Similarity can be synapomorphic, symplesiomorphic, or homoplastic. Phylogenetic methods sometimes fail to distinguish these correctly; phenetic ones don't even try.

and you sniff that precladistic morphological phylogenetics was even worse. Even worse than exactly right?

I was making a general point, not one specific to this example.

The best published study of anseriform phylogeny is still Madsen et al. 1988

Why?

UPGMA in S&A's results was actually more vulnerable to short-branch attraction.

That's the same thing: it means that the long branches wander off together, so that the short branches are left.

By David Marjanović (not verified) on 13 Feb 2010 #permalink

No, but they're more susceptible to it than phylogenetic methods, even parsimony.

Several studies have found this not to be the case, particularly when the distances used match the model under which the characters evolved. What do you have on that? And by the way, I object to the claim that phenetic methods aren't phylogenetic.

Likelihood doesn't count shared derived character states. It counts all states, and applies the same computations to all.

The best published study of anseriform phylogeny is still Madsen et al. 1988

Why?

Because it gets the phylogeny closest to right. Unfortunately I can't show you my criteria for "right", since my study (which I hope we will all agree is right by definition) isn't yet published.

Finally, I deny that long-branch and short-branch attraction are at all the same thing. They may have the same result, but the reasons are wholly different. UPGMA is subject to SBA simply because it clusters the most similar taxa; all you need for a mistake is enough variation in evolutionary rate. LBA, on the other hand, requires non-additive distances, i.e. that the measured distances are increasingly underestimates of the true distance as true distance increases.

By John Harshman (not verified) on 13 Feb 2010 #permalink

Oh, and by the way, just what is your mysterious bird project? Or is it secret?

By John Harshman (not verified) on 13 Feb 2010 #permalink

Thanks to all for comments. Interesting stuff on phenetic techniques, thanks John. What you say has made me re-think my opinion on this subject.

Birds in 'Tapestry' pic are meant to be (from left to right): Pintail, Lady Amherst's pheasant, Andalusian hemipode, Ivory-billed woodpecker, Lilac-breasted roller, Speckled mousebird, Coccyzus cuckoo, generic macaw, generic hummingbird, Burrowing owl, Banded pitta, Australian raven, Desert sparrow, Rufous turtle dove, Black korhaan, Purple gallinule, Whooping crane, wader (Redshank?), Atlantic puffin, Bateleur, Great crested grebe, Red-tailed tropicbird, Masked booby, Purple heron, Bald ibis, Greater flamingo, White pelican, Saddle-billed stork, frigate bird, Emperor penguin, Great northern diver and Wandering albatross.

What do you have on that?

I'll try to look for references tomorrow.

And by the way, I object to the claim that phenetic methods aren't phylogenetic.

I can only repeat: "Similarity can be synapomorphic, symplesiomorphic, or homoplastic. Phylogenetic methods sometimes fail to distinguish these correctly; phenetic ones don't even try." They have no qualms about clustering taxa based on their symplesiomorphies if those are just numerous enough.

By David Marjanović (not verified) on 14 Feb 2010 #permalink

David,

Your criticism is not of phenetic methods, but of phenetic methods that assume a molecular clock. Those are the only ones that would cluster based on symplesiomorphies. (That's the short branch attraction I was talking about). Other methods, like least squares, do no such thing.

I can only repeat: any method is phylogenetic if it's intended to estimate phylogeny, and it will correctly estimate that phylogeny to the extent that its assumptions fit the data. Parsimony will perform better than UPGMA under some circumstances, but UPGMA will perform better than parsimony under others. I agree that the former case is more common than the latter, though of course parsimony is inapplicable to DNA hybridization. And UPGMA is hardly the only phenetic method. Cases in which, say, neighbor-joining outperforms parsimony are more common, and fairly easy to simulate.

By John Harshman (not verified) on 14 Feb 2010 #permalink

I have a question about phenetics. We always hear it's inferior to cladistics because the latter only uses shared DERIVED characters, while phenetics uses any shared metric. But the only way PAUP knows a character is derived is via comparison to the outgroup. And it only holds true for a part of the tree anyway. What's a symplesiomorphy among Theropoda basally could be a synapomorphy within Tyrannosauroidea. Also, more and more authors are recognizing that it's best to split characters into multiple states, or even have them fully quantified as TNT allows. This is similar to many traditional phenetic analyses. So really, the only difference is that cladistic analyses root their trees automatically based on the outgroup, while you'd have to manually root a phenetic tree, no?

By Mickey Mortimer (not verified) on 14 Feb 2010 #permalink

No. Even if you fix the root in advance, phenetic analyses can cluster taxa based, ultimately, on character states that parsimony would optimize as symplesiomorphies.

Phenetics doesn't look at the individual characters. It transforms the character matrix into a distance matrix and then works on the distance matrix. Cladistics... parsimony at least transforms the character matrix into a difference matrix and then works on the difference matrix, which shows which OTUs differ in which characters, something that's ignored in phenetics, where the sum total of the distance between two OTUs is used instead.
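To make the character-matrix-to-distance-matrix step concrete, here is a small sketch. The four OTUs and their binary characters are invented for illustration, and a plain count of differing characters (Hamming distance) stands in for whatever distance measure a real analysis would use.

```python
from itertools import combinations

# Hypothetical character matrix: four OTUs scored for five binary characters.
MATRIX = {"W": "00000", "X": "11000", "Y": "11100", "Z": "11011"}


def differing_characters(matrix, s, t):
    """Indices of the characters in which OTUs s and t differ."""
    return [i for i, (x, y) in enumerate(zip(matrix[s], matrix[t])) if x != y]


def distance_matrix(matrix):
    """Pairwise distances: the number of differing characters per pair."""
    return {
        (s, t): len(differing_characters(matrix, s, t))
        for s, t in combinations(sorted(matrix), 2)
    }


if __name__ == "__main__":
    for pair, dist in distance_matrix(MATRIX).items():
        print(pair, dist, differing_characters(MATRIX, *pair))
```

In this toy matrix the pairs W-X and X-Z have the same distance (2) but differ in entirely different characters; the distance matrix alone keeps the totals and discards which characters produced them.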

Models of evolution can be applied to both. In the case of parsimony, you get max. likelihood or, AFAIK, Bayesian inference. In the case of phenetics, you get neighbor-joining, AFAIK.

Test if the <pre> tag works:

test
test
test

By David Marjanović (not verified) on 14 Feb 2010 #permalink

Test utterly failed. It works on Pharyngula...

By David Marjanović (not verified) on 14 Feb 2010 #permalink

I have to say that considering maximum likelihood a variety of parsimony is perverse, and would scandalize everyone from Farris to Felsenstein. Then again, it's possible to consider parsimony a variety of maximum likelihood, e.g. the "no common mechanism" model of Tuffley & Steele.

It might perhaps be better to say that parsimony (at least when characters are reversible) considers not synapomorphies, per se, but characters that are best optimized as changing state at internal nodes. Now, if a state changes, one of them is apomorphic, though you don't know which until you root the tree.

What is supposed to do?

By John Harshman (not verified) on 15 Feb 2010 #permalink

What is supposed to do?

Assuming you meant "What is <pre> supposed to do?", it lets you post a phylogenetic tree without resorting to HTML entity shenanigans (replace " " with "&nbsp; " globally) like this (which I hope works):

(This was originally posted by David Marjanović in a comment on Pharyngula -- I'm not putting it in a blockquote because I'm uncertain that it will work like that)

--Dinosauria
    |--Ornithischia
    `--Saurischia
          |--Sauropodomorpha
          `--Theropoda
              |--Coelophysoidea
              `--Neotheropoda
                    |--Ceratosauria
                    `--Tetanurae
                        |--Spinosauroidea
                        `--Neotetanurae
                              |--Allosauroidea
                              `--Coelurosauria

...have to break the line here, or it'll get unreadable...

Coelurosauria
  |--Tyrannosauroidea
  `--Maniraptoriformes
      |--Ornithomimosauria
      `--Maniraptora
            |--Oviraptorosauria
            `--Eumaniraptora
                |--Aves
                `--Deinonychosauria
                      |--Dromaeosauridae
                      `--Troodontidae

By Owlmirror (not verified) on 15 Feb 2010 #permalink

Just to clarify: that was done by a search on "[space char][space char]" and replace with "&nbsp;[space char]"

By Owlmirror (not verified) on 15 Feb 2010 #permalink

Might as well test a subsegment of it:

--Dinosauria
    |--Ornithischia
    `--Saurischia
          |--Sauropodomorpha
          `--Theropoda
              |--Coelophysoidea

By Owlmirror (not verified) on 15 Feb 2010 #permalink

*sigh*

Yeah, I was afraid that would happen.

Take two, with explicit line breaks (<br>) added:

--Dinosauria
    |--Ornithischia
    `--Saurischia
          |--Sauropodomorpha
          `--Theropoda
              |--Coelophysoidea
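The workaround described in these comments (swap each pair of spaces for "&nbsp; " and add an explicit <br> at each line end) can be sketched as follows. The function name is hypothetical, and this only mirrors the search-and-replace described above; it makes no claim about how any particular blog platform filters comments.

```python
def preserve_indent(tree_text):
    """Prepare an ASCII tree for a comment box that collapses whitespace:
    replace every pair of spaces with '&nbsp; ' and join lines with <br>."""
    lines = tree_text.splitlines()
    return "<br>\n".join(line.replace("  ", "&nbsp; ") for line in lines)


if __name__ == "__main__":
    tree = "--Dinosauria\n    |--Ornithischia\n    `--Saurischia"
    print(preserve_indent(tree))
```

Because the replacement works on pairs, a run with an odd number of spaces still ends in two ordinary spaces that a browser will collapse, so indents are best kept to even widths.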

By Owlmirror (not verified) on 15 Feb 2010 #permalink

"Phenetics doesn't look at the individual characters. It transforms the character matrix into a distance matrix and then works on the distance matrix."

Aha, well that is quite the difference then. That's the thing that should be emphasized when contrasting the methods, as equal distances aren't guaranteed to be based on the same differences.

By Mickey Mortimer (not verified) on 15 Feb 2010 #permalink

That's the thing that should be emphasized when contrasting the methods, as equal distances aren't guaranteed to be based on the same differences.

I've been trying and failing to make sense of that comment. If equal distances are based on the same differences, i.e. if distances AB and AC are the same differences, then the distance BC will equal zero, and B and C are the same taxon. If, on the other hand, they're based on different differences, there will be a distance BC, and you will be able to use the combination of the three distances to determine the branch lengths of a 3-taxon tree. There are a lot of misconceptions and canards about distance methods, most of them coming from those with a philosophical prejudice against anything other than parsimony.
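The three-taxon case mentioned above has a simple closed form: with taxa A, B and C joined at a single internal node, each terminal branch is half of (the two distances involving that taxon, minus the third). A minimal sketch, with made-up distances and a hypothetical function name:

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Terminal branch lengths (a, b, c) of the unique 3-taxon tree
    whose path lengths reproduce the three pairwise distances."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


if __name__ == "__main__":
    # Different differences: all three branches get a length.
    print(three_taxon_branch_lengths(5, 7, 6))
    # AB and AC rest on the same differences: BC is 0, B and C coincide.
    print(three_taxon_branch_lengths(4, 4, 0))
```

The second call is the degenerate case described in the comment: when distance BC is zero, the branches leading to B and C both have length zero.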

By John Harshman (not verified) on 15 Feb 2010 #permalink

"I've been trying and failing to make sense of that comment."

Your explanation makes sense, but I can't see how it and David's statement about phenetics being more prone to symplesiomorphies can both be true. Assuming that despite being combined into a distance matrix, the combination of inter-taxon distances allows the change within any character to be mapped onto the most parsimonious network, I don't see how it's different from an unrooted cladistic analysis. Any taxa that are clustering successively closer to the most plesiomorphic taxon (based on having more symplesiomorphies, by definition) would actually/also be clustering successively further from the apex (based on having fewer apomorphies) if we define the most plesiomorphic taxon as the outgroup. Any other taxa clustering based on different plesiomorphies would actually be clustering based on apomorphies, since any plesiomorphy not found in (or evolved convergently from) the outgroup is a local apomorphy. Correct?

By Mickey Mortimer (not verified) on 15 Feb 2010 #permalink

If, on the other hand, they're based on different differences, there will be a distance BC, and you will be able to use the combination of the three distances to determine the branch lengths of a 3-taxon tree.

The branch lengths in terms of distance -- not in terms of number of state changes (such as substitutions).

I'll try to look the rest up later. It was once explained to me exactly where the difference lies, but I can't quite remember it.

By David Marjanović (not verified) on 16 Feb 2010 #permalink

I can't see how it and David's statement about phenetics being more prone to symplesiomorphies can both be true.

I can't interpret that either. I think David is confusing distance methods with a subset of distance methods, those that assume a clock, like UPGMA. If you assume a clock, the most similar taxa will be clustered. Thus if two species have lots of plesiomorphies and few apomorphies, they will go together. However, distance methods, like neighbor-joining, that don't assume a clock don't have this problem. Still, there are many differences between parsimony methods and distance methods. Nor is a UPGMA cluster necessarily going to give you the same unrooted tree as parsimony would. The lengths of terminal branches are irrelevant to parsimony, but often crucial to UPGMA, for example. Consider this tree:

B C
A_____|_|_______D

Suppose there's enough information on the central branch, and not obscured by the other branches, for parsimony to get the correct topology. But UPGMA will cluster B and C. There's no way to turn that into the correct unrooted tree. (Note that this is true for methods that assume a clock, not for all distance methods.)
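The four-taxon example can be tried numerically. The sketch below assumes invented branch lengths (terminal branches A=5, B=1, C=1, D=7 and an internal branch of 1, so the true unrooted tree groups A with B and C with D) and perfectly additive distances. It runs a bare-bones UPGMA and, for contrast, computes only the first pair that the neighbor-joining criterion would join; it is an illustration, not a full implementation of either method.

```python
from itertools import combinations

# Hypothetical additive distances for the true tree ((A,B),(C,D)) with
# terminal branches A=5, B=1, C=1, D=7 and an internal branch of 1.
DIST = {
    frozenset("AB"): 6, frozenset("AC"): 7, frozenset("AD"): 13,
    frozenset("BC"): 3, frozenset("BD"): 9, frozenset("CD"): 8,
}
TAXA = ["A", "B", "C", "D"]


def d(x, y):
    return DIST[frozenset((x, y))]


def leaves(node):
    """Flatten a nested-tuple cluster into a list of leaf names."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def avg_dist(c1, c2):
    """UPGMA cluster distance: mean of all between-cluster leaf distances."""
    l1, l2 = leaves(c1), leaves(c2)
    return sum(d(x, y) for x in l1 for y in l2) / (len(l1) * len(l2))


def upgma(taxa):
    """Repeatedly merge the two closest clusters; returns nested tuples."""
    clusters = list(taxa)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append((a, b))
    return clusters[0]


def nj_first_join(taxa):
    """First pair chosen by the neighbor-joining criterion
    Q(i, j) = (n - 2) * d(i, j) - r_i - r_j, where r is the row sum."""
    n = len(taxa)
    r = {x: sum(d(x, y) for y in taxa if y != x) for x in taxa}
    return min(combinations(taxa, 2),
               key=lambda p: (n - 2) * d(*p) - r[p[0]] - r[p[1]])


if __name__ == "__main__":
    print("UPGMA:", upgma(TAXA))
    print("NJ first join:", nj_first_join(TAXA))
```

With these numbers UPGMA joins B and C first, simply because they are the most similar pair, while the neighbor-joining criterion picks A+B (C+D ties with it), which is consistent with the true tree. That matches the caveat in the comment: the failure belongs to the clock assumption, not to distances as such.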

The branch lengths in terms of distance -- not in terms of number of state changes (such as substitutions).

Distances are measured in number of state changes, or at least can be transformed into state changes if you multiply percent difference by sequence length. Under some circumstances, parsimony gives a better estimate of the true number of state changes; then again, under other circumstances, distance gives a better estimate.

By John Harshman (not verified) on 16 Feb 2010 #permalink

OK, that didn't work. All the leading spaces in the first line of the tree went away. Why? And how could it be fixed?

In the mean time, let's turn the tree sideways:

A
|
|
|
--B
|
--C
|
|
|
|
D

Not so pretty, but at least the labels go where they should.

By John Harshman (not verified) on 16 Feb 2010 #permalink

Not sure if this will work:

      B C
A_____|_|_______D

All the leading spaces in the first line of the tree went away. Why?

Because all multiple spaces are collapsed into a single space when HTML is displayed.

And how could it be fixed?

By using workarounds like the substitutions I made above. Or by using the <pre> tag, which would first require Darren changing his blog settings so as to allow the <pre> tag (it is currently simply stripped out from the comment when posted).
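A rough way to see what happened to the first line of the tree is to imitate the collapsing rule in code. This is only an approximation of default browser behaviour (runs of ordinary whitespace shown as one space, leading spaces on a line dropped), written with a hypothetical helper name; real rendering is governed by the browser and the page's CSS.

```python
import re


def collapse_whitespace(line):
    """Approximate default HTML rendering of one line of text: runs of
    spaces, tabs and newlines become a single space, and ordinary spaces
    at the start or end of the line disappear. Non-breaking spaces
    (U+00A0) are left alone."""
    return re.sub(r"[ \t\r\n]+", " ", line).strip(" ")


if __name__ == "__main__":
    print(repr(collapse_whitespace("      B C")))          # leading spaces lost
    print(repr(collapse_whitespace("A_____|_|_______D")))  # unchanged
    print(repr(collapse_whitespace("\u00a0 \u00a0 B C")))  # nbsp survives
```

The third call shows why alternating non-breaking and ordinary spaces works as a workaround: no run of ordinary spaces is ever longer than one.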

By Owlmirror (not verified) on 16 Feb 2010 #permalink

Not sure if this will work:

     B C
A_____|_|_______D

Sort of, and I imagine it could be fixed with a little tweaking. What did you do? It looks like spaces to me, and yet it can be cut and pasted.

By John Harshman (not verified) on 16 Feb 2010 #permalink

            B C
A_____|_|_______D

By John Harshman (not verified) on 16 Feb 2010 #permalink

Regarding your example, you're saying the correct tree is ((AB)(CD)), but UPGMA would produce (D(A(B,C)))? I think this is actually an example of what I meant by "equal distances aren't guaranteed to be based on the same differences". Let's say your example is changed slightly so that D and A are each exactly the same distance from the B+C clade, but based on different characters. According to you, B and C would form a clade because they're the most similar. Which would then leave us with (D,A(B,C)), right? Since it's not like D or A are similar at all to each other, and neither is more similar to B+C than the other. Sure D and A are more dissimilar from each other than either is to B+C, but I can't see how that could be reflected in tree topology (though you could tell by their distances of course). So then the branch leading to the B+C clade would be based on different characters depending on whether you're comparing D or A, right? Or am I completely misunderstanding?

By Mickey Mortimer (not verified) on 16 Feb 2010 #permalink

What did you do? It looks like spaces to me, and yet it can be cut and pasted.

The HTML standard, and the UTF-8 character set, permits blank spaces of specific widths in non-fixed-width fonts - &nbsp; (non-breaking space), &ensp; (en-space; a space half as wide as an em), and &emsp; (em-space; a space as wide as the height of the font).

..............................................
]&nbsp;[, ]&ensp;[, ]&emsp;[
..............................................

Will render as:

..............................................
] [, ] [, ] [
..............................................

Note the (slightly) varying widths, narrowest to widest.

I wrote the lines as:

&emsp; &nbsp; &emsp;B C
A_____|_|_______D

Thus:

     B C
A_____|_|_______D
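For reference, the three entities used here decode to three different Unicode characters, which is why they can be copied and pasted like text even though they look like spaces. A short check using Python's standard library (the helper name is made up for this example):

```python
import html
import unicodedata


def entity_characters(entities=("&nbsp;", "&ensp;", "&emsp;")):
    """Map each HTML entity to (code point label, Unicode character name)."""
    result = {}
    for entity in entities:
        ch = html.unescape(entity)
        result[entity] = (f"U+{ord(ch):04X}", unicodedata.name(ch))
    return result


if __name__ == "__main__":
    for entity, (code_point, name) in entity_characters().items():
        print(entity, code_point, name)
```

How wide each character is actually drawn depends on the browser and font, which fits the differing reports in this thread.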

By Owlmirror (not verified) on 16 Feb 2010 #permalink

Actually, there are more spacing options than I first realized; see:

http://en.wikipedia.org/wiki/Space_%28punctuation%29

Redoing the tree a bit, using an &ensp; in the middle instead of the &nbsp;, shifts the letters over the middle of the lower verticals instead of slightly offset:

     B C
A_____|_|_______D

By Owlmirror (not verified) on 16 Feb 2010 #permalink

Distances are measured in number of state changes, or at least can be transformed into state changes if you multiply percent difference by sequence length.

Sorry, that's true once you know the proportion of parsimony-uninformative characters.

I tried to find more on the differences between distance methods and parsimony, but I had misremembered the PAUP 3.1 handbook… what I had remembered was about how a heuristic search for MPTs works.

Anyway… forget symplesiomorphies. Can't characters that are parsimony-uninformative because one state is autapomorphic for a single OTU contribute to increasing the distance between that OTU and its closest relative? I guess they can do so to such an extent as to result in the attraction of such a long branch to the root. Parsimony is immune to this particular effect*, because all parsimony-uninformative characters are simply ignored by the analysis.

* Of course, both distance methods and parsimony are vulnerable to other sources of long-/short-branch attraction/repulsion. Neighbor-joining can overcome this to some extent by adding a model of evolution, but if I can find a model, I'd use likelihood and/or Bayesian inference in the first place… and parsimony is less vulnerable to heterotachy than methods that use a model. It's complicated. :-)
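The question about autapomorphies can be made concrete with a toy matrix. Everything below is invented for illustration: four OTUs, six binary characters, and the usual working definition of a parsimony-informative character (at least two states, each present in at least two taxa). It shows only how uninformative characters inflate a raw distance; whether a given distance method is misled by that depends on the method, as the replies in this thread argue.

```python
from collections import Counter

# Hypothetical matrix: R and S share two derived states (characters 0-1);
# S additionally has four autapomorphies (characters 2-5).
MATRIX = {"P": "000000", "Q": "000000", "R": "110000", "S": "111111"}


def informative_columns(matrix):
    """Indices of parsimony-informative characters: at least two states,
    each occurring in at least two taxa."""
    rows = list(matrix.values())
    columns = []
    for i in range(len(rows[0])):
        counts = Counter(row[i] for row in rows)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            columns.append(i)
    return columns


def distance(matrix, s, t, columns=None):
    """Number of differing characters between s and t, optionally
    restricted to the given character indices."""
    if columns is None:
        columns = range(len(matrix[s]))
    return sum(matrix[s][i] != matrix[t][i] for i in columns)


if __name__ == "__main__":
    cols = informative_columns(MATRIX)
    print("informative characters:", cols)
    print("R-S, all characters:", distance(MATRIX, "R", "S"))
    print("R-S, informative only:", distance(MATRIX, "R", "S", cols))
    print("P-R, all characters:", distance(MATRIX, "P", "R"))
```

Here R ends up closer to P than to S in raw distance even though R and S share both informative derived states, purely because of S's autapomorphies.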

Regarding your example, you're saying the correct tree is ((AB)(CD)), but UPGMA would produce (D(A(B,C)))?

Certainly. UPGMA (like for that matter neighbor-joining) makes rooted trees, not unrooted ones like parsimony, and the root is in the longest branch.

Note the (slightly) varying widths, narrowest to widest.

Not on my screen. 6 pixels each, Opera 10.10 for Mac. In a couple of hours I'll try at home, IE8 for WinXP… but I don't think that'll change anything.

Thus:

That example has two leading spaces in the first line.

Redoing the tree a bit, using an &ensp; in the middle instead of the &nbsp;, shifts the letters over the middle of the lower verticals instead of slightly offset:

Except there are no leading spaces at all in that example.

By David Marjanović (not verified) on 17 Feb 2010 #permalink

Not on my screen. 6 pixels each, Opera 10.10 for Mac.

Huh. I use Firefox on Win32, but I just checked now with Opera 10.10 on the same machine, and I see that it is indeed not rendered correctly -- it appears to treat ensp and emsp as equivalent to an ordinary space!

IE6SP1, also on the same machine, does a half-assed job -- ensp and emsp appear to be the same width, slightly wider than a space.

By Owlmirror (not verified) on 17 Feb 2010 #permalink

Mickey,

Sure, it's possible for A and D to both be the same distance from BC, and of course those distances are based on different characters, because the major components of distance are the terminal branches in each case. I don't see why that's a problem. Now what tree would be produced? I don't recall that UPGMA has a procedure for dealing with this, but if it did, you would have a polytomy: clades AB, C, and D all sprouting independently from a basal node. You would most certainly not see a CD clade, if that's what you're getting at. Now as for the branch leading to BC, it's not based on characters at all, and can't be construed as representing any characters. In UPGMA, it's just a result of an algorithm, nothing more.

David,

The proportion of parsimony-informative characters has nothing to do with changing distances to transformations. Some transformations are autapomorphic, that's all. Sure, autapomorphies are part of a distance. Of course they're also part of a patristic distance estimated under parsimony. I don't see the distinction. And sure, in UPGMA, long terminal branches result in error; that's because they violate the assumptions of the UPGMA method. (Though again, this is actually, because of the way the algorithm works, short-branch attraction rather than the converse.) I repeat that this is a feature not of distance methods but of distance methods that assume a clock. Please remember that distinction.

Are neighbor-joining trees intrinsically rooted? I'm not sure about this. They are generally considered to be an approximation (and quite a good one) of least-squares trees, which are not rooted. They may be considered rooted by convention. Check that PAUP manual.

Finally, I would be interested in a reference for that claim about heterotachy. I'm dubious.

Owlmirror:

            B C
A_____|_|_______D

OK, that worked, at least in preview. It was difficult because the "&emsp" characters disappear into spaces -- in the original text -- as soon as you try to preview. And oddly, I had to use 11 of them. Some em space -- only about half an average character width.

By John Harshman (not verified) on 17 Feb 2010 #permalink

And the width in preview again doesn't match the width when I see it posted. Very frustrating. I suspect different users are also seeing different shapes. So far, I don't see a single version of this simple tree where the B and C are right above their branches. Post #42 comes closest for me, just slightly to the right. #48 is a bit farther to the right. All the others are way to the left. How about you?

By John Harshman (not verified) on 17 Feb 2010 #permalink

Using Firefox 3.5.7 on Win32, #45 is dead-on perfect, while #44 is just a few pixels to the left. I see the same in IE7 (not 6 or 8) on WinXP. On Opera, as noted above, it doesn't work at all.

Using Firefox 3.5.7, your #42 and #48 have the "B C" so far to the right that they are past the final "D" on the second line.

Using IE6, nothing looks right -- mine are all too far to the left, while yours are too much to the right, but not as far to the right as in Firefox (they appear between the right vertical bar and the "D" at the end on the second line).

Which browser are you using, on which OS?

By Owlmirror (not verified) on 17 Feb 2010 #permalink

I'm using Firefox 3.5.7 with MacOS 10.5.8.

By John Harshman (not verified) on 17 Feb 2010 #permalink

Not on my screen. 6 pixels each, Opera 10.10 for Mac.

As I suspected, IE8 for WinXP works just fine, places everything correctly, and distinguishes all three space widths.

Now as for the branch leading to BC, it's not based on characters at all, and can't be construed as representing any characters. In UPGMA, it's just a result of an algorithm, nothing more.

Well, bad.

Sure, autapomorphies are part of a distance. Of course they're also part of a patristic distance estimated under parsimony. I don't see the distinction.

That distance is used for building the tree in distance methods. In parsimony, it's completely ignored; you can read it from the cladogram afterwards.

Are neighbor-joining trees intrinsically rooted? I'm not sure about this. [...] Check that PAUP manual.

I'm probably wrong, I now remember seeing unrooted-looking (star-like) neighbor-joining trees...

Unfortunately there's no manual for PAUP* 4.0... probably because there is no PAUP* 4.0 yet, only the 10th beta version which hasn't been superseded since 2003. The manual that exists dates from 1993 and is for PAUP 3.1, which wasn't able to do anything but parsimony (PAUP* 4.0b10 does UPGMA, probably WPGMA, neighbor-joining, parsimony, and max. likelihood for molecular data). There's a command reference for PAUP* 4.0b1, but of course it doesn't explain what neighbor-joining is. I'll try the MacClade manual tomorrow, but I don't think it'll help either...

Finally, I would be interested in a reference for that claim about heterotachy.

Bryan Kolaczkowski & Joseph W. Thornton: Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous, Nature 431, 980 -- 984 (21 October 2004)

Abstract with citations removed:

All inferences in comparative biology depend on accurate estimates of evolutionary relationships. Recent phylogenetic analyses have turned away from maximum parsimony towards the probabilistic techniques of maximum likelihood and bayesian Markov chain Monte Carlo (BMCMC). These probabilistic techniques represent a parametric approach to statistical phylogenetics, because their criterion for evaluating a topology -- the probability of the data, given the tree -- is calculated with reference to an explicit evolutionary model from which the data are assumed to be identically distributed. Maximum parsimony can be considered nonparametric, because trees are evaluated on the basis of a general metric -- the minimum number of character state changes required to generate the data on a given tree -- without assuming a specific distribution. The shift to parametric methods was spurred, in large part, by studies showing that although both approaches perform well most of the time, maximum parsimony is strongly biased towards recovering an incorrect tree under certain combinations of branch lengths, whereas maximum likelihood is not. All these evaluations simulated sequences by a largely homogeneous evolutionary process in which data are identically distributed. There is ample evidence, however, that real-world gene sequences evolve heterogeneously and are not identically distributed. Here we show that maximum likelihood and BMCMC can become strongly biased and statistically inconsistent when the rates at which sequence sites evolve change non-identically over time. Maximum parsimony performs substantially better than current parametric methods over a wide range of conditions tested, including moderate heterogeneity and phylogenetic problems not normally considered difficult.

The problem is that model-based methods always have too few rate categories in the model, usually just four (because increasing that number increases calculation time exponentially). Parsimony doesn't assume that any character evolves at the same speed as any other.

Quote from p. 983:

With real sequences, we do not know the true number of branch length partitions, so imposed models will usually use either too many or too few branch length parameters. For many sequences, the actual number of branch length categories may approach the number of sites; under these conditions, the true one-category-per-site likelihood model is formally equivalent to maximum parsimony [25].

...and the second-to-last sentence:

At present, we recommend reporting nonparametric analyses along with parametric results and interpreting likelihood-based inferences with the same caution now applied to maximum parsimony trees.

Make sure you read the supplementary information, too!

There are more papers on this topic, I'll try to find some tomorrow.

Ref. 25, which I haven't read, is:

C. Tuffley & M. Steel (1997): Links between maximum likelihood and maximum parsimony under a simple model of site substitution, Bull. Math. Biol. 59, 581 -- 607.

And the width in preview again doesn't match the width when I see it posted. Very frustrating.

I don't know how well preview works now, but it used to be that it was rather dangerous to click "post" from the preview screen; instead you had to go back and click "post" there, or risk losing all sorts of formatting.

Post #42 comes closest for me, just slightly to the right.

It's very far to the right in IE8, farther than D.

#45 is dead-on perfect, while #44 is just a few pixels to the left.

Almost exactly the same for me.

By David Marjanović (not verified) on 17 Feb 2010 #permalink

Well, bad.

Why?

That distance is used for building the tree in distance methods. In parsimony, it's completely ignored; you can read it from the cladogram afterwards.

In what way does this constitute a problem?

I find it difficult to believe that the number of rate categories would be a problem. The problem should arise if the rates at sites vary over the tree in a non-correlated manner. In this way you arrive at the "no common mechanism" model of Tuffley & Steel (your ref. 25), in which there is a separate rate parameter for each site *on each branch*; that is, no site predicts the rate at any other site, and no branch predicts the rate of the same site on any other branch. This is the model that's formally equivalent to parsimony. But one feature of this model is that it isn't guaranteed to be consistent even if the sequences have evolved according to that model. I have to wonder at the real-world applicability of this model. But I'll have a look at Kolaczkowski & Thornton.

By John Harshman (not verified) on 17 Feb 2010 #permalink

I'm using Firefox 3.5.7 with MacOS 10.5.8.

I wonder if the extreme difference we're seeing has something to do with the fact that Windows uses Arial and Mac uses Helvetica?

I don't know how well preview works now, but it used to be that it was rather dangerous to click "post" from the preview screen; instead you had to go back and click "post" there, or risk losing all sorts of formatting.

Yes, comment preview still mangles HTML entities on Tet Zoo.

Pharyngula changed such that formatting is no longer lost, but I suspect that was a blog-specific upgrade.

By Owlmirror (not verified) on 17 Feb 2010 #permalink

For those wondering - I don't have the first clue on how to modify my blog settings such that fonts and formatting and so on will appear differently, nor do I have the time to tinker with such things. Sorry if this is a problem. Furthermore, efforts to make changes to some of the blog settings have been futile in recent months - it is usually just about impossible for me to modify such things as the blogroll and free module (everything you see in the left-hand column). This has been looked at a few times by the technical people but the problem persists. This explains, by the way, why new additions to the blogroll (e.g., Mickey Mortimer, Jim Robbins) are few and far between: the publishing platform is just unworkable.

But one feature of this model is that it isn't guaranteed to be consistent even if the sequences have evolved according to that model.

Of course. Parsimony is and stays more vulnerable to long-branch attraction than maximum likelihood and Bayesian inference -- at the same time as being less vulnerable to heterotachy.

More later.

This has been looked at a few times by the technical people but the problem persists.

What if you ask your SciBlings instead of the ScienceBorg ITiots?

(That's a term I recently encountered in the endless thread of Pharyngula, and I just love it.)

By David Marjanović (not verified) on 18 Feb 2010 #permalink

What if you ask your SciBlings instead of the ScienceBorg ITiots?

I've done this, no-one can help. The tech people have fixed the problem two or three times, but it always reoccurs.

Of course. Parsimony is and stays more vulnerable to long-branch attraction than maximum likelihood and Bayesian inference -- at the same time as being less vulnerable to heterotachy.

Maybe not that either. See for example Spencer, M., E. Susko, and A. J. Roger. 2005. Likelihood, parsimony, and heterogeneous evolution. MBE 22:1161-1164.

Abstract: Evolutionary rates vary among sites and across the phylogenetic tree (heterotachy). A recent analysis suggested that parsimony can be better than standard likelihood at recovering the true tree given heterotachy. The authors recommended that results from parsimony, which they consider to be nonparametric, be reported alongside likelihood results. They also proposed a mixture model, which was inconsistent but better than either parsimony or standard likelihood under heterotachy. We show that their main conclusion is limited to a special case for the type of model they study. Their mixture model was inconsistent because it was incorrectly implemented. A useful nonparametric model should perform well over a wide range of possible evolutionary models, but parsimony does not have this property. Likelihood-based methods are therefore the best way to deal with heterotachy.

By John Harshman (not verified) on 18 Feb 2010 #permalink

Ah, the Wandering Albatross... Diomedea exulans. The only bird on Darren's list for which the scientific name pops into my head. And a wonderful soaring sort of name it is; the kind of binomial such a bird might rejoice in. I was disappointed not to see a Hoopoe there, though, since Upupa epops is another name I would have known, and is another of my favourite binomials, in this case for its hilarity, symmetry and general memorableness.

All the phenetic/phylogenetic discussion seems clever stuff, and I hope to understand it one day.

Glad you were heartened to blog again, Darren ;-D

Maybe not that either.

Kolaczkowski & Thornton 2004 was the start of a literature battle that took at least 2 years, and I haven't read most of those papers... I forgot to look for the probably latest papers today. Maybe tomorrow.

Which paper, BTW, are Spencer et al. referring to? Because K & T 2004 didn't propose any "mixture model".

A useful nonparametric model should perform well over a wide range of possible evolutionary models, but parsimony does not have this property. Likelihood-based methods are therefore the best way to deal with heterotachy.

Non sequitur.

By David Marjanović (not verified) on 18 Feb 2010 #permalink

Well, bad.

Why?

A branch that isn't supported by the data is not a problem in a phylogenetic hypothesis?

(In a graphic representation of a distance matrix, it's obviously not a problem. The problems start only when we treat such a representation as a phylogenetic hypothesis.)

That distance is used for building the tree in distance methods. In parsimony, it's completely ignored; you can read it from the cladogram afterwards.

In what way does this constitute a problem?

Suppose a tree (A(B,C)). Further suppose that B has gone off and added a large number of autapomorphies, so it's now very distant from C. It's at a high risk to be found outside the (A,C) clade.

This can happen with parsimony, too, but only if it happens to characters that are already parsimony-informative based on other taxa, namely when the autapomorphies of B overwrite autapomorphies of (B,C); it can't happen if the autapomorphies are simply additions to the data matrix. With distance methods it can. With distance methods that contain a model it still can, just not that easily.
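The contrast drawn here can be illustrated with a toy matrix. The sketch below is only an illustration under stated assumptions: the four taxa, the 0/1 characters and the helper names are all made up, and "distance" is taken as a raw count of differences with no model correction. It appends five characters in which only B has changed, then compares the number of parsimony-informative characters with the raw B-C distance.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_count(matrix):
    """Count parsimony-informative columns in a {taxon: states} matrix."""
    return sum(is_parsimony_informative(col) for col in zip(*matrix.values()))

def hamming(x, y):
    """Raw (uncorrected) number of differing characters."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical matrix: two characters supporting B + C.
base = {"Out": "00", "A": "00", "B": "11", "C": "11"}

# Append five autapomorphies of B (only B has state 1).
extended = {
    taxon: states + ("11111" if taxon == "B" else "00000")
    for taxon, states in base.items()
}

print(informative_count(base), informative_count(extended))
print(hamming(base["B"], base["C"]), hamming(extended["B"], extended["C"]))
```

In this toy case the count of informative characters stays the same while the raw B-C distance grows, which is the asymmetry described above. It does not cover the "overwriting" case, where the new changes hit characters that were already informative.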

I find it difficult to believe that the number of rate categories would be a problem. The problem should arise if the rates at sites vary over the tree in a non-correlated manner.

Which of course happens in the real world. Even the latest few programs for molecular dating take this into account.

By David Marjanović (not verified) on 18 Feb 2010 #permalink

Oops, the second line ("Why?") should be indented once.

By David Marjanović (not verified) on 18 Feb 2010 #permalink

Thanks for the explanation, John. I would agree with David that the BC branch not being based on any characters is bad because assuming we're trying to recover and represent a phylogeny, all phylogenetic branches in the real world exist as one or more mutations/characters.

By Mickey Mortimer (not verified) on 18 Feb 2010 #permalink

David,

I find some odd statements in your reply.

A branch that isn't supported by the data is not a problem in a phylogenetic hypothesis?

(In a graphic representation of a distance matrix, it's obviously not a problem. The problems start only when we treat such a representation as a phylogenetic hypothesis.)

Who says that branches in distance trees aren't supported by the data? You are assuming that the only way to find support for a branch is to place a specific change on that branch (and presumably by parsimony). That's circular reasoning. I take a different position. I think that if a method is intended to determine a phylogenetic branching pattern, it produces a phylogenetic hypothesis. Whether that hypothesis is supported is another question, and depends on whether the assumptions of the method are valid in that particular case. This has nothing necessarily to do with the placement of particular mutations on particular branches.

Suppose a tree (A(B,C)). Further suppose that B has gone off and added a large number of autapomorphies, so it's now very distant from C. It's at a high risk to be found outside the (A,C) clade.

Your example of a problem is, once again, limited to UPGMA and other methods that assume an ultrametric tree (a clock). You consistently seem to ignore the fact that most distance methods do not assume a clock.
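A rough sketch of why the (A(B,C)) example bites a clock-assuming method like UPGMA. Everything here is an assumption made for illustration: the branch lengths are invented, the distances are exact sums of those branch lengths, and the `upgma` function is a bare-bones textbook-style clustering rather than what any particular program does.

```python
from itertools import combinations

def upgma(dist):
    """Cluster taxa by UPGMA; return a Newick-like topology string.

    dist maps frozenset({x, y}) -> pairwise distance.
    """
    d = dict(dist)
    # Cluster label -> number of leaves it contains.
    size = {name: 1 for pair in dist for name in pair}
    while len(size) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        for k in size:
            if k not in (a, b):
                # Size-weighted mean = average over all leaf pairs.
                d[frozenset((merged, k))] = (
                    size[a] * d[frozenset((a, k))]
                    + size[b] * d[frozenset((b, k))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

# True tree (A,(B,C)). Hypothetical branch lengths: A to the B+C node = 2,
# C = 1, and B = 1 (clock-like) or B = 10 (long terminal branch).
clock_like = {frozenset("AB"): 3, frozenset("AC"): 3, frozenset("BC"): 2}
long_b = {frozenset("AB"): 12, frozenset("AC"): 3, frozenset("BC"): 11}

print(upgma(clock_like))  # ((B,C),A)
print(upgma(long_b))      # ((A,C),B)
```

With the clock-like distances the true grouping is recovered; with the long B branch, the two short-branched taxa are joined first. The sketch says nothing about distance methods that do not assume a clock.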

Which of course happens in the real world. Even the latest few programs for molecular dating take this into account.

I don't think you're correct here. What the programs you're talking about actually do is account for rate variation across the tree; but there is a common rate parameter across all sites at any local region of the tree. This is not at all what we were talking about previously, in which different sites on the same branch have uncorrelated rates.

Mickey,

We can separate phylogeny into at least three elements: 1) the branching pattern, 2) the lengths of those branches, and 3) the particular mutations that make up the branches. What we most often want is 1. Knowing 2 or 3 is not required for that -- in fact it's been shown by simulation that many methods are much more robust in estimating 1 than 2 or 3. Parsimony, for example, can often get 1 right when it's way off on both 2 and 3. There is no requirement that any method even be able to estimate 3 in order to find 1. Now if we're trying to get an estimate of 3, that's another matter. No distance method will do that for you. In that case you're stuck with parsimony or likelihood. I suspect, however, that you would view likelihood as flawed in the same way as distance methods are if you knew the details.

By John Harshman (not verified) on 20 Feb 2010 #permalink

I was going to say I don't care about branch lengths, but thinking further, I suppose I do in the sense that longer branch lengths between nodes are more likely to be real. Still, forming a topology based on branch length instead of pattern is not valuable to me, since it's a taxonomy that does not necessarily reflect phylogeny. I don't see how any method could not have to deal with 3 in order to find 1, as every phylogenetic method requires an input of data which ultimately reflects which mutations distinguish OTUs. As an aside, I think 3 is far too neglected lately. I've seen 30 iterations of the Theropod Working Group matrix, but have yet to see any discuss the support for various basic clades like Maniraptora. And I might indeed view likelihood as flawed in the same way, though I've never read a description of its algorithm.

By Mickey Mortimer (not verified) on 20 Feb 2010 #permalink

Mickey,

Well of course then parsimony is the only phylogenetic method, because you're setting up the criteria for such a method so as to exclude all methods other than parsimony. But are those valid criteria? I suggest that the proper criterion is simply this: will this method produce the correct tree using these data? You should understand also that regardless of how you get a topology, you can proceed to map characters onto it by a different method. And I agree that 3 is too neglected.

Parsimony, by the way, does assume a model of evolution, and it's rather an odd one. It assumes that you can tell nothing about the evolution of one character by looking at the evolution of another, i.e. that there are no common evolutionary parameters. It implicitly assumes that a character is as likely to change on a long branch as on a short one. And it assumes that all branches are fairly short. Often, these assumptions are good enough. But don't imagine they aren't there.
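One way to see the branch-length point is to look at how a parsimony score is computed. The sketch below assumes Fitch's counting rule for a single unordered character on a rooted binary tree; the nested-tuple tree encoding and the example states are made up for illustration. Note that branch lengths are not an input anywhere: a change costs one step wherever it falls.

```python
def fitch(tree, states):
    """Return (state_set, changes) for one character on a rooted binary tree.

    tree is a leaf name or a 2-tuple of subtrees; states maps leaf -> state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1

# Hypothetical character: A and B share state 0, C and D share state 1.
states = {"A": 0, "B": 0, "C": 1, "D": 1}

print(fitch((("A", "B"), ("C", "D")), states)[1])  # 1
print(fitch((("A", "C"), ("B", "D")), states)[1])  # 2
```

The tree that groups the taxa sharing a state needs one change and the alternative needs two, and each character is scored independently of every other, in line with the "no common parameters" description above.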

By John Harshman (not verified) on 21 Feb 2010 #permalink