I made this video (below the fold) to illustrate the steps involved in making a phylogenetic tree. The basic steps are to:
- Build a data set
- Align the sequences
- Make a tree
In the class that I'm teaching, we're making these trees in order to compare sequences from our metagenomics experiment with the multiple copies of 16S ribosomal RNA (rRNA) genes that we can find in single bacterial genomes. Bacteria contain between 2 and 13 copies of 16S rRNA genes and we're interested in knowing how much they differ from each other. Later, we'll compare the 16S ribosomal RNA genes from multiple species of bacteria to see how much these genes differ between a variety of bacteria.
What's in the video?
The video is about 14 minutes long, so here's a quick description of what it contains.
We begin by getting data and making a data set. Some of our class data come from our metagenomics data sets. We get other data from the NCBI. The video shows how we get all the sequences for all of the 16S rRNA genes from single genomes.
Then we edit the data set to remove the paragraph characters and shorten some of the sequence descriptions. Most of the time we spend doing bioinformatics in real life is spent on editing and formatting data.
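The same kind of cleanup can also be scripted. What follows is only a minimal sketch, not the procedure shown in the video: it assumes the "paragraph characters" are stray carriage returns or pilcrow marks left behind by a word processor, and that shortening a description means keeping the first word of each header line. The function name `clean_fasta`, the `max_name_length` parameter, and the example record are all hypothetical.

```python
def clean_fasta(text, max_name_length=20):
    """Return FASTA text with normalized line endings and short headers."""
    # Assumption: stray paragraph characters are carriage returns ("\r")
    # or pilcrow marks; convert or remove them so only "\n" remains.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\u00b6", "")
    cleaned = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            # Keep only the first word of the description, truncated.
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            cleaned.append(">" + name[:max_name_length])
        else:
            cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical record, made up for illustration.
example = ">seq1 Bacillus 16S ribosomal RNA gene, complete sequence\r\nacgt\r\n\r\nttga\r\n"
print(clean_fasta(example))
```

Truncating names can create duplicates, so it is worth checking that every shortened name is still unique before aligning.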
After that, we use JalView, a client-server program, to connect with a web service at the University of Dundee where the sequences are aligned by ClustalW. I've written about JalView before; now you can see it in action.
Making the tree is the simplest part. When ClustalW aligns the sequences, it also performs the calculations that can guide tree-building. We use the neighbor-joining method in this video. Neighbor-joining trees group sequences together by the number of amino acid or nucleotide differences. The sequences that are most similar are placed most closely together on the tree.
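To show what "number of differences" means in practice, here is a small sketch that counts differing columns between every pair of aligned sequences. It is not what ClustalW or JalView does internally; the sequence names and bases are made up, and gap characters are simply treated as one more symbol.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Count the alignment columns where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def pairwise_differences(alignment):
    """Map each pair of sequence names to its number of differing columns."""
    return {
        (n1, n2): count_differences(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Hypothetical aligned sequences, made up for illustration.
alignment = {
    "copy_1": "ACGTACGTAC",
    "copy_2": "ACGTACGTAT",
    "copy_3": "ACGAACCTAT",
}

diffs = pairwise_differences(alignment)
# List the pairs from most similar to least similar.
for pair, n in sorted(diffs.items(), key=lambda item: item[1]):
    print(pair, n)
```

A table of pairwise differences like this is the kind of input that distance-based tree methods start from.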
UPDATE: I want to clarify a few things. This video only shows one quick and easy method. The merits of different types of tree-building programs are not discussed.
Other topics that are not included:
- rooted vs. unrooted trees
- building a tree for publication
All those topics would be important if we're going to build a tree and publish it. If we just want to investigate relationships, the method in the video will suffice.
"The sequences that are most similar are placed most closely together on the tree."
Doesn't that make it a phenogram rather than a cladogram?
How do molecular biologists establish that a given base-pair transition reflects phylogeny rather than similarity? Is there really any reason to think that all coding positions are alike with respect to their probability to change?
I've never understood the fondness that molecular biologists have for neighbor joining.
I had never heard of a phenogram before, and I had to look through a few books to find it. I finally located a definition in Molecular Evolution and Phylogenetics (Nei and Kumar, 2000, Oxford University Press). They define a phenogram as "a tree constructed by the unweighted pair group method (UPGMA)" and go on to state that it represents phenotypic similarity. That makes sense to me.
A cladogram can be constructed from either phenotypic or, as in our case, genotypic information. So, I would say that a phenogram is a type of cladogram.
Now, for the probability of a mutation changing one base to another. The changes are not equally probable. The biochemistry of the bases, and our experience, show that some changes do occur more often than others. Some evolutionary models, like Maximum Likelihood, and some Bayesian programs take these probabilities into account.
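One familiar example of unequal change is the distinction between transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse); transitions are generally observed more often. The sketch below just tallies the two kinds of difference between two aligned sequences. The function names and example sequences are made up, and this is not how any particular tree-building program weights changes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_substitution(base_a, base_b):
    """Label one aligned column as identical, transition, transversion, or other."""
    a, b = base_a.upper(), base_b.upper()
    if a == b:
        return "identical"
    known = PURINES | PYRIMIDINES
    if a not in known or b not in known:
        return "other"  # gaps and ambiguity codes are not classified
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq_a, seq_b):
    """Tally transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a, seq_b):
        kind = classify_substitution(a, b)
        if kind in counts:
            counts[kind] += 1
    return counts


# Hypothetical aligned sequences, made up for illustration.
print(count_substitutions("AACCGT", "GACTTT"))
```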
This tutorial is meant for a quick and dirty simple analysis for a class. It doesn't go into all the nitty gritty details of which method is best.
Alex: biologists like models that are supported by experimental data. I'll have to write about these some day, but there have been experiments that tested the ability of tree-building programs to predict the correct trees, where the evolutionary history was known.
It turned out that parsimony was the worst and maximum likelihood the best, while neighbor-joining was usually good enough and usually gave results that were similar to the maximum likelihood predictions.
Some other reasons are that neighbor-joining is quick and computationally less of a problem than maximum likelihood. If I use large data sets from SNP studies and try to use the maximum likelihood program from Phylip, with bootstrapping, etc., I can crash my Mac OS X computer pretty easily or at least tie it up for a few hours. If I use neighbor joining, I have an answer that's good enough in a short time.
Last, biologists like to use the standard methods that other biologists are using. In microbiology, neighbor joining trees are a standard method.
I suspect you are right that method-choice is a cultural issue. In this case, microbiologists are using an inferior method simply because everyone else does it.
We systematists have our own cultural issues, of course. While we're innovative with phylogenetic algorithms and fancy data tricks, most systematists are at least 5 years behind the curve on molecular bench technology. While the rest of biology is out gathering ESTs and microarrays, we're still PCRing the same stupid ribosomal genes with the same primers used 15 years ago.
I use NJ all the time for preliminary work. It's great for checking for contamination. But no peer-reviewed phylogenetics journal would accept an NJ-only tree these days, for good reason. Distance methods compress most of the information out of your data and greatly limit what you can actually do with a tree.
As for putting your computer out for a few hours? Good grief, that's nothing. If you're going to spend months at the bench getting the data, what's the problem with spending an extra week to use the best available analysis?
1. This isn't intended for making trees that can be published in phylogenetics journals, although it's pretty easy to find publications in microbiology journals that have neighbor-joining trees. This is a quick and easy method that we can use in a classroom.
2. I would be a little careful with those 16S rRNAs. Some bacteria have as many as 14 different copies of rRNA genes, and often, they are all different.
3. I agree with you about using the best method of analysis that you can when you're answering research questions; however, methods that are computationally intense are not suitable for a classroom. If you told students that their computer was going to be working on a problem for a few hours, and they wouldn't be able to use it, they would freak!
"If I use neighbor joining, I have an answer that's good enough in a short time."
How do you decide what is "good enough" in the absence of an expected result (phylogeny)? Are there studies that suggest that the results you alluded to earlier (neighbor joining behaves like Maximum Likelihood Methods) actually generalize?
One might also wonder: if the experiments where the answer was "known" involved short accumulations of spurts of stochastic mutational change under artificial selection, would not the success of Maximum Likelihood methods be expected? Accumulation of small random changes over short periods of time would be most likely to fit the "Brownian motion" model underlying Maximum Likelihood Methods. Under such circumstances, they would be more likely to have succeeded in capturing the "true" tree. Given more (phylogenetically realistic) time frames for the accumulation of non-independent changes among positions, the Brownian motion model might not so readily apply. Hence, might not one expect other models to potentially more closely reflect "the true phylogeny" under such circumstances?
Hence, the question might be reposed, how do molecular biologists test potential consequences of non-independent positional change? Here I do not simply mean pyrimidine to pyrimidine vs. pyrimidine to purine substitutions, as the ratios or factors needed to correct for differing frequencies of such changes might be sample dependent, that is, dependent on which groups of organisms one samples to obtain the observed differences in frequency between transitions and transversions. It seems that in this case widely different correction factors, and justifications for their use, have been applied.
I am not trying to be difficult, but rather am most interested in establishing the fundamental assumptions molecular biologists use to make phylogenetic inferences from their data and then argue that these results generalize to other organisms, thereby arriving at "phylogenetically informative", presumably "true", trees.
It has been said that molecular biologists like to use "standard methods". Certainly, repeatability is essential to good science. However, could it be that standard methods also devolve from the use of simpler and computationally more expedient models and are not necessarily the result of conceptual underpinnings that generalize across all taxa, because, as some have said, the answers seem "good enough"?
"2. I would be a little careful with those 16S rRNAs. Some bacteria have as many as 14 different copies of rRNA genes, and often, they are all different."
This is interesting; might you have a citation?
As a P.S., I would like to say that I certainly appreciate your efforts to provide education tools. This is a complex and rapidly evolving area of science and it is difficult to penetrate for the untrained reader. I'm just trying to learn myself and to suggest that outlining the various assumptions being made provides a guide that may get some students stimulated to learn more about the details.
I'll have to answer this in pieces.
First, the question of what's "good enough"?
We are using phylogenetics as one of many tools to look at the 16S ribosomal RNA genes from a diverse set of sequences that were obtained from bacteria living either near a creek or in a forest.
In one experiment, we're comparing the 16S rRNA genes within single genomes to each other to see how similar, or different, they are. For example, B. thuringiensis has 14 different rRNA genes. Are they identical or different, and if different, how different?
The neighbor joining trees in combination with the multiple alignment data tell us the answer.
The next question is: did we identify our unknown bacteria correctly? Does it fall in the appropriate section of the tree? For this part, we'll compare a sequence from a sample that we studied to a set of diverse 16S rRNA sequences and see where it ends up.
What are our goals for this investigation? Some are listed below:
- 1. That students understand what a tree is trying to show.
- 2. That students know that a tree shows the number of changes, and that ancestral sequences are on one side and younger sequences on another.
- 3. That students see that similar species are grouped together.
- 4. That students begin to learn how to interpret the information in a tree.
- 5. That students learn how to evaluate the information in multiple sequence alignments.
- 6. That students learn how to find data and construct data sets.
I guess you could cite GenBank or watch the movie. In the movie I show how to go to the NCBI, locate the Genome Project database, look up Bacillus thuringiensis, and find all the 16S rRNA genes in the genome.
You may have to look at multiple records, though. I don't remember which strain had 14 16S rRNA genes; all the Bt genomes are different, even within the same species.
and answering more questions:
Are there studies that suggest that the results you alluded to earlier (neighbor joining behaves like Maximum Likelihood Methods) actually generalize?
The ones that I'm familiar with have usually looked within single organisms - like HIV, or phage, E. coli, or large mammals, like cats.
Why do I think we should be able to generalize from one species to another?
1. Nucleic acids are governed by the same chemistry rules wherever they are found.
DNA always behaves like DNA and RNA like RNA. The composition can affect the shape and bending but not the tendency to mutate.
2. Most organisms have enzymes with similar properties that are involved in both mutations and DNA repair.
Thus, mutation events and mutation repair are likely to be similar in different organisms, not identical, but similar.
Hence, the question might be reposed, how do molecular biologists test potential consequences of non-independent positional change?
I'm confused by the phrase "non-independent positional change."
I'm also confused by the statement below:
Given more (phylogenetically realistic) time frames
A realistic time frame can be quicker than you might think. In fact, it can happen in days, or weeks, depending on the generation time of the organism that you're studying.
Thank you for taking the time to reply.
What I am referring to under the rubric of "non-independent positional change" is this: given two presumptively homologous sites (positions) on a DNA/RNA strand, is the probability of mutation and change (perhaps mutation without repair, or mutation with incorrect repair) the same from position to position? Or can there be some sites (say, a series in a convoluted or folded region, or one critical to a particular function) that might be required to remain conserved, or that are more faithfully repaired, or repaired in a fashion that is more likely to maintain a specific base as opposed to another, or perhaps a stretch that is bound to histones in a way that might differentially shield it from mutation? Such sites would appear to be more conservative with respect to mutation, which itself might act upon more than one site in a non-independent way. Hence, one might envision two different types of non-independent change: that due to position alone, and that which might accumulate due to multiple sites being acted upon simultaneously, although the two effects could not be easily differentiated.
If such considerations are realistic possibilities, then one might not necessarily presume that tallying up base-pair differences (steps in the language of parsimony algorithms) would be reflective of the "same" amount of change.
It is my understanding that all presumably "phylogenetically informative" sites are treated as equally capable of change for the purposes of counting steps for most algorithms. Potential "non-independence" in perhaps "better protected stretches" might be providing greater insight into stabilizing selection even though they may hardly vary within a group, relative to positions at which more dissimilarity is evident. So the irony might be that highly conserved sites would appear to be less phylogenetically informative since they don't vary appreciably relative to more labile sites, but that when they do change, such change may be more biologically (and phylogenetically) significant.
I recognize such factors as I am suggesting are largely speculative and poorly understood at best (certainly by me). However, it is unclear to me why molecular phylogenies largely ignore the possibility of highly non-random positional change across a nucleic acid strand and the probability of change at any place on the strand is regarded as equiprobable (or at least this is my present understanding).
I certainly can understand the mathematical tractability such an assumption provides. I guess I am trying to understand how it is concluded that this assumption is warranted, based on what is known about mutation, DNA repair, conformational issues, etc. and presumptive phylogenetic change.
Are there some good primary sources for such questions that most molecular biologists have relied upon?
To follow up on another detail from earlier: in general, one can adopt a broader notion of a phenogram that might reflect "relationship" based on any phenotypic attributes, using any agglomerative or divisive algorithm, not necessarily just UPGMA. Hence, shared primitive states would be weighted as well as shared derived states. In contrast, a cladogram is used to display a tree topology derived from assuming that one is only considering "shared-advance states" in establishing a criterion for propinquity. Symplesiomorphy would be ruled out.
Ah, I see what you mean now. I'll have time to provide a better answer in a couple of hours. For the moment, you might like to take a look at this.
Okay - I can tackle some of the answer.
First, I don't think any phylogenetics programs take the biology into account except where they use matrices, like BLOSUM, that were derived from experimental data concerning the observed probability of one amino acid replacing another.
1. There are hotspots where mutations occur more frequently than others. This is well known. Many of these are connected to mobile DNA like transposons, or DNA with certain kinds of secondary structure, or repetitive DNA.
2. If mutations occur in certain positions, the organism will not be viable (i.e., will not live), so we wouldn't see these mutations.
3. We can't measure revertants, so our numbers will always be somewhat off.
4. A phylogenetic tree is an inference. It's not the actual way something happened. We might not need that high a degree of accuracy.
Third, why isn't this information used in phylogeny? Beats me. Maybe it's hard to do.
Maybe, also, we don't need this level of detail for very many things. I'm mostly familiar with the use of trees in molecular epidemiology. If we're trying to determine whether a strain of E. coli came from a bottle of organic apple juice, we can find that out with the tools we have on hand. If we're trying to find out whether nurses could have infected children with HIV, we already have the tools at hand.
What everyone seems to be forgetting here is that NJ analysis is NOT phylogenetic analysis! Neighbor-joining is a cluster method based on overall similarity of the data, and in this case, the sequences. It does not reflect the evolutionary history of the group whatsoever; it only tells you how 'similar' your sequences are. Characters in NJ are not evaluated as being ancestral and/or derived, synapomorphies or apomorphies, etc., and hence there is no 'common ancestor' in Neighbor joining (BIG Noooooo)! The length of the branches only represents percent sequence difference; the longer they are, the more 'different' your sequences are. The simplest 'real' phylogenetic method would have been a maximum parsimony, which you should have used, not neighbor joining.
Thanks for setting the record straight, Vazrick.
It sounds like somebody got carried away with "NJ" and other molecular evolutionary jargon ....
By the way, "phylogeny" is one of the most abused words in biology these days. People may want to go back to the literature of Ernst Haeckel's times, when these terms were coined (phyletic, phenetic, phylogeny, phylogenesis, ontogeny, ontogenesis, etc.; actually, they were first proposed in German).
I like theoretical methods best when they are supported by experimental data. In the papers that I've read, (and I guess I'll have to blog about them at some point) ML is supported best, NJ, next best, and MP, the least.
Data has a way of causing problems for the most elegant theories.
One example is here.
Being supported best or least does not change the fact that the way data are treated in NJ versus every other method is different. One has to understand the fundamental difference between a phenetic [cluster] analysis (i.e. based on overall similarity) and a cladistic [phylogenetic] analysis, before talking about which is supported and which is not. NJ does not care about evolution at all, and just because the output of the NJ analysis looks like a tree it does not make it a phylogeny. NJ takes two sequences, counts the number of differences between them, divides it by the total length, and voilà, gives you a percent difference. In a phylogeny that reflects evolution, however, character changes are important. It is true that parsimony is not always the best, but it is the fastest of all cladistic methods because it assumes the best answer is the one that requires the smallest amount of change and hence discards all other variation. Likelihood methods (including Bayesian inference) are the most powerful ones simply because they work with models of nucleotide substitution that take into account as much variation as theoretically possible, and this is what makes them computationally intensive.
You can bootstrap your data with any tree produced by any method and get support values; this does not change the fact that the NJ tree does not have anything to do with evolutionary history of your group of interest.
Where to begin? I think it is great that you are teaching this to students. Since there is some debate back and forth here about how the methods work and the definitions of some of the terms, I thought I would point people to an online chapter from my new Evolution textbook which people should be able to read at
One thing I would like to rebut is a comment from Vazrick about neighbor joining. Neighbor joining is indeed a phylogenetic method in the true sense. It is a method for taking distances and inferring a phylogenetic tree. It is in fact NOT a clustering method, although it does have many similarities to clustering methods.
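For readers following this exchange, a small worked sketch of the pair-selection step may help. In the standard formulation of neighbor joining, the pair to join is the one that minimizes Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), rather than simply the pair with the smallest distance. The five taxa and the distances below are made up for illustration, and the function name `nj_pair_to_join` is hypothetical; this is only the first step of the algorithm, not a full tree builder.

```python
from itertools import combinations


def nj_pair_to_join(names, dist):
    """Return the pair minimizing the neighbor-joining Q criterion, and its Q."""
    n = len(names)
    # Sum of distances from each taxon to all the others.
    totals = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }
    best_pair, best_q = None, None
    for a, b in combinations(names, 2):
        q = (n - 2) * dist[frozenset((a, b))] - totals[a] - totals[b]
        if best_q is None or q < best_q:
            best_pair, best_q = (a, b), q
    return best_pair, best_q


# Illustrative distances between five hypothetical taxa.
names = ["a", "b", "c", "d", "e"]
dist = {
    frozenset(("a", "b")): 5, frozenset(("a", "c")): 9,
    frozenset(("a", "d")): 9, frozenset(("a", "e")): 8,
    frozenset(("b", "c")): 10, frozenset(("b", "d")): 10,
    frozenset(("b", "e")): 9, frozenset(("c", "d")): 8,
    frozenset(("c", "e")): 7, frozenset(("d", "e")): 3,
}

closest = min(dist, key=dist.get)
print("smallest raw distance:", sorted(closest), dist[closest])
print("pair chosen by Q:", nj_pair_to_join(names, dist))
```

With these numbers the closest pair by raw distance is d and e, but the Q criterion selects a and b, which is one concrete way neighbor joining differs from clustering on overall similarity alone.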
That looks like a great resource!
Great post! For anyone who's interested, I put together a short post on Mac OS X phylogenetics software a while ago. http://www.jacksofscience.com/biology/mac-os-x-phylogenetics-software-m…
Thanks Kieran! One thing you might want to fix, PHYLIP is not a single program. It is a package with about 30 different programs.
Bacteria contain between 2 and 13 copies of 16S rRNA genes ...
Clarification: Nitrosomonas species have 1 rrn operon.
Yep, I should say most bacteria have between 2 and 13 copies.
Just a sidenote, you might find it considerably easier to manipulate (or import) sequences using eBioX rather than MS Word. ;)
Hi, your clip is good and it helped me understand how to build a phylogenetic tree. However, I was trying to make a phylogenetic tree for sulfotransferases (which is a protein alignment). When I search through NCBI, I am not able to find the link for the DNA region in FASTA format. So, I clicked on the "structure" icon, and I am only able to find an Acultransferase superfamily, where the superfamily includes a set of conserved domain models from one or more source databases. Do I need to include all the sources, or just one species of microorganism? Can you please advise?
I'm not sure what your question is.
Are you trying to find the DNA sequences that encode your proteins?
Yes, I am trying to find the DNA sequences that encode the proteins, which look something like this: "mkvlvlggdg fcgwpcavnl". Am I still able to build the phylogenetic tree?
If you have the protein sequences, it would be best to build your phylogenetic tree with the protein sequences.
If you find that the protein sequences are too similar, then, you'll need to find the corresponding sequences of DNA. This can be a bit tricky but if you want to know how to do this, let me know. Describing that process will require an entire blog post.
I tried to use the protein sequences to build the phylogenetic tree; however, it is not the same as what I read in the literature. Do you think it makes any difference whether I use the protein or the DNA sequences to build the phylogenetic tree? Or can you please teach me how to use the corresponding DNA sequences to build the phylogenetic tree? Many thanks.
You asked: Do you think it makes any difference whether I use the protein or the DNA sequences to build the phylogenetic tree?
Yes, I can think of many reasons why you might get different results.
1. The literature could be wrong.
2. You might be using the wrong outgroup.
3. You might be including paralogous genes by mistake instead of restricting your analysis to orthologs. Orthologs are the equivalent gene in a different organism. Paralogs are different members of the same gene family.
4. If there's alternative splicing, and you compare proteins made by different isoforms, the trees can differ.
5. If you use a different algorithm to generate the trees. If the published tree was generated by parsimony or maximum likelihood and you're using the neighbor-joining algorithm, you might have different results.
6. Last, many algorithms (parsimony and maximum likelihood, for example) produce multiple trees. Determining which tree is correct isn't always a trivial task.
I'll try to write a blog post about this in the coming weeks.
Hi... I am quite new to this field of research. I have sequenced my PCR product and followed up with a BLAST search.
I am confused about how I can draw the phylogenetic tree.
Can anyone help with a step-by-step method?
Thanks and regards.
The simplest thing you can do is to do a BLAST search against the sequences you'd like to have in your tree and then, in the top section of the BLAST results, select the link that says "Distance Tree of Results." A tree will appear.
Hi! I have a problem with file formatting for MEGA. I tried to convert my FASTA files into the .meg format, and the data input process followed the rules (at least I think there was nothing wrong with my input process). The system converted the FASTA file into .meg, but when I opened it and tried to build the tree, the system showed me this: (Aligned sequences must of equal lengths (in line 56)). Line 56 is a strain name; it has nothing to do with the sequences. I then deleted the whole of line 56 along with its sequence, to see if the problem was just an individual one. I opened the file again and got the same error at the same line 56! What should I do next before I go nuts?
Hi Wang !!,
I too got the same problem today. I never encountered this before. I shall try to troubleshoot this and get back soon.
Hi, I have a problem with creating a model of a phylogenetic tree.
Is it possible to compare the protein sequences from influenza vaccines with the prevalent viruses in a given year?
Yes. You can compare any groups of sequences you want.
The NCBI has a great influenza database here: http://www.ncbi.nlm.nih.gov/genomes/FLU/
You can find sequences by year and download either protein or nucleotide sequences in a FASTA format.
I would appreciate it if you could tell me how to read and interpret a phylogenetic tree of viral strains of avian influenza.
What are the numbers at the nodes? What do they mean?
What is the underlined 0.1 at the bottom of the tree, and what does it mean?
How can one say that two viral strains are homologous?
Hi! I was having trouble with the "Aligned sequences must of equal lengths" message too, and I just had to delete a "?" that was in one of my sequence names. So maybe strange characters in names cause the problem. Hope this helps.
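A quick check of the FASTA file before conversion can catch both kinds of problem mentioned in this thread. This is only a sketch under assumptions: it does not reproduce MEGA's own validation, the set of "safe" name characters (letters, digits, underscore, period, hyphen) is a guess rather than a documented rule, and the function names and example records are made up.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Assumption: names made only of these characters are unlikely to cause trouble.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")


def check_alignment(records):
    """Return a list of human-readable problems found in an alignment."""
    problems = []
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        problems.append(f"sequences have different lengths: {sorted(lengths)}")
    for name, _ in records:
        if not SAFE_NAME.match(name):
            problems.append(f"name contains unusual characters: {name!r}")
    return problems


# Hypothetical alignment with one odd name and one short sequence.
example = ">strain_1\nACGT\n>strain?2\nACG-\n>strain_3\nACG\n"
for problem in check_alignment(parse_fasta(example)):
    print(problem)
```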
I really wanted to ask: for an NJ tree, what does the branch length reflect?
Is it that the longer the branch, the more evolution has happened, and the species can be said to be derived?
An NJ tree will have terminal branches of unequal length, right?
And what does the position of the nodes reflect?
I have searched for two whole days for a manual on interpreting the tree, but I only managed to find descriptions of how the tree is reconstructed, which I couldn't understand. I just want to know how to interpret the tree. Can you please suggest a guide or manual? Thank you very much.
The branch length corresponds to the number of differences between the sequences, so you could think of it as reflecting the amount of evolution.
I think this site from Berkeley has a pretty clear guide for interpreting trees: http://evolution.berkeley.edu/evolibrary/article/side_0_0/evo_05
The how to build a phylogenetic tree video no longer seems to be on VIMEO. Is it possible to get a copy directly from you?
I would like to use it in my class.
I have got stuck at one point in constructing the ML tree using Phylip. I generated a tree using the ML method but have not been able to get the tree distances to appear on it.
Could you please give me stepwise guidance on getting the tree distances onto the ML tree created by Phylip 3.6?
FYI: I replaced the video embed code - it works now.
Dulan: As far as branch distances, try viewing your tree in a tree-viewing program like NJ plot or JalView. There is a setting in those two programs for showing branch distances.
You might also try some other programs for generating ML trees. Mega is pretty user friendly.