Jonathan Eisen has a paper in PLoS One describing software that he’s developed for analyzing 16S rRNA sequence data. Rather than walk through everything, I’ve decided this post will be different: I’m going to treat the paper as a manuscript that I’m reviewing (there will be some differences, and it won’t be as formally written as a ‘real’ review). I want to pose some ‘real’ questions, rather than extensively distilling the paper for the ‘lay’ reader, so that non-scientists can see what we really criticize each other about (hint: it’s not whether evolution is real). On to the review.
Eisen has a good summary of what the program does:
…it describes automated software for analyzing rRNA sequences that are generated as part of microbial diversity studies. The main goal behind this was to keep up with the massive amounts of rRNA sequences we and others could generate in the lab and to develop a tool that would remove the need for “manual” work in analyzing rRNAs….
The basics of the software are summarized below (see flow chart too):
- Stage 1: Domain Analysis
  - Take a rRNA sequence
  - blast it against a database of representative rRNAs from all lines of life
  - use the blast results to help choose sequences to use to make a multiple sequence alignment
  - infer a phylogenetic tree from the alignment
  - assign the sequence to a domain of life (bacteria, archaea, eukaryotes)
- Stage 2: First pass alignment and tree within domain
  - take the same rRNA sequence
  - blast against a database of rRNAs from within the domain of interest
  - use the blast results to help choose sequences for a multiple alignment
  - infer a phylogenetic tree from the alignment
  - assign the sequence to a taxonomic group
- Stage 3: Second pass alignment and tree within domain
  - extract sequences from members of the putative taxonomic group (as well as some others to balance the diversity)
  - make a multiple sequence alignment
  - infer a phylogenetic tree
From the above path, we end up with an alignment, which is useful for things such as counting number of species in a sample as well as a tree which is useful for determining what types of organisms are in the sample.
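To make the control flow of those three stages concrete, here is a toy sketch in Python. It is not Eisen's code and it uses none of the real tools: the "blast" step is replaced by a shared k-mer count, the "align and build a tree" step is replaced by simply taking the label of the best hit, and the reference database, sequences, and every function name (`kmer_similarity`, `top_hits`, `assign_label`, `classify`) are made up for illustration. The only thing it is meant to show is how each stage narrows the database used by the next one.

```python
from collections import Counter

# Hypothetical, made-up reference records; real rRNA sequences are ~1500 nt.
REFERENCE_DB = [
    {"name": "bac1", "domain": "Bacteria", "group": "Proteobacteria", "seq": "ACGTACGTGGCC"},
    {"name": "bac2", "domain": "Bacteria", "group": "Firmicutes", "seq": "ACGTTTTTGGCC"},
    {"name": "arc1", "domain": "Archaea", "group": "Euryarchaeota", "seq": "GGGGCCCCAAAA"},
    {"name": "euk1", "domain": "Eukaryota", "group": "Fungi", "seq": "TTTTAAAACCCC"},
]


def kmer_similarity(a, b, k=3):
    """Toy stand-in for a BLAST score (assumption): number of shared k-mers."""
    kmers_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kmers_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((kmers_a & kmers_b).values())


def top_hits(query, database, n=3):
    """Rank reference records by similarity to the query and keep the best n."""
    ranked = sorted(database,
                    key=lambda rec: kmer_similarity(query, rec["seq"]),
                    reverse=True)
    return ranked[:n]


def assign_label(hits, field):
    """Placeholder for 'align, infer a tree, read off the placement'.

    Assumption: we just take the label of the most similar hit, which is
    much cruder than a phylogenetic assignment.
    """
    return hits[0][field]


def classify(query, reference_db):
    # Stage 1: search against all lines of life and assign a domain.
    domain = assign_label(top_hits(query, reference_db), "domain")

    # Stage 2: search only within that domain and assign a taxonomic group.
    domain_db = [rec for rec in reference_db if rec["domain"] == domain]
    group = assign_label(top_hits(query, domain_db), "group")

    # Stage 3: collect members of the putative group, plus a few others
    # from the same domain to balance the diversity, for the final
    # alignment and tree (not computed in this sketch).
    members = [rec for rec in domain_db if rec["group"] == group]
    others = [rec for rec in domain_db if rec["group"] != group][:2]
    return {
        "domain": domain,
        "group": group,
        "final_set": [rec["name"] for rec in members + others],
    }


if __name__ == "__main__":
    print(classify("ACGTACGTGGCA", REFERENCE_DB))
```

In the real pipeline, each `assign_label` call would be a multiple sequence alignment followed by tree inference, and the output of Stage 3 would be the alignment and tree themselves rather than a list of names.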