I’m not certain you can have all four. Let’s start at the beginning.
Just to review, one way to examine the human microbiome–the organisms that live on and in us–is to extract the DNA from a biological sample (usually something from a person that is slimy, stinky, or both, such as feces or a nasal swab), and then amplify, using PCR, the 16S rRNA gene. 16S is found in all archaea and eubacteria, and can serve as a barcode that can be used to identify microbes*. This approach means you don’t have to culture the bacteria and count things on agar plates (assuming they can be cultured in the first place); you can just crack open the cells, amplify the 16S, and sequence a bunch of 16S DNA molecules**.
Pyrosequencing is viewed as a great leap forward: compared to traditional Sanger sequencing, one can generate roughly two to three orders of magnitude more reads (sequences of 16S DNA molecules) for a comparable price.
However, there are some drawbacks to pyrosequencing (often referred to as ‘454’ after the manufacturer). First, 454, even with the soon-to-be-released upgrade, only yields reads of around 450 nucleotides (450 ‘bp’). That’s not a lot of information–looking at some data we generated, classifying sequences to genus isn’t that successful. To put this in perspective, most, though not all, bacterial genera contain species whose most recent common ancestor can be tens of millions of years in the past (of course, if you’re a creationist, I suppose this isn’t a problem at all).
But that’s not the real problem with 454 16S work. The real problem is with ‘indels’, which is short for insertions and deletions. Due to sequencing errors***, often an extra base is added or deleted in homopolymeric runs (e.g., CCC is accidentally read as either CC or CCCC).
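To make the homopolymer problem concrete, here is a minimal Python sketch. It's mine, not part of any 454 software, and the function names and toy reads are made up for illustration. It lists the homopolymer runs in a read and shows that a true CCC read as CC or CCCC differs only in run length, so collapsing each run to a single base makes the three reads identical.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Return (base, run_length) pairs for consecutive identical bases."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def collapse_homopolymers(seq):
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


# Toy reads (hypothetical): the true sequence plus an under-call and an over-call
true_read = "ACCCGT"
under_call = "ACCGT"    # CCC read as CC (a deletion)
over_call = "ACCCCGT"   # CCC read as CCCC (an insertion)

for read in (true_read, under_call, over_call):
    print(read, homopolymer_runs(read), collapse_homopolymers(read))
```

All three reads collapse to ACGT. Collapsing also throws away real run-length differences between organisms, so treat this as a way to spot suspicious reads, not as a fix.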
So, moving on to phylogenetic analysis (reconstructing evolutionary histories). One commonly used method of determining if two samples of microbes are different uses phylogenetics–for example, you might want to know if people with eczema have different skin microbial communities from those who don’t have eczema. (Phylogenetics is also used in one method of taxonomically classifying sequences.) Indels cause a real problem for phylogenetic analysis. Actually, it’s the ‘ins’ that cause more problems than the ‘dels’, but I’m getting ahead of myself.
One of the things no one talks about when doing phylogenetics is sequence alignment. When you have a bunch of DNA sequences, they need to be aligned: in other words, position 62 (the 62nd nucleotide) in sequence A may not be biologically equivalent to position 62 in sequence B. For instance, in a protein-coding gene, if sequence A has lost an amino acid (three nucleotides), position 62 in sequence A may be equivalent to position 65 in sequence B:
sequence A: CCC---ACG
sequence B: CCCAGGACG
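Here is a rough Python sketch of that bookkeeping, using the two toy sequences above with the three-nucleotide gap written as three hyphens. The function names are hypothetical. It maps each position in the ungapped sequence to its alignment column, then asks which position in B sits in the same column as a given position in A.

```python
def position_to_column(aligned_seq, gap_chars="-."):
    """Map each 1-based position in the ungapped sequence to its 1-based alignment column."""
    mapping = {}
    position = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char not in gap_chars:
            position += 1
            mapping[position] = column
    return mapping


def equivalent_position(aligned_a, aligned_b, position_in_a):
    """Return the 1-based position in B aligned to a position in A, or None if B has a gap there."""
    column = position_to_column(aligned_a)[position_in_a]
    column_to_pos_b = {col: pos for pos, col in position_to_column(aligned_b).items()}
    return column_to_pos_b.get(column)


seq_a = "CCC---ACG"
seq_b = "CCCAGGACG"
# Position 4 in A (the A after the gap) lines up with position 7 in B
print(equivalent_position(seq_a, seq_b, 4))
```

The same three-position offset is what turns position 62 into position 65 in the example in the text.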
But 16S sequences are even worse. To make a very complex story short, many parts of the molecule are so variable that in distantly related organisms you can’t align parts of the molecule (they are not homologous). Most alignment methods align the 16S sequence to a much longer reference sequence. For example, one commonly used aligner (NAST) stretches the 1500 bp 16S molecule out to over 7200 bp, meaning that roughly 80% of the aligned sequence is composed of gaps (this should give you some idea just how variable parts of this molecule are when all of the eubacteria are examined). This molecule is too complex to successfully align without a roadmap.
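For anyone who wants to check the "roughly 80%" figure, the arithmetic is a one-liner. This little Python sketch just uses the two round numbers quoted above (1500 and 7200); the function name is mine.

```python
def gap_fraction(raw_length, aligned_length):
    """Fraction of alignment columns that must be gaps for a sequence of raw_length."""
    return (aligned_length - raw_length) / aligned_length


# 1500 bp of real sequence spread over 7200 alignment columns
print(f"{gap_fraction(1500, 7200):.1%}")  # 79.2%
```

Since the aligned length is "over 7200", the true gap fraction for a full-length sequence is at least this, and a 450 bp 454 read placed in the same template is mostly gap.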
OK, so what the fuck does this have to do with indels? Deletions aren’t that much of a problem, since alignment to a master sequence only results in an extra gap. While missing information isn’t good, it’s not incorrect information. Here’s a cartoon form–the boldface is the true sequence; the rest are the observed reads:
But insertions are a serious problem because insertions ‘create’ a base that should not be there. This fouls up the alignment to the master, meaning that not only do you have a nucleotide’s worth of false information, but the other nucleotides are also misaligned, giving even more false information. A similar cartoon example again:
Sequence #3 contains a deletion (shown in the first example), and it doesn’t screw up the rest of the alignment. But look at #2, which contains an insertion of a C. We now see supposed sequence differences: the third position is a C instead of an A, the fourth an A instead of a G, the fifth a G instead of a T. Worse, the sixth position, which should not exist at all but should be a gap, is now a T. Compare #2 to #4, the latter of which is a highly divergent sequence. Now we think the T from #2 should be treated like the T from #4, when that isn’t the case: the real sequence has a gap in the sixth position, and the only homologous (or ‘equivalent’) base for all of the sequences is the first position.
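The difference between "missing" and "false" information can be counted column by column. The Python sketch below is only an illustration: the three aligned rows are sequences I invented to match the description above (they are not the original cartoon), and the function name is hypothetical. It compares each aligned read to the true row and separates columns where the read merely has a gap from columns where it shows a base the true sequence doesn't have there.

```python
GAP = "-"


def compare_to_truth(true_row, read_row):
    """Classify each column where an aligned read disagrees with the true aligned sequence.

    Returns two lists of 1-based columns:
      missing: the read has a gap where the truth has a base (lost information)
      false:   the read shows a base the truth does not have there (wrong information)
    """
    if len(true_row) != len(read_row):
        raise ValueError("rows must have the same aligned length")
    missing, false = [], []
    for column, (true_char, read_char) in enumerate(zip(true_row, read_row), start=1):
        if read_char == true_char:
            continue
        if read_char == GAP:
            missing.append(column)
        else:
            false.append(column)
    return missing, false


# Hypothetical rows, invented to match the description in the text
true_row = "GCAGT-"       # true sequence; column 6 is a gap in the reference frame
insertion_row = "GCCAGT"  # an extra C is read, and everything downstream shifts right
deletion_row = "GC-GT-"   # one base is lost, and the aligner opens a gap

print(compare_to_truth(true_row, deletion_row))   # ([3], [])
print(compare_to_truth(true_row, insertion_row))  # ([], [3, 4, 5, 6])
```

One deleted base costs one column of missing data. One inserted base produces four columns of false differences, including a T in a column that should be empty.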
I’ve been looking at lots of repeated sequences from a small set of known DNAs (i.e., not a natural community, but one we created in the lab), and this is a significant problem–and it’s quite common. Mind you, this isn’t my discovery; these errors have been well known since the advent of this technique, but if you’re sequencing a genome, you just add more reads until most of the reads support one answer****.
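The "add more reads until most of them agree" approach amounts to a per-column majority vote. Here is a minimal Python sketch of that idea, assuming the reads are already aligned to the same length; the function name, the min_support threshold, and the toy reads are my own assumptions, not how any particular assembler does it.

```python
from collections import Counter


def consensus(aligned_reads, min_support=0.5):
    """Per-column majority vote; columns without enough support are reported as 'N'."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("reads must all have the same aligned length")
    called = []
    for column in zip(*aligned_reads):
        char, count = Counter(column).most_common(1)[0]
        called.append(char if count / len(column) > min_support else "N")
    return "".join(called)


# Hypothetical aligned reads of one genomic region; the second read has lost a C
reads = ["ACCCGT", "ACC-GT", "ACCCGT", "ACCCGT"]
print(consensus(reads))  # ACCCGT
```

This works for a genome because every read covering a region comes from the same underlying sequence. With 16S reads from a mixed community, each read may come from a different organism, so there is nothing to vote against.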
What’s nuts is that those who use phylogenetic methods don’t seem to be worried about these sequencing errors. If I were doing a population genetics study (sequencing a few genes from a number of organisms with Sanger sequencing), I would never publish data this variable*****. But unlike with Sanger sequencing, there’s no alternative, because you can’t resequence the same clone–with 454, you get one shot at this.
We (the scientific group ‘we’, not the Mad Biologist we) are going to have to figure out how to treat these data–and be very careful using distance and phylogenetically based analytical tools. Of course, you analyze the data you have, not the data you wish you had….
*There are actually serious problems with 16S, but that train, sadly, has left the station. Not much the Mad Biologist can do about it.
**Needless to say, cracking open the cells and isolating the DNA while still accurately representing the ‘real’ abundance of different organisms is not trivial–by a long shot.
***Actually, they’re sequence reading errors, but that’s beyond the scope of this post.
****Of course, with a large enough genome, one would predict that, very rarely, this ‘consensus’ method would give you the wrong answer.
*****Variable: heckuva euphemism.