The Human Microbiome, 16S, Pyrosequencing, and Phylogenetics?

I'm not certain you can have all four. Let's start at the beginning.

Just to review, one way to examine the human microbiome--the organisms that live on and in us--is to extract the DNA from a biological sample (usually something from a person that is slimy, stinky, or both, such as feces or a nasal swab), and then amplify, using PCR, the 16S rRNA gene. 16S is found in all archaea and eubacteria, and can serve as a barcode to identify microbes*. This approach means you don't have to culture the bacteria and count things on agar plates (assuming they can be cultured in the first place); you can just crack open the cells, amplify the 16S, and sequence a bunch of 16S DNA molecules**.

Pyrosequencing is viewed as a great leap forward: compared to traditional Sanger sequencing, one can generate roughly two to three orders of magnitude more reads (sequences of 16S DNA molecules) for a comparable price.

However, there are some drawbacks to pyrosequencing (often referred to as '454' after the manufacturer). First, 454, even with the soon-to-be-released upgrade, only yields around 450 nucleotides (450 'bp'). That's not a lot of information--looking at some data we generated, classifying sequences to genus isn't that successful. To put this in perspective, most, though not all, bacterial genera contain species whose most recent common ancestor can be tens of millions of years in the past (of course, if you're a creationist, I suppose this isn't a problem at all).

But that's not the real problem with 454 16S work. The real problem is with 'indels', which is short for insertions and deletions. Due to sequencing errors***, often an extra base is added or deleted in homopolymeric runs (e.g., CCC is accidentally read as either CC or CCCC).
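To make the homopolymer issue concrete, here is a minimal Python sketch. The function names (`homopolymer_runs`, `collapse_homopolymers`) and the toy sequences are made up for illustration; this is not part of any 454 software. It locates homopolymeric runs and shows that a true sequence and its miscalled versions differ only in run length.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. Positions are 0-based."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def collapse_homopolymers(seq):
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


true_seq = "ACCCGT"      # hypothetical true sequence
short_read = "ACCGT"     # CCC miscalled as CC
long_read = "ACCCCGT"    # CCC miscalled as CCCC

for name, seq in [("true", true_seq), ("short", short_read), ("long", long_read)]:
    print(name, homopolymer_runs(seq), collapse_homopolymers(seq))
```

All three sequences collapse to the same string, which is one way to see that the error is in the run length rather than in the base identities. Collapsing also throws away real run-length information, so treat it as a diagnostic, not a fix.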

So, moving on to phylogenetic analysis (reconstructing evolutionary histories). One commonly used method of determining if two samples of microbes are different uses phylogenetics--for example, you might want to know if people with eczema have different skin microbial communities from those who don't have eczema. (Phylogenetics is also used in one method of taxonomically classifying sequences.) Indels cause a real problem for phylogenetic analysis. Actually, it's the 'ins' that cause more problems than the 'dels', but I'm getting ahead of myself.

One of the things no one talks about when doing phylogenetics is sequence alignment. When you have a bunch of DNA sequences, they need to be aligned: in other words, position 62 (the 62nd nucleotide) in sequence A may not be biologically equivalent to position 62 in sequence B. For instance, in a protein, if sequence A has lost an amino acid (three nucleotides), position 62 in sequence A may be equivalent to position 65 in sequence B:

sequence A: CCC---ACG
sequence B: CCCAGGACG
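The same bookkeeping can be sketched in a few lines of Python. The helper name `ungapped_positions` is made up for this illustration; it walks an aligned sequence and reports, for each alignment column, which position of the original (ungapped) sequence sits there.

```python
def ungapped_positions(aligned, gap="-"):
    """For each alignment column, return the 1-based position in the
    ungapped sequence, or None where the column holds a gap."""
    positions = []
    count = 0
    for ch in aligned:
        if ch == gap:
            positions.append(None)
        else:
            count += 1
            positions.append(count)
    return positions


seq_a = "CCC---ACG"
seq_b = "CCCAGGACG"

# Pair up positions that share an alignment column in both sequences.
equivalent = [
    (a, b)
    for a, b in zip(ungapped_positions(seq_a), ungapped_positions(seq_b))
    if a is not None and b is not None
]
print(equivalent)
```

In this toy alignment, position 4 of sequence A lines up with position 7 of sequence B: the same three-nucleotide offset as in the position 62 versus 65 example.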

But 16S sequences are even worse. To make a very complex story short, many parts of the molecule are so variable that in distantly related organisms you can't align parts of the molecule (they are not homologous). Most alignment methods align the 16S sequence to a much longer reference sequence. For example, one commonly used aligner (NAST) stretches the 1500 bp 16S molecule out to over 7200 bp, meaning that roughly 80% of the aligned sequence is composed of gaps (this should give you some idea just how variable parts of this molecule are when all of the eubacteria are examined). This molecule is too complex to successfully align without a roadmap.
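The 80% figure is simple arithmetic, sketched here in Python using the round numbers from the paragraph above (1500 bp of sequence in a 7200-column alignment; the function name `gap_fraction` is made up for illustration):

```python
def gap_fraction(sequence_length, alignment_length):
    """Fraction of alignment columns that must be gaps for one sequence."""
    return 1 - sequence_length / alignment_length


print(f"{gap_fraction(1500, 7200):.0%}")  # just under 80%
```

Since the alignment is "over" 7200 columns and many reads are shorter than 1500 bp, the real fraction for any given read is at least this high.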

OK, so what the fuck does this have to do with indels? Deletions aren't that much of a problem, since alignment to a master sequence only results in an extra gap. While missing information isn't good, it's not incorrect information. Here's a cartoon form--the boldface is the true sequence; the rest are the observed reads:

CCC
CC
CC
CCC
CCC
CCC

But insertions are a serious problem because insertions 'create' a base that should not be there. This fouls up the alignment to the master, meaning that not only do you have a nucleotide's worth of false information, but that the other nucleotides are misaligned, giving even more false information. A similar cartoon example again:

CCAGT-
CCCAGT (2)
C-AGT- (3)
CCAGT-
A----T (4)

Sequence #3 is a deletion (shown in the first example) and it doesn't screw up the rest of the alignment. But look at #2, which contains an insertion of a C. We are now seeing supposed sequence differences: the third position is a C instead of an A, the fourth an A instead of a G, the fifth a G instead of a T. Worse, the sixth position, which should not exist at all but should be a gap, is now a T. Compare #2 to #4, the latter of which is a highly divergent sequence. Now we think the T from #2 should be treated like the T from #4, when that isn't the case (the real sequence has a gap in the sixth position). The only homologous (or 'equivalent') base for all of the sequences is the first position.
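Here is a minimal Python sketch that tallies those differences column by column, using the cartoon sequences above. The function name `classify_columns` and the category labels are made up for illustration; real aligners such as NAST do far more than this.

```python
from collections import Counter


def classify_columns(reference, read, gap="-"):
    """Compare an aligned read to the reference, column by column.

    match    - same character in both (including gap against gap)
    missing  - read has a gap where the reference has a base
    extra    - read has a base where the reference has a gap
    mismatch - both have a base, but they differ
    """
    if len(reference) != len(read):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter()
    for ref_base, read_base in zip(reference, read):
        if ref_base == read_base:
            counts["match"] += 1
        elif read_base == gap:
            counts["missing"] += 1
        elif ref_base == gap:
            counts["extra"] += 1
        else:
            counts["mismatch"] += 1
    return counts


master = "CCAGT-"
insertion_read = "CCCAGT"  # sequence #2 from the cartoon
deletion_read = "C-AGT-"   # sequence #3 from the cartoon

print("insertion:", dict(classify_columns(master, insertion_read)))
print("deletion: ", dict(classify_columns(master, deletion_read)))
```

The deletion costs one missing column and nothing else. The single inserted C produces three apparent mismatches plus a base in a column that should be a gap, so one read error becomes four columns of false information.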

I've been looking at lots of repeated sequences from a small set of known DNAs (i.e., not a natural community, but one we created in the lab), and this is a significant problem--and it's quite common. Mind you, this isn't my discovery; these errors have been well known since the advent of this technique, but if you're sequencing a genome, you just add more reads until most of the reads support one answer****.
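That 'add more reads' strategy amounts to a majority vote. A minimal Python sketch, assuming each read covering a region has been reduced to a single observed string (the function name `consensus_call` and the `min_fraction` threshold are made up for illustration, not taken from any assembler):

```python
from collections import Counter


def consensus_call(observations, min_fraction=0.5):
    """Return (call, support_fraction) for the most common observation.

    The call is None when the best-supported observation is not backed
    by more than min_fraction of the reads.
    """
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    call, support = counts.most_common(1)[0]
    fraction = support / len(observations)
    return (call if fraction > min_fraction else None, fraction)


# Five hypothetical reads covering the same homopolymer run.
reads = ["CC", "CC", "CCC", "CCC", "CCC"]
print(consensus_call(reads))
```

This works for a genome because many reads cover the same stretch of the same molecule. With 16S amplicons from a mixed community, each read may come from a different organism, so there is nothing equivalent to vote over.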

What's nuts is that those who use phylogenetic methods don't seem to be worried about these sequencing errors. If I were doing a population genetics study (sequencing a few genes from a number of organisms with Sanger sequencing), I would never publish data this variable*****. But unlike with Sanger sequencing, there's no alternative, because you can't resequence the same clone--with 454, you get one shot at this.

We (the scientific group 'we', not the Mad Biologist we) are going to have to figure out how to treat these data--and be very careful using distance and phylogenetically based analytical tools. Of course, you analyze the data you have, not the data you wish you had....

*There are actually serious problems with 16S, but that train, sadly, has left the station. Not much the Mad Biologist can do about it.

**Needless to say, cracking open the cells and isolating the DNA while still accurately representing the 'real' abundance of different organisms is not trivial--by a long shot.

***Actually, they're sequence reading errors, but that's beyond the scope of this post.

****Of course, with a large enough genome, one would predict that, very rarely, this 'consensus' method would give you the wrong answer.

*****Variable: heckuva euphemism.

Back in 2007, Wang et al. from the Ribosomal Database Project at MSU published a paper in AEM entitled "Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy" (PDF, 7 pages).

There, they looked at both full-length and shorter rDNA sequence reads. The overall classification accuracy rate for full-length sequence reads was 91.4% at the genus level. 400 bp sequence read accuracy was 88.7% ... not a very significant drop ... especially when you factor in cost. The 400 bp read can be done with one reaction (and on a pyrosequencer); the full-length read, if you're lucky, can be done with a forward and reverse reaction on an ABI 3730. So, if it's the 454 that is the problem, you can do the 400 bp read twice (which will still, on average, give you data faster and cheaper than conventional sequencing), and align the two reads for accuracy. If there is a sequencing error, it'll show up here and can be corrected.

It's a commonly understood issue, at least it was for me when I entered this field, that you'll get some sequence artifacts. Errors introduced through PCR should be rare, especially if you're using a proof-reading enzyme ... and if people aren't, why the fuck aren't they? What is more troubling -- to me -- is the rate of chimeric sequences which reside in GenBank. Some studies have been found to contain over 50% chimeric sequences in their data. That's criminal! It's also another reason to consider shorter reads. A shorter read means less incomplete PCR, and less of a chance for chimera formation.

Mike, your talents are way over my head...but, I have extracted from my head hundreds of tiny layered growing hidden "beans"....from embedded "lesions". I believe that their predictable design and behavior defines what is presently termed the new mystery skin disease...Morgellons. I have opened these hidden lesions with aloe, and "fizzed" their inner construction using baking soda, and extracted their retractable and elastic white coated "bean bag" artist.... and looked at them on the TV screen using a 200x "Pro-scope" TV microscope. I have sandwiched "live" samples in clear packaging tape, and daily photographed the TV screen samples for study and evidence of a very strange, bizarre and difficult to remove sticky substance that I believe is an airborne environmental contaminant. Once embedded in the skin, apparently this critter's success at survival includes exuding what I believe is a calcium carbonate covering that seals its "biosphere" under the skin, maintaining pressure and providing exchange of gases. From observing my photos, I am led to guess this is a dwarf-size, red-abalone-like anatomy. I can only guess and end up, at worst...a fool for my efforts. I found your site because I see similarities in color and design to the "little squirts" Botrylloides and Didemnum. However, the anatomy I see in the lesion samples appears to have a stage of elongation and torsion. If you, or anyone you know, would be willing to help raise my simple efforts to bring this invader to light... to a higher standard of discovery, it would be greatly appreciated. Kind regards, Nancy