I suppose I'll have to start at the beginning. One of the things we do in the microbiome project is analyze microbial communities by assessing the diversity of 16S DNA sequences*. In other words, we want to know what species or 'operational taxonomic units'--OTUs--are present and how many of that OTU are there.
Operationally, we are taking a bunch of DNA sequences and placing them in categories. This, of course, is more difficult than it sounds.
One of the issues with 16S is that many (nine to be exact) regions of the gene are hypervariable--they evolve so quickly that these regions can't be compared between distantly related strains. Of course, you could compare them, but you couldn't make any evolutionary sense of them because sites will have changed multiple times and you can't see that (e.g., an A -> T might really be an A -> G -> C -> T)**.
To make sure that we don't compare apples and oranges in our sequence, when we align sequences, we place a lot of gaps in the post-alignment sequences (this is a cartoon example):
ACGT might become AC--GT
CGAT --CGAT
So, onto classification. One way that we can place sequences into groups is through distance-based methods. For example, we can create categories where there is less than three percent divergence among sequences (three percent divergence means that in a sequence of 100 nucleotides, three differ between the two sequences***)****.
The problem stems from the gaps--you don't want to look at the gaps when calculating genetic distance (again, this is to avoid the multiple hits and non-homology problems). The 'gold standard' program that is often used in microbiome studies is DNADIST (which doesn't look at gaps). However, DNADIST is slow. There is no way it can handle the amounts of sequence generated; it's too mathematically intensive (while it hasn't been decided yet, my guess is that the part of the Human Microbiome Project that looks at 250 people will generate at least 50,000,000 reads. Even if only a fraction of these are unique, it's still too damn many).
There are faster methods (k-mer searching approaches) but these have gap issues. My colleagues and I are kludging together some workarounds, but does anyone know of a fast distance calculator that doesn't have trouble with gaps?
*The 16S gene is found in all bacteria and can be used as a 'barcode' to identify bacterial species, without having to grow the bacteria.
**There's a second issue which is that 16S doesn't encode for a protein--what matters is the secondary structure of the RNA molecule. This means that changes in one part of the molecule can select for changes in another part to ensure binding or formation of various loop structures.
I hate 16S.
***Calculating sequence divergence can be done by a variety of methods, but now I'm getting depressed.
****Of course, this isn't trivial either. Do we mean that all sequences must be less than three percent different from each other, or some, or only one?
- Log in to post comments
Greengenes does a really good job at processing large numbers of 16S sequences, and aligning them without an explosion of indels. But it still uses DNADIST (DNAML) to calculate distances.
It sounds like your main problem is essentially that you don't want to calculate 50,000,000^2 distances using DNADIST?
No matter how fast your code, calculating 2.5x10^15 individual distances will take a *lot* of time. A much faster approach might be to first cluster the sequences (there's some fast ways to do that without calculating all pairwise distances), calculate pairwise distances only among the closest pairs, then estimate the largest distances by triangle inequality.
Sure, you'll lose some resolution between the most distantly related strains, but those distances aren't that useful for tree building anyway.