Detecting Natural Selection (Part 7)

Nucleotide Polymorphism and Selection

This is the seventh of multiple postings I plan to write about detecting natural selection using molecular data (i.e., DNA sequences). The introduction can be found here. The first post described the organization of the genome, and the second described the organization of genes. The third post described codon-based models for detecting selection, and the fourth detailed how relative rates can be used to detect changes in selective pressure. The fifth post dealt with classical population genetics methods for detecting selection using allele and genotype frequencies. The sixth post described how to calculate nucleotide sequence polymorphism.

In the previous post, we discussed two different measures of nucleotide sequence polymorphism: the number of segregating sites (S) and the average number of pairwise differences (π). Both of these parameters are influenced by the per-nucleotide mutation rate (u) and the effective population size (Ne; the effective population size is the size of an idealized population that is evolving in the same manner as our actual population). It is very difficult to empirically estimate either u or Ne, but the two are related in what I like to think of as the population genetics parameter, θ = 4Neu.

The two measures of polymorphism (π and S) can be used to estimate θ. As I mentioned previously, the expected value of π does not depend on the number of sequences in your sample. Additionally, if our sequenced region is evolving according to neutral expectations (i.e., no selection), π is an unbiased estimator of our population genetics parameter, θ. Recall that S depends on the number of sequences in our sample, so we must apply a simple correction to this parameter if we want to relate it to θ (I won't get into the details of this correction here). This relationship, just like the one between π and θ, assumes the sequences are evolving neutrally.
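For readers who want to see the two estimates side by side, here is a minimal sketch in Python. It assumes the correction applied to S is the standard one (Watterson's), which divides S by a1 = 1 + 1/2 + ... + 1/(n-1) for a sample of n sequences. The function names and the toy alignment are made up for illustration, and both estimates are for the whole region rather than per site.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct base (S)."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences (pi)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)

def harmonic_a1(n):
    """a1 = 1 + 1/2 + ... + 1/(n-1): the sample-size correction for S."""
    return sum(1.0 / i for i in range(1, n))

def theta_estimates(seqs):
    """Return (theta from pi, theta from S) for aligned, equal-length sequences."""
    n = len(seqs)
    theta_pi = mean_pairwise_differences(seqs)
    theta_s = segregating_sites(seqs) / harmonic_a1(n)
    return theta_pi, theta_s

# Toy alignment, invented for this example (not real data).
sample = ["ATGCA", "ATGCT", "ACGCA", "ATGCA"]
theta_pi, theta_s = theta_estimates(sample)
print("theta(pi) =", theta_pi, " theta(S) =", theta_s)
```

The sketch ignores gaps, missing data, and sites with more than two alleles, all of which a real analysis would need to handle.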

Fumio Tajima showed that we can compare the two estimates of θ to detect signatures of natural selection in nucleotide sequences. Under neutral expectations, both π and S should lead to similar estimates of θ, and the difference between those estimates should be approximately zero. If, however, the two estimates differ by more than we expect from chance, we have evidence against neutrality (with the type of selection suggested by which estimate is greater). Tajima's summary statistic D is based on the difference between the two estimates of θ: the estimate using π minus the estimate from S. In the full statistic, that difference is divided by an estimate of its standard deviation, so the sign of D comes entirely from the numerator.

D = [θ(π) - θ(S)] / sqrt(Var[θ(π) - θ(S)])
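Here is a rough sketch of how D can be computed from π, S, and the number of sequences n. It uses the constants from Tajima's 1989 paper to estimate the standard deviation of the difference, which is what puts D on a comparable scale across samples. The function name and the input values at the bottom are illustrative only; π here is the average number of pairwise differences over the whole region, not per site.

```python
import math

def tajimas_d(pi, s, n):
    """Tajima's D from pi (mean pairwise differences over the region),
    s (number of segregating sites) and n (number of sequences)."""
    if n < 4:
        # With fewer than 4 sequences the variance term is zero or undefined.
        raise ValueError("need at least 4 sequences")
    if s == 0:
        return float("nan")  # no polymorphism, D is undefined

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    difference = pi - s / a1                 # theta(pi) - theta(S)
    variance = e1 * s + e2 * s * (s - 1)     # estimated variance of the difference
    return difference / math.sqrt(variance)

# Illustrative inputs only: 10 sequences, 16 segregating sites.
print(tajimas_d(pi=3.0, s=16, n=10))   # pi below S/a1, so D is negative
print(tajimas_d(pi=8.0, s=16, n=10))   # pi above S/a1, so D is positive
```

Judging whether a given D is significantly different from zero takes an extra step (Tajima used a beta distribution; coalescent simulations are another common route), which this sketch leaves out.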


If a new allele that confers a fitness benefit arises in a population, it is expected to be driven to fixation by natural selection. This should remove most of the genetic variation linked to the site under selection (a so-called selective sweep), and the variants on the same genetic background as the advantageous mutation are expected to be driven to fixation along with it (known as hitchhiking). A selective sweep, therefore, decreases the amount of polymorphism surrounding the site under selection, affecting both S and π. As the region surrounding the site under selection recovers from the sweep, the amount of polymorphism will increase. If we create a genealogy using the sequence around the site under selection, it will appear star-shaped (see below). Following a selective sweep, S recovers faster than π because every new mutation adds a segregating site, but rare variants contribute very little to the average pairwise difference. This means our estimate of θ will be greater when we use S than when we use π, which leads to a negative D statistic. Because we are detecting the recovery from the selective sweep and not the sweep itself, it will take some time after the sweep for the pattern to appear, and it will eventually fade away once the polymorphism has recovered to normal levels.

Figure (coal_genealogy.bmp): A neutral genealogy (left), a star-shaped genealogy (center), and the genealogy expected under balancing selection or population subdivision (right).


Balancing selection is also detectable using Tajima's D statistic. Balancing selection is expected to maintain excess polymorphism (relative to the neutral expectation) around the site under selection. This will result in two clusters of haplotypes (see the genealogy above), which become differentiated from each other over time (you can think of a haplotype as one of the sequences in the sample). Because there is an excess of differences between the haplotypes from the two clusters, the average pairwise differences in the sample will be greater than expected under a neutral model. In this case, our estimate of θ using π will be greater than that from S, resulting in a positive Tajima's D. Unlike the signature of positive selection, the signature of balancing selection is expected to persist for as long as the polymorphism is maintained by selection.
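To see why the two scenarios pull in opposite directions, it helps to look at how common each variant is in the sample. The sketch below is a toy illustration with invented numbers, not an analysis of real data. It assumes two alleles per site and uses the standard correction for S (dividing by a1 = 1 + 1/2 + ... + 1/(n-1)). A site where k of the n sequences carry one allele differs in k(n-k) of the n(n-1)/2 possible pairs, so rare variants add little to π while intermediate-frequency variants add a lot.

```python
def theta_from_allele_counts(counts, n):
    """Estimate theta two ways from per-site allele counts.

    counts: for each segregating site, how many of the n sequences carry
            the rarer allele (assumes two alleles per site).
    Returns (theta from pi, theta from S) for the whole region.
    """
    a1 = sum(1.0 / i for i in range(1, n))
    # Each site contributes the fraction of sequence pairs that differ there.
    theta_pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in counts)
    theta_s = len(counts) / a1
    return theta_pi, theta_s

n = 10
scenarios = {
    # Recovery from a sweep: new mutations are all rare (singletons).
    "after sweep": [1] * 8,
    # Balancing selection: variants sit at intermediate frequency (5 vs 5).
    "balancing": [5] * 8,
}

for label, counts in scenarios.items():
    theta_pi, theta_s = theta_from_allele_counts(counts, n)
    sign = "negative" if theta_pi < theta_s else "positive"
    print(label, round(theta_pi, 3), round(theta_s, 3), "->", sign, "D")
```

Both toy samples have the same S, so the estimate of θ from S is identical; only π changes, and with it the sign of D.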

It is important to remember that deviations from neutral expectations can be due to either natural selection or demography. We have shown how Tajima's D can be less than zero or greater than zero under a model of positive or balancing selection, respectively. We can also get a negative D statistic if a population is expanding: an expansion produces a star-shaped genealogy with many rare variants, so S increases more rapidly than π. If we sample from two different populations (possibly accidentally if we do not know enough about our study organism), each population will contain unique haplotypes, increasing our measure of π. This will result in a positive D statistic. To distinguish between selection and demography, we can sample multiple loci throughout the genome. Demographic effects are expected to be seen throughout the genome, whereas selection will only affect sites near the locus under selection.
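One simple way to put the multi-locus idea into practice is to compare each locus against the rest of the genome and flag the ones that stand out. The sketch below is only one possible approach, and it is not a formal test. The D values and locus names are invented, the function name is hypothetical, and the cutoff of 3 is an arbitrary choice. It measures how far each locus sits from the genome-wide median, in units of the median absolute deviation.

```python
from statistics import median

def flag_outlier_loci(d_by_locus, cutoff=3.0):
    """Return loci whose Tajima's D is far from the genome-wide median.

    d_by_locus: dict mapping locus name -> Tajima's D at that locus.
    cutoff:     how many median absolute deviations count as "far"
                (an arbitrary threshold, chosen here for illustration).
    """
    values = list(d_by_locus.values())
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return {}  # no spread among loci, nothing can be ranked
    return {
        locus: d
        for locus, d in d_by_locus.items()
        if abs(d - centre) / mad > cutoff
    }

# Invented values: most loci are mildly negative, as we might see in an
# expanding population, while one locus is strongly positive.
d_values = {
    "locus_01": -1.1,
    "locus_02": -0.9,
    "locus_03": -1.3,
    "locus_04": -1.0,
    "locus_05": -1.2,
    "locus_06": 1.9,
}

print(flag_outlier_loci(d_values))
```

In this toy example the shared negative D across loci would point to demography, and the one locus that departs from the rest becomes the candidate for selection. A real study would use many more loci and would usually back this up with simulations under a fitted demographic model.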

Tajima's D statistic (along with its cousin, Fay and Wu's H) is one of the most widely used techniques for detecting selection using nucleotide sequence polymorphism. In the next post I will detail how to combine nucleotide polymorphism with divergence to detect selection.

We measure (and value) many different kinds of diversity: cultural diversity in a city, species diversity in an ecosystem, subject diversity in a library, etc. Are there any situations where something analogous to Tajima's D applies? In other words, groups where you can measure diversity by the probability that two randomly chosen members are different, OR by the total number of ways in which members of the group are seen to differ, and these two measures are somehow comparable? For example, in a large collection of baroque classical CDs, you could greatly increase "S" by adding a few 50 Cent albums, but this wouldn't really increase "pi" very much, and maybe you really want a music collection that's diverse by both measures. Also, maybe a significant Tajima's D of music would imply that your collection was not randomly selected from all the music available at the store (?). At the very least, there might be some useful metaphors for teaching population genetics principles (if not objective ways to, say, evaluate a university affirmative action program).

The derivation for Tajima's D works out because the estimates of "diversity" that you are comparing (S after some tweaking and pi directly) are both estimators of the same parameter (theta).

This could work for any collection of "things" provided you could adequately and objectively assign states to different characters. For example, in your music collection, you would need to define characteristics that you think contribute to diversity (tempo, key, instrumentation etc.). You could then determine the average number of differences between any two CDs in your collection and the total number of characters that are "polymorphic" in your collection. If you do this, make sure that you have enough characters that some of them do not vary in your collection. It would be an interesting exercise to try.