'Misunderestimating' Natural Selection

From the archives, here's something about how we might be underestimating the strength of natural selection when we look at molecular data:

PZ Myers has a superb summary of a very interesting PLoS paper. In the paper, the authors identify genes that have experienced strong selection and thus might be responsible for the chimpanzee-human divergence. Here is how Myers describes the approach:

With all the data available from the human genome project and the ongoing chimpanzee genome project, we can start comparing DNA sequences. One parameter that can be assayed is the frequency of synonymous changes in the DNA: these are changes in the nucleotide sequence that produce synonyms in the triplet code, and therefore cause no changes at all in the protein sequence. These changes represent a kind of steady background noise, the rate of random, neutral changes in the genome. Non-synonymous changes, on the other hand, do change the amino acid sequence of the resulting protein, and are presumed to be more likely to have some kind of effect on the phenotype. The ratio of nonsynonymous to synonymous nucleotide changes within a gene, dN/dS, is a measure of the history of selection for change in that gene. High dN/dS values mean there has been selection pressure for novel forms, while low dN/dS values mean selection has been working to conserve the sequence.

So here's the analysis: go through the list of human genes, find each one's homolog in the chimpanzee, compute the dN/dS ratio, and rank them in order. What you end up with is a list, with the genes that have experienced the strongest selection for new properties between the two species at the top. Note that you can't tell which of the two species has changed the most from their common ancestor from this analysis (although comparison with an outgroup can help with that), so all we know is which genes have diverged the most.
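To make the quoted recipe concrete, here is a rough Python sketch. It is not the paper's pipeline. The sequences, gene names, and numbers are made up, and `classify_codon_differences` and `rank_by_dnds` are hypothetical helpers written for this illustration. Real dN/dS estimation also normalizes by the number of synonymous and non-synonymous sites and corrects for multiple substitutions at the same site; this sketch skips both and simply assumes the (dN, dS) estimates were computed elsewhere.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_differences(seq_a, seq_b):
    """Count codons that differ synonymously vs. non-synonymously.

    Assumes two aligned, gap-free coding sequences in the same reading frame.
    This is a crude codon-level count, not a proper dN or dS estimate.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1  # same amino acid: the change is silent
        else:
            nonsynonymous += 1  # different amino acid: the protein changes
    return synonymous, nonsynonymous


def rank_by_dnds(estimates):
    """Rank genes by dN/dS, highest first.

    `estimates` maps a gene name to a (dN, dS) pair computed elsewhere.
    Genes with dS == 0 are skipped because the ratio is undefined.
    """
    ratios = [(gene, dn / ds) for gene, (dn, ds) in estimates.items() if ds > 0]
    return sorted(ratios, key=lambda item: item[1], reverse=True)


# Made-up example data, for illustration only.
human = "CTTAAAGGT"
chimp = "CTCAAGGAT"
print(classify_codon_differences(human, chimp))

estimates = {"geneA": (0.002, 0.010), "geneB": (0.015, 0.010), "geneC": (0.001, 0.0)}
for gene, ratio in rank_by_dnds(estimates):
    print(gene, round(ratio, 2))
```

In the toy alignment, the first two codons change without changing the amino acid and the third swaps glycine for aspartate. In the toy ranking, the hypothetical geneB lands on top and geneC drops out because it has no synonymous changes to divide by.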

Here's my problem with the article: this method will miss many, many genes, including many 'important' ones. Now, this isn't the authors' fault: to paraphrase Rumsfeld, sometimes you have to analyze the genomes you have, not the genomes you wish you had. Note the plural genomes. But I'm getting ahead of myself.

Imagine a gene 300 amino acids long (that's 900 base pairs of DNA; every three bases make up a codon, which codes for one amino acid). In many genes, most non-synonymous substitutions will be deleterious (dN/dS at those codons will be very close to zero), some will be neutral (dN/dS = 1), and a few will be beneficial (dN/dS > 1). If you average across the gene, dN/dS will be much lower than 1. However, this doesn't mean that the gene isn't evolutionarily important: the few beneficial non-synonymous substitutions could be doing evolutionary backflips (dN/dS >> 1), and a gene-wide summary statistic still won't detect selection at this gene, because it averages dN/dS across all sites.
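A toy calculation shows how the averaging hides the interesting sites. The per-codon values below are invented for illustration, `summarize_sites` is a hypothetical helper, and a plain arithmetic mean is only a stand-in for how a gene-wide dN/dS is actually estimated.

```python
def summarize_sites(site_omegas):
    """Return the gene-wide mean dN/dS and the number of sites with dN/dS > 1."""
    mean_omega = sum(site_omegas) / len(site_omegas)
    n_positive = sum(1 for omega in site_omegas if omega > 1.0)
    return mean_omega, n_positive


# Hypothetical per-codon dN/dS values for a 300-codon gene:
# 280 strongly constrained sites, 14 neutral sites, 6 positively selected sites.
site_omegas = [0.02] * 280 + [1.0] * 14 + [4.0] * 6

mean_omega, n_positive = summarize_sites(site_omegas)
print(f"gene-wide mean dN/dS: {mean_omega:.3f}")
print(f"sites with dN/dS > 1: {n_positive} of {len(site_omegas)}")
```

With these made-up numbers the gene-wide mean comes out far below 1, even though six codons sit at four times the neutral rate. A ranking based on the gene-wide value would put this gene near the bottom.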

I'm not arguing a hypothetical case here. I'm currently in the process of submitting a manuscript about a gene in E. coli involved in the ecological divergence between 'harmless' E. coli and those involved in urinary tract infections. In this gene, about 2% of the amino acids appear to have a dN/dS ratio > 1.0, and in almost all of the other amino acids, amino acid substitutions are deleterious (dN/dS ~ 0.1). This gene has a gene-wide dN/dS ratio ~ 0.07, yet we know from functional and experimental studies that this gene is vital in the ecological divergence between the harmless and pathogenic forms. The 'PLoS' ranking system would most likely miss this gene.

Now, if your eyes haven't completely glazed over at this point, you're wondering, "How the hell does he know what's happening at each codon?" Simple. I'm the Mad Biologist. Never, ever doubt the Mad Biologist.

Seriously, there is a method known as the codon substitution method (for the technical details and paper, click here). Essentially, this method allows you to examine the dN/dS ratio for each amino acid, as opposed to the whole gene. I won't get into the technical details here, but what this method would require for the chimp-human analysis is lots of human and chimp genomes (at least ten of each, although two of each is the bare minimum and not very reliable). This is why I said earlier that you analyze the genomes you have, not the genomes you wish you had.

The punchline is that while this is a very interesting paper, I think we might be missing a lot of evolutionarily important genes simply because many, though not all, non-synonymous changes in these 'missed' genes are removed by natural selection. Instead, the PLoS method will be biased towards genes whose amino acid sequence can tolerate a lot of change without a degradation of function. What this means is that there might be even more genes that are responsible for the chimp-human divide. That's pretty cool.

Note to creationists: If I catch a single one of you using this post to somehow try to 'undermine' the theory of natural selection, I'm going to flame your lame ass. The whole damn point of this post is that we might be underestimating the power of natural selection. In science, as opposed to crackpot theology, we use deduction and induction. Sometimes, in the face of incomplete evidence, we disagree over the particulars.

Well, as a chemist who briefly played around in the protein folding area some 15 years ago during my postdoc, I'm surprised at how low the ratios are. Aside from active site residues, isn't the main purpose of a lot of amino acids just to make the protein fold up right? And can't you do a lot of "conservative" changes in amino acids without messing up the basic fold? Leucine for isoleucine? Tyrosine for phenylalanine, that sort of thing?

If you have different dN/dS ratios, might that just be telling you something about where the amino acids are in the fold? Tightly packed hydrophobic core--low ratio; hydrophilic surface--high ratio. And these might be just next to each other in an alpha helix.

By Michael Schmidt (not verified) on 09 Apr 2007

...and now I'm wondering if the codon-by-codon analysis may be a way of predicting folding of uncharacterized proteins, rather than a way of making bold statements about gene selection pressure....

By Michael Schmidt (not verified) on 09 Apr 2007