Andy Clark has written a review of comparative evolutionary genomics for Trends in Ecology and Evolution. His review deals with identifying functional regions of the genome and inference of both positively and negatively selected sequences.
Clark is one of the leaders in the field of evolutionary genetics (and now genomics), actively participating in the analysis of both the human and Drosophila genomes. He also brings a solid understanding of biology, as well as an appreciation of statistical rigor. You can sense his excitement about the union of molecular biology and evolution in the following passage:
One of the most wonderful things about comparative genomics is that it has turned a whole generation of molecular biologists into evolutionists, full of excitement about the way that evolution has sculpted exquisite modifications to organismal genomes and eager to tell stories about it.
Clark is also cautious about the conclusions we can draw from the preliminary analyses completed thus far. He does not appear to be happy with the sloppy work of some investigators:
At the same time, one of its worst disasters is that it has created a hoard of genomics investigators who think that evolutionary biology is just fun, speculative story telling. Sadly, much of the scientific publication industry seems to respond to the herd as much as it does to scientific rigor, and so we have a bit of a mess on our hands. Fortunately, this is all a temporary aberration and, eventually, the noise will be separated from the signal, and progress will march on in understanding what genome sequence divergence really means.
I hope he is correct that the lack of rigorous analysis is a “temporary aberration”. There are two possible explanations for the sloppy work Clark describes: a lack of understanding of proper statistical procedures, or a disregard for them. I hope that the problem stems from the former and not the latter. A lack of understanding can be overcome through education, whereas a disregard for scientific rigor requires a major shift across the entire field. We would have to convince researchers, reviewers, and publishers that research resting on incomplete statistical analyses is not acceptable for publication until those analyses are done properly.
Examples include studies that look for conserved non-coding sequences. The authors of such studies argue that the conserved sequences are under purifying selection, but they fail to reject the alternative hypothesis that the sequences are conserved because of low mutation rates. Rejecting that alternative requires polymorphism data, and Clark applauds the researchers who are using polymorphism as well as divergence to detect selective constraint.
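To make the logic concrete, here is a minimal sketch of one way polymorphism and divergence can be combined, in the spirit of a McDonald-Kreitman-style contingency test. It is my own illustration, not a method taken from Clark's review. The reasoning is that a low mutation rate should reduce polymorphism and divergence in roughly the same proportion, so the ratio of the two in a conserved region should match that of a putatively neutral reference. Weak purifying selection lets deleterious variants segregate but rarely fix, which inflates polymorphism relative to divergence. The function names, the choice of Fisher's exact test, and the idea of a single neutral reference class are all assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def constraint_test(poly_cand, div_cand, poly_ref, div_ref):
    """Compare polymorphism and divergence counts between two site classes.

    poly_* are counts of polymorphic sites within a species, div_* are counts
    of fixed differences between species. "cand" is the conserved candidate
    region and "ref" is a putatively neutral reference (hypothetical inputs).
    """
    denominator = div_cand * poly_ref
    # Odds ratio near 1: consistent with a lower mutation rate alone.
    # Odds ratio above 1: excess polymorphism relative to divergence.
    odds = (poly_cand * div_ref) / denominator if denominator else float("inf")
    p_value = fisher_exact_two_sided(poly_cand, div_cand, poly_ref, div_ref)
    return odds, p_value
```

Calling `constraint_test(poly_cand, div_cand, poly_ref, div_ref)` returns the odds ratio and the p-value. A test like this only addresses the ratio of polymorphism to divergence. It ignores the allele frequency spectrum, and very strong purifying selection would remove polymorphism as well as divergence, so a non-significant result does not by itself show that a region is unconstrained.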
We must also be cautious when inferring positive selection from polymorphism data. Clark warns against using SNP data (such as that available from the HapMap project) because such data suffer from ascertainment bias. SNPs are fine for association studies, but the identification of genes under positive selection should be carried out using complete sequences. One must also control for the effects of demography on nucleotide sequence polymorphism by scanning multiple loci. Clark was a co-author on one such study which I briefly discussed here.
Clark also describes how analyses of genome content (duplications, gene gain, gene loss, gene order, gene density) require evolutionary explanations, and he explains that complex models of mutation are necessary to describe the evolution of the genome. Mutation rates are heterogeneous and depend on the genomic region (both local and global) in which a nucleotide is found.
Genomics has become a huge enterprise, in both the public and private sectors. Almost all of the research in this area is carried out in a comparative evolutionary framework. In a sense, nothing in genomics makes sense except in the light of evolution.