When Mendelism reemerged in the early 20th century to become what we term genetics, no doubt the early practitioners of the nascent field would have been surprised to see where it went. The recognition in the 1950s of DNA as the substrate which encodes genetic information opened up molecular biology and led to the biophysical strain which remains prominent in genetics. Later, in the late 1960s, Allan Wilson and Vincent Sarich used crude measures of genetic distance to resolve controversies in paleontology, specifically the date of separation between the human and ape lineages. Genetics spans the physical and historical sciences: whereas physically oriented scientists may look to DNA as a basis for computation, historically oriented scholars can use it to illuminate mysteries in their own fields.
In the 1980s the “mitochondrial Eve” arrived on the scene, purporting to map out the demographic history of our species over the past 200,000 years. This was during an era when extraction and amplification of genetic material were primitive, and so the abundant mitochondria were the preferred source of information. Additionally, the uniparental inheritance of mtDNA makes it ideal for a coalescent model.
Over the past two decades science has come much further. Genetic material is easier to analyze, and the computers which do that analysis have become much more powerful. A non-trivial segment of the genome is now being brought to bear on questions of genetic history. More powerful computational techniques mean that the complexity of the models can be cranked up. This is evident in a recent paper, “Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data”:
Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40-270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17-43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3-26.9 kya), and our analysis yields no evidence for subsequent migration. 
Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).
The author summary is a bit plainer:
The demographic history of our species is reflected in patterns of genetic variation within and among populations. We developed an efficient method for calculating the expected distribution of genetic variation, given a demographic model including such events as population size changes, population splits and joins, and migration. We applied our approach to publicly available human sequencing data, searching for models that best reproduce the observed patterns. Our joint analysis of data from African, European, and Asian populations yielded new dates for when these populations diverged. In particular, we found that African and Eurasian populations diverged around 100,000 years ago. This is earlier than other genetic studies suggest, because our model includes the effects of migration, which we found to be important for reproducing observed patterns of variation in the data. We also analyzed data from European, Asian, and Mexican populations to model the peopling of the Americas. Here, we find no evidence for recurrent migration after East Asian and Native American populations diverged. Our methods are not limited to studying humans, and we hope that future sequencing projects will offer more insights into the history of both our own species and others
The basic building block of this paper is the “allele frequency spectrum,” or AFS. The text explains it rather well:
Given a genetic region sequenced in multiple individuals from each of P populations, the resulting AFS is a P-dimensional matrix. Each entry of this matrix records the number of diallelic genetic polymorphisms in which the derived allele was found in the corresponding number of samples from each population. For example, if diploid individuals from two populations were sequenced, with 10 individuals from population 1 and 5 from population 2, the AFS would be a 21-by-11 matrix (indexed from 0). The [2,0] entry would record the number of polymorphisms for which the derived allele was seen twice in population 1 but never seen in population 2, while the [20,5] entry would record polymorphisms for which the derived allele was homozygous in all individuals from population 1 and seen 5 times in population 2. If all polymorphic sites possess only two alleles and can be considered independent, the AFS is a complete summary of the data. Many of the statistics commonly used for population genetic inference, such as FST and Tajima’s D, are summaries of the AFS.
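To make the definition concrete, here is a minimal sketch of how one might tabulate a joint AFS from per-site derived allele counts. The function name `joint_afs`, its arguments, and the four example sites are hypothetical and chosen only to match the 21-by-11 example in the quoted passage; this is not the authors' software.

```python
import numpy as np

def joint_afs(derived_counts, n_chrom):
    """Tabulate a P-dimensional allele frequency spectrum.

    derived_counts: array-like of shape (n_sites, P); entry [s, p] is the
        number of derived alleles seen at site s in population p.
    n_chrom: length-P sequence with the number of sampled chromosomes
        per population (2 x individuals for diploids).
    """
    counts = np.asarray(derived_counts, dtype=int)
    # Each axis runs from 0 to n copies, hence n + 1 entries.
    afs = np.zeros(tuple(n + 1 for n in n_chrom), dtype=int)
    # np.add.at accumulates correctly when several sites share an entry.
    np.add.at(afs, tuple(counts.T), 1)
    return afs

# Hypothetical data: 10 diploids from population 1 (20 chromosomes)
# and 5 diploids from population 2 (10 chromosomes).
sites = [(2, 0), (2, 0), (20, 5), (1, 3)]
afs = joint_afs(sites, n_chrom=(20, 10))

print(afs.shape)    # (21, 11)
print(afs[2, 0])    # 2 sites: derived allele seen twice in pop 1, never in pop 2
print(afs[20, 5])   # 1 site: fixed in the pop 1 sample, seen 5 times in pop 2

# Summing over the second axis gives the one-population spectrum for pop 1.
marginal_pop1 = afs.sum(axis=1)
```

Summaries such as the number of segregating sites in one population can be read off the marginal spectrum, which is one way to see why so many familiar statistics are functions of the AFS.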
The model also uses diffusion equations to project change in gene frequencies, a technique first utilized by R. A. Fisher. Since Fisher’s previous research had been in thermodynamics, it was natural for him to analogize change in gene frequencies to the sort of fluxes which one sees in heat-related processes. It seems that the combination of approaches that they used here is moderately scalable:
The computational advantage of the diffusion method is even larger when placed in the context of parameter optimization. Unlike the coalescent approach, there is no simulation variance, so efficient derivative-based optimization methods can be used. As examples, consider our applications to human data, which involve 20 samples per population. On a modern workstation, fitting a single-population three-parameter model took roughly a minute, while fitting a two-population six-parameter model took roughly 10 minutes. The fits of three-population models with roughly a dozen parameters typically took a few hours to converge from a reasonable initial parameter set. This speed allows us to use extensive bootstrapping to estimate variances, overcoming the limitations of composite likelihood.
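The fitting step described above compares an observed spectrum against the spectrum a candidate model expects. As a rough illustration, the sketch below scores candidates with a Poisson composite log-likelihood, treating each AFS entry as an independent count. That Poisson form is my assumption for the example, and the expected spectrum used here is the standard neutral, constant-size result (theta / i) for a single population rather than anything computed by diffusion. The observed counts are invented.

```python
import numpy as np
from math import lgamma

def poisson_composite_loglik(observed, expected):
    """Sum of independent Poisson log-probabilities over AFS entries."""
    obs = np.asarray(observed, dtype=float).ravel()
    exp = np.asarray(expected, dtype=float).ravel()
    # log(k!) computed as lgamma(k + 1) for each observed count.
    log_fact = np.array([lgamma(k + 1.0) for k in obs])
    return float(np.sum(obs * np.log(exp) - exp - log_fact))

def neutral_expected_sfs(theta, n):
    """Expected counts for derived allele frequencies 1..n-1 under the
    standard neutral model with constant population size."""
    i = np.arange(1, n)
    return theta / i

# Hypothetical observed spectrum for a sample of n = 6 chromosomes.
observed = np.array([50, 24, 17, 12, 10])

scores = {
    theta: poisson_composite_loglik(observed, neutral_expected_sfs(theta, 6))
    for theta in (25.0, 50.0, 100.0)
}
best_theta = max(scores, key=scores.get)
print(best_theta)  # the candidate whose expected spectrum fits best
```

Because a function like this is deterministic in its parameters, it can be handed to a derivative-based optimizer, which is the advantage over simulation-based approaches that the quoted passage points to. Linked sites violate the independence assumption, which is why the authors turn to bootstrapping for their uncertainty estimates.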
They did have to stop at 3 populations because of computational problems when attempting to analyze 4 populations. When looking at the Mexican American data set they set aside the African sample and analyzed it alongside the European and East Asian samples instead. It seems the biggest step forward here is that they are able to add a host of evolutionary genetic parameters to their model, in particular selection and migration, without too much difficulty. The technique also seems robust to the deviations which selection or linkage would introduce into the presumably neutral markers from which one infers population fission. Previous models of population movements out of Africa were often unrealistic insofar as they assumed that populations were exclusively fissiparous and that back-migration would never occur. That is fine as far as it goes, but if it is possible to construct more realistic models which don’t explode because of computational intensity then it seems prudent to do so.
Reading the whole paper through, it seems that this is more a signpost to the future in method than a treasure-trove of results. Shifting particular demographic events by a few tens of thousands of years is important on the margin, but with confidence intervals this wide I’m not sure what to make of the results on their own (though in concert with other findings they are of more interest). Instead, consider that they’ve put the software which they used to generate these results online.
Citation: Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data. PLoS Genet 5(10): e1000695. doi:10.1371/journal.pgen.1000695