Inferring deep history in genes

When Mendelism reemerged in the early 20th century to become what we now term genetics, its early practitioners would no doubt have been surprised to see where the nascent field went. The centrality of DNA as the substrate which encodes genetic information, established in the 1950s, opened up molecular biology and led to the biophysical strain which remains prominent in genetics. Later, in the late 1960s, Allan Wilson and Vincent Sarich used crude measures of genetic distance to resolve controversies in paleontology, specifically the date of separation between the human and ape lineages. Genetics spans the physical and historical sciences: whereas physically oriented scientists may look to DNA as a basis for computation, historically oriented scholars can use it to illuminate mysteries in their own fields.

In the 1980s the "mitochondrial Eve" arrived on the scene, purporting to map out the demographic history of our species over the past 200,000 years. This was during an era when extraction and amplification of genetic material were primitive, and so mitochondria, which are present in many copies per cell, were the preferred source of information. Additionally, the uniparental, non-recombining inheritance of mtDNA makes it ideal for a coalescent model.

Over the past two decades science has come much further. Genetic material is easier to analyze, and the computers which do that analysis have become much more powerful. A non-trivial segment of the genome is now being brought to bear on questions of genetic history. More powerful computational techniques mean that the complexity of the models can be cranked up. This is evident in a recent paper, "Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data":

Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40-270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17-43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3-26.9 kya), and our analysis yields no evidence for subsequent migration. 
Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).

The author summary is a bit plainer:

The demographic history of our species is reflected in patterns of genetic variation within and among populations. We developed an efficient method for calculating the expected distribution of genetic variation, given a demographic model including such events as population size changes, population splits and joins, and migration. We applied our approach to publicly available human sequencing data, searching for models that best reproduce the observed patterns. Our joint analysis of data from African, European, and Asian populations yielded new dates for when these populations diverged. In particular, we found that African and Eurasian populations diverged around 100,000 years ago. This is earlier than other genetic studies suggest, because our model includes the effects of migration, which we found to be important for reproducing observed patterns of variation in the data. We also analyzed data from European, Asian, and Mexican populations to model the peopling of the Americas. Here, we find no evidence for recurrent migration after East Asian and Native American populations diverged. Our methods are not limited to studying humans, and we hope that future sequencing projects will offer more insights into the history of both our own species and others

The basic building block of this paper is the "allele frequency spectrum," or AFS. The text explains it rather well:

Given a genetic region sequenced in multiple individuals from each of P populations, the resulting AFS is a P-dimensional matrix. Each entry of this matrix records the number of diallelic genetic polymorphisms in which the derived allele was found in the corresponding number of samples from each population. For example, if diploid individuals from two populations were sequenced, with 10 individuals from population 1 and 5 from population 2, the AFS would be a 21-by-11 matrix (indexed from 0). The [2,0] entry would record the number of polymorphisms for which the derived allele was seen twice in population 1 but never seen in population 2, while the [20,5] entry would record polymorphisms for which the derived allele was homozygous in all individuals from population 1 and seen 5 times in population 2. If all polymorphic sites possess only two alleles and can be considered independent, the AFS is a complete summary of the data. Many of the statistics commonly used for population genetic inference, such as FST and Tajima's D, are summaries of the AFS.
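To make the bookkeeping in that passage concrete, here is a minimal sketch of tallying a two-population AFS. The function name and the per-site counts are made up for illustration; this is not the authors' code, just the counting rule the quote describes.

```python
def joint_afs(counts_pop1, counts_pop2, n_chrom1, n_chrom2):
    """Tally a two-population allele frequency spectrum.

    counts_pop1[k] and counts_pop2[k] are the numbers of sampled
    chromosomes carrying the derived allele at site k.
    (Hypothetical helper, not part of any published package.)
    """
    # One row per possible count in population 1, one column per
    # possible count in population 2, both indexed from 0.
    afs = [[0] * (n_chrom2 + 1) for _ in range(n_chrom1 + 1)]
    for c1, c2 in zip(counts_pop1, counts_pop2):
        afs[c1][c2] += 1
    return afs


# 10 diploid individuals from population 1 (20 chromosomes) and
# 5 from population 2 (10 chromosomes) give a 21-by-11 matrix.
pop1 = [2, 2, 20, 1, 0]  # made-up derived-allele counts, one per site
pop2 = [0, 0, 5, 3, 1]
afs = joint_afs(pop1, pop2, n_chrom1=20, n_chrom2=10)
```

Here afs[2][0] counts the sites where the derived allele was seen twice in population 1 and never in population 2, and afs[20][5] counts the sites where it was fixed in the population 1 sample and seen 5 times in population 2.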

The model also uses diffusion equations to project change in gene frequencies, a technique first utilized by R. A. Fisher. Since Fisher's previous research had been in thermodynamics, it was natural for him to analogize change in gene frequencies to the sort of fluxes which one sees in heat-related processes. It seems that the combination of approaches that they used here is moderately scalable:
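For readers who want to see what the diffusion is approximating, here is a small sketch of the discrete neutral Wright-Fisher process itself for a single population, with the whole probability distribution over derived-allele copy number pushed forward one generation at a time. This is the exact discrete model on a toy population of 20 chromosomes, not the paper's diffusion solver, and the names and parameters are illustrative assumptions.

```python
from math import comb


def wf_transition(i, two_n):
    """Binomial sampling: probability of j derived copies next
    generation, given i copies now, among two_n chromosomes."""
    p = i / two_n
    return [comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
            for j in range(two_n + 1)]


def step(dist, two_n):
    """Advance the distribution over copy number by one generation."""
    new = [0.0] * (two_n + 1)
    for i, mass in enumerate(dist):
        if mass == 0.0:
            continue
        for j, t in enumerate(wf_transition(i, two_n)):
            new[j] += mass * t
    return new


two_n = 20                    # toy population: 20 chromosomes
generations = 10
dist = [0.0] * (two_n + 1)
dist[10] = 1.0                # start with the derived allele at 50%

for _ in range(generations):
    dist = step(dist, two_n)

# Drift leaves the mean frequency unchanged but erodes heterozygosity.
mean_freq = sum((j / two_n) * m for j, m in enumerate(dist))
het = sum(2 * (j / two_n) * (1 - j / two_n) * m for j, m in enumerate(dist))
```

Under pure drift the mean frequency stays at its starting value while the expected heterozygosity shrinks by a factor of (1 - 1/2N) each generation, which is the spreading-out of probability mass that the heat analogy captures.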

The computational advantage of the diffusion method is even larger when placed in the context of parameter optimization. Unlike the coalescent approach, there is no simulation variance, so efficient derivative-based optimization methods can be used. As examples, consider our applications to human data, which involve 20 samples per population. On a modern workstation, fitting a single-population three-parameter model took roughly a minute, while fitting a two-population six-parameter model took roughly 10 minutes. The fits of three-population models with roughly a dozen parameters typically took a few hours to converge from a reasonable initial parameter set. This speed allows us to use extensive bootstrapping to estimate variances, overcoming the limitations of composite likelihood.
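The fitting step can be sketched in a few lines. This is a toy, assuming each AFS entry is treated as an independent Poisson count whose mean is the model's expected value, and using the textbook constant-size neutral expectation (theta / i for a derived allele seen i times) in place of the paper's diffusion-derived spectrum. The observed counts are made up.

```python
from math import lgamma, log


def poisson_loglik(observed, expected):
    """Composite log-likelihood: each entry an independent Poisson count."""
    return sum(s * log(m) - m - lgamma(s + 1)
               for s, m in zip(observed, expected))


def neutral_expected(theta, n):
    """Expected single-population spectrum for n chromosomes under a
    constant-size neutral model: theta / i sites with i derived copies."""
    return [theta / i for i in range(1, n)]


n = 6                              # sampled chromosomes
observed = [52, 24, 18, 11, 9]     # made-up counts for i = 1..5

# With theta as the only parameter the maximum has a closed form:
# total segregating sites divided by the harmonic sum.
theta_hat = sum(observed) / sum(1 / i for i in range(1, n))

ll_hat = poisson_loglik(observed, neutral_expected(theta_hat, n))
ll_low = poisson_loglik(observed, neutral_expected(0.9 * theta_hat, n))
ll_high = poisson_loglik(observed, neutral_expected(1.1 * theta_hat, n))
```

Because the expected spectrum is computed deterministically, the likelihood is a smooth function of the parameters, which is what lets a derivative-based optimizer do the work once a model has too many parameters for a closed-form answer.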

They did have to stop at 3 populations because of computational problems when attempting to analyze 4 populations. When analyzing the Mexican-American data set they dropped the African sample and fitted the European, East Asian, and Mexican-American populations instead. It seems the biggest step forward here is that they are able to add a host of evolutionary genetic parameters to their model, in particular selection and migration, without too much difficulty. The technique also seems robust to the deviations which selection or linkage would introduce, given that population fission is inferred from markers assumed to be evolving neutrally. Previous models of population movements out of Africa were often unrealistic insofar as it was assumed that populations were exclusively fissiparous and that back-migration would never occur. That is fine as far as it goes, but if it is possible to construct models which don't explode because of computational intensity then it seems prudent to do so.

Reading the whole paper through it seems that this is more a signpost to the future in method rather than a treasure-trove of results. Shifting particular demographic events by a few tens of thousands of years is important on the margin, but with the confidence intervals I'm not sure what to make of the results on their own (though in concert with other findings they are of more interest). Instead, consider that they've put the software which they used to generate these results online:

Diffusion Approximation for Demographic Inference (∂a∂i)

Citation: Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD, 2009 Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data. PLoS Genet 5(10): e1000695. doi:10.1371/journal.pgen.1000695



"The discovery of DNA as the substrate which encodes genetic information in the 1950s": wasn't its role realised earlier? C & W attacked its structure because its importance had already been recognised. (I'm sure you'll correct me if I'm wrong.)

P.S. What do you make of Steve Jones's article in today's Telegraph?…

By bioIgnoramus (not verified) on 26 Oct 2009

Ig, negatory. Protein was the more popular option prior to Oswald Avery's experiment in '44.

DNA was widely considered too simple in structure to provide the genetic code. This was really pretty daft. Today I doubt this would happen because we are so cognizant that all objective information can be expressed using only two symbols.

I seem to recall that Avery's result using bacterial transformation was not highly convincing to all people, and that the Hershey-Chase phage experiment in '52 was important in making Avery's finding into a universal assumption. This experiment showed that T2 phage infected E. coli using only DNA. The phage virion also contains protein, which remains non-covalently bound to the outer surface of the bacterium, and could be sheared off by Hershey and Chase using a kitchen blender to generate strong fluid turbulence.

By Eric Johnson (not verified) on 26 Oct 2009

"If all polymorphic sites possess only two alleles and can be considered independent..."

The first seems ok, but I don't know how to evaluate the second assumption. Their SNPs are presumably spaced such that physical linkage can be ignored, but what about other sources of disequilibrium? Honestly, I really don't understand their model well enough to determine if this is a potentially serious problem. As a kind of approach, though, this is where we need to be going for sure. Cool paper.


You know, you bring up a sorta funny point. It seems like a lot of pop gen studies are either 1) ignoring disequilibrium entirely or 2) using disequilibrium to make inferences. It's sometimes comforting that the two approaches converge on similar results... sometimes...

You know, I just realized that this paper uses a forward diffusion to find the allele frequency spectrum. There was a paper a few years ago ( ) that did that too, but they spent pages on proving the boundary conditions. In this paper they just say "Because the diffusion equation is linear, we can solve simultaneously for the evolution of all polymorphism by continually injecting density at low frequency in each population (at a rate proportional to the total mutation flux ), corresponding to novel mutations."

While this turns out to be correct (cf. the paper I linked), it's sort of funny that some people are just like "eh... I'll just do this and hope it works."

I have no criticisms of the model used, nor of the calculated separation of various humans. It could be fined down eventually to bring the times more in line with the archaeological evidence for the Americas, as the colonisation of America by humans is better understood and has fewer stupid and racist viewpoints attached to it. Europe's colonisation has racist and racialist viewpoints attached to it: the Paleolithic true blue European versus the Neolithic blow-ins from the Middle East. Personally I could not understand how the archaeological evidence of early human colonisations proves anything about their continuation to modern Europeans. Cro Magnon may have been modern and looked Caucasoid, actually more like a hybrid African/Mongoloid/Caucasoid, but he and his kind probably went extinct like the Neanderthals. Modern Europeans appear to be the descendants of human colonisation from Southwest and Central Asia as from the Mesolithic, which continued with the advent of the Holocene and the Neolithic farming lifestyle. The date of separation of Caucasoids from Mongoloids, or West Eurasians from East Eurasians, seems just about right to me.