Do It Yourself: searching for evolution's signature in 53 human populations

Note: I'm introducing Do It Yourself as a new and hopefully semi-regular section on Genetic Future. The aim is to provide readers with instructions on how to access online resources for sequence analysis - an activity traditionally restricted to researchers, but one that will no doubt become more common as more and more people begin to access and interpret their own genetic data.

In this post I'll introduce the brand new HGDP Selection Browser, a tool produced by researchers at the University of Chicago for exploring traces of recent positive selection in the human genome.

Introduction: the traces of positive selection in the human genome
Over the last 50,000-100,000 years the human species has successfully spread from its African homeland to colonise nearly every corner of the globe, including environments ranging from desert to tundra. Adaptation to these diverse environments, along with major dietary changes, rapid population growth and exposure to a range of novel infectious diseases, has left its mark on the human genome. Essentially, all of us carry a molecular record of our ancestors' adaptation - and with the recent growth of databases of human genetic variation, we can actually determine (albeit imperfectly) which genes played a role in this process.

Genetic variants that offer a benefit to the individuals who carry them - such that they have, on average, more surviving offspring than non-carriers - will tend to increase in frequency in a population through positive natural selection. This relatively rapid increase in frequency has a substantial impact on the region of the genome immediately around the selected variant, resulting in what's known as a "selective sweep": a local reduction in genetic diversity, and an elevation of long-range linkage disequilibrium.

(You can think of these signatures as a consequence of a selected variant being younger, on average, than a neutral variant at the same frequency, because it has risen in frequency so rapidly. The low genetic diversity and high linkage disequilibrium result from the fact that there hasn't been much time for mutation and recombination, respectively, to act on the section of the genome closely linked to the selected variant.)

If the selection is restricted to just one or a few populations (due to a specific environmental pressure, or simply a lack of the beneficial variant in other populations) there will also be an increase in population differentiation in the region; in other words, in this portion of the genome, human populations will tend to look more different to one another than they do in other areas.
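To make "population differentiation" a bit more concrete, here is a minimal Python sketch of Wright's Fst for a single two-allele SNP, calculated from the allele frequency in each population. It assumes every population is weighted equally, the function name and the example frequencies are made up for illustration, and it is not necessarily the estimator used by the Chicago team.

```python
def fst_biallelic(freqs):
    """Wright's Fst for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    Assumption: populations are weighted equally (no sample-size correction).
    """
    if not freqs:
        raise ValueError("need at least one population frequency")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity if all populations were pooled into one.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The SNP is fixed everywhere, so there is no differentiation to measure.
        return 0.0
    # Average expected heterozygosity within each population.
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


# Hypothetical frequencies: identical populations vs. strongly diverged ones.
print(fst_biallelic([0.5, 0.5]))
print(fst_biallelic([0.9, 0.1]))
```

A value near 0 means the populations look alike at that SNP; a value near 1 means they are close to fixed for different alleles.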

These classic signatures of a selective sweep have been used to hunt for genes subject to recent positive selection in humans by many groups, taking advantage of genetic variation data from the HapMap project and various other sources. The results of these analyses have been used to argue that recent human evolution has been characterised by pervasive, often population-specific positive selection, presumably resulting from the adaptation of modern humans to diverse, novel environments outside of Africa.

What's been lacking in most of these studies is analysis of a broad range of human populations - most attempts, by necessity, have been restricted to the European, East Asian and West African populations sampled by the HapMap project. That's changed now with the arrival earlier in the year of data on 650,000 genetic markers (SNPs) in 938 individuals from over 50 populations, using DNA samples from the Human Genome Diversity Panel.

These genome-wide data have now been analysed for various signatures of recent selection by a team at the University of Chicago - and while the publication isn't out yet, the team has generously made their data available through a nifty online browser. It's worth noting that the same group previously produced the Haplotter browser, which allows you to examine signatures of selection in the four HapMap populations (for an introduction to Haplotter and other online resources, see the So you want to be a population geneticist? post on GNXP).

Using the HGDP Selection Browser
If you want to see if your favourite gene shows a signature of population-specific positive selection, here's how the browser works:

Enter the name of a gene or a SNP into the window at the top left of the browser and it will take you to that region of the genome. You can then use the "Scroll/Zoom" function to move around or view a specific window size ranging from 100,000 to 5 million bases. Here I've taken a snapshot of a 1 Mb (1 million base) region around the EDAR gene - I find that a 1 Mb window size is a pretty good default for getting an idea of how the region looks:

[Image: HGDP Selection Browser view of a 1 Mb region around the EDAR gene]

EDAR is a classic example of recent population-specific selection in humans, emerging as a major East Asian-specific outlier in several large-scale studies of the human genome. (The reason for the selection is still unclear, but a common protein-altering variant in EDAR is associated with variation in hair morphology and affects cellular signalling.)

The picture from the browser is thoroughly consistent with a selective sweep restricted largely to East Asia: many of the markers within or close to the EDAR gene show unusually high Fst (a measure of population differentiation), and the region as a whole displays elevated iHS and XP-EHH scores (both measures of extended linkage disequilibrium, with XP-EHH being particularly powerful for detecting population-specific selection) that are most dramatic in the East Asian population (green). (For all three measures the value plotted is the -log10 of the empirical P value for each SNP - so the higher the score, the more extreme an outlier that region is relative to the rest of the genome.)

By clicking on individual SNPs in the "Genotyped SNPs" row of the browser you can generate a map of the distribution of frequencies in the various surveyed populations - here's the map for one of the more interesting-looking variants in EDAR:

[Image: map of the frequencies of an EDAR SNP across the surveyed populations]

You can see immediately that the more evolutionarily recent or "derived" version of this SNP is restricted mainly to East Asia and the Americas (where the native populations are relatively recent migrants from East Asia). That makes sense for a region where selection appears to have been almost entirely restricted to East Asian populations.

To drill down a bit further, click on the buttons next to "Haplotype plots" (towards the bottom of the page, just above the display settings) - you can either look at populations individually, or grouped together into 7 continental clusters. After some calculations (based on an algorithm from this paper) you end up with some rather complicated plots that look like this:

[Image: haplotype plots around EDAR for the continental clusters]

These plots take a bit of getting used to, but here's the gist: each row in the plot represents an individual chromosome in the sample (for most genes that means two rows per individual), and sections of the row are coloured depending on whether they share the same combination of variants (also known as a haplotype). A variant that has been a target of recent positive selection should show a long, largely uninterrupted haplotype at high frequency in the population - and that is precisely what you see for the green haplotype in East Asians (the top panel). In the Middle East and South Asia (bottom two panels), where there has been no selection on the EDAR variant, there is a chaotic arrangement with no sign of a dominant, long-range haplotype - in other words, the sort of pattern you would expect under neutral evolution.
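The "long, largely uninterrupted haplotype" idea can be expressed numerically with extended haplotype homozygosity (EHH), the quantity on which both iHS and XP-EHH are built. Here is a rough Python sketch: among the chromosomes carrying a given allele at a core SNP, what fraction of pairs are identical all the way from the core out to some other marker? The function name and the toy haplotypes are invented for illustration, and this is not the algorithm the browser uses to draw its plots.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, core_allele, end):
    """Extended haplotype homozygosity from marker `core` out to marker `end`.

    haplotypes: equal-length strings, one per chromosome, e.g. "01101".
    Only chromosomes carrying `core_allele` at position `core` are used.
    Returns the fraction of carrier pairs identical across the whole interval.
    """
    carriers = [h for h in haplotypes if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        # No pairs to compare; treated as 0.0 in this sketch.
        return 0.0
    lo, hi = min(core, end), max(core, end)
    # Count how many carriers share each distinct haplotype on the interval.
    counts = Counter(h[lo:hi + 1] for h in carriers)
    identical_pairs = sum(comb(c, 2) for c in counts.values())
    return identical_pairs / comb(n, 2)


# Toy sample of six chromosomes; the core SNP is the first marker.
sample = ["11111", "11111", "11111", "11110", "01010", "00110"]
print(ehh(sample, core=0, core_allele="1", end=3))  # carriers still identical
print(ehh(sample, core=0, core_allele="1", end=4))  # one carrier has diverged
```

A selected allele should keep a high EHH over a long distance, while a neutral allele at the same frequency should decay towards zero much more quickly as you move away from the core.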

I've included massively expanded versions of the East Asian and Middle Eastern haplotype plots at the end of this post - you can obtain these expanded plots simply by clicking on the relevant graph.

Has your favourite gene been recently selected?
I asked a member of the team who put the browser together what he saw as the basic rules of thumb for deciding whether or not a region was a target of recent selection. He noted that the process is still more of an art than a science (which is understandable given the limitations and complexities of the data, and the fact that weakly selected variants will look essentially identical to unselected regions). However, he has this advice:

I'm most convinced if there's a strong (ie. empirical p in the top 1%, which is a 2 on the scales used in the browser) iHS or XP-EHH score and the haplotype plots look "good"--there's a long haplotype at high frequency in the populations with the significant score and not elsewhere. A big Fst also contributes to making the evidence stronger.

There are no absolute criteria distinguishing selected from neutral sites with perfect accuracy, so interpret with caution - but if you already have reason to suspect that a genetic variant might have been a recent target of selection (e.g. that most nebulous form of evidence, "biological plausibility"), this browser can give you a quick sense of whether this might be a possibility worth pursuing further.
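The numerical part of the rule of thumb quoted above is easy to check: an empirical P value in the top 1% is 0.01, and -log10(0.01) is 2. The Python sketch below works through that arithmetic and applies the threshold to a pair of scores. The function names and example scores are hypothetical, and the sketch deliberately ignores the part of the advice that cannot be automated, namely whether the haplotype plots look "good".

```python
import math


def top_fraction_to_score(fraction):
    """Convert a genome-wide top fraction (e.g. 0.01 for the top 1%)
    into the -log10 scale plotted in the browser."""
    return -math.log10(fraction)


def passes_score_rule(ihs_score, xpehh_score, threshold=2.0):
    """True if either the iHS or the XP-EHH score (both already on the
    -log10 empirical P scale) reaches the threshold. This covers only the
    score half of the rule of thumb; the haplotype plots still need a
    human eye, and a large Fst is supporting evidence, not a requirement."""
    return max(ihs_score, xpehh_score) >= threshold


print(top_fraction_to_score(0.01))   # the "2" mentioned in the quote
print(passes_score_rule(2.3, 0.4))   # hypothetical region with a strong iHS
print(passes_score_rule(1.2, 1.9))   # hypothetical region below threshold
```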

A manuscript based on the data presented in this browser is currently in the works, with findings that I think will present a challenge to many in the human evolutionary genetics community - more on that later.


Here are the expanded haplotype plots for EDAR, showing strong evidence for selection on the green haplotype in East Asians, while individuals from the Middle East show no trace of recent selection:

[Image: East Asia (green haplotype under selection)]

[Image: Middle East (no strong recent selection)]

Face images modified from thumbnails at Face Research.


dude, come on - i didn't even cluster them by skin colour. it's just an innocent illustration of the beautiful, clinal rainbow of human genetic diversity...

To introduce new tools like this one is a brilliant idea. Cheers.

By Psychobobas on 11 Nov 2008

Hi Daniel,

Great post. Mendel's Garden, the genetics blog carnival, is seeking the best genetics posts in the blogosphere. I'm hosting the December edition and am looking for submissions. If you'd be interested in having a post featured, please e-mail me your latest, greatest to chris (at) afreeman (dot) org.

Chris

A nice post, I look forward to more in the series.

Thanks Daniel!

This is great! I'll have to give it a try.