Note: I’m introducing Do It Yourself as a new and hopefully semi-regular section on Genetic Future. The aim is to provide readers with instructions on how to access online resources for sequence analysis – an activity traditionally restricted to researchers, but one that will no doubt become more common as more and more people begin to access and interpret their own genetic data.
In this post I’ll introduce the brand new HGDP Selection Browser, a tool produced by researchers at the University of Chicago for exploring traces of recent positive selection in the human genome.
Introduction: the traces of positive selection in the human genome
Over the last 50,000-100,000 years the human species has successfully spread from its African homeland to colonise nearly every corner of the globe, including environments ranging from desert to tundra. Adaptation to these diverse environments, along with major dietary changes, rapid population growth and exposure to a range of novel infectious diseases, has left its mark on the human genome. Essentially, all of us carry a molecular record of our ancestors’ adaptation – and with the recent growth of databases of human genetic variation, we can actually determine (albeit imperfectly) which genes played a role in this process.
Genetic variants that offer a benefit to the individuals who carry them – such that they have, on average, more surviving offspring than non-carriers – will tend to increase in frequency in a population through positive natural selection. This relatively rapid increase in frequency has a substantial impact on the region of the genome immediately around the selected variant, resulting in what’s known as a “selective sweep”: a local reduction in genetic diversity, and an elevation of long-range linkage disequilibrium.
(You can think of these signatures as being a consequence of a selected variant being younger (on average) than a neutral region at the same frequency, due to its rapid increase in frequency. The low genetic diversity and high linkage disequilibrium result from the fact that there hasn’t been much time for mutation and recombination, respectively, to act on the section of the genome closely linked to the selected variant.)
If the selection is restricted to just one or a few populations (due to a specific environmental pressure, or simply a lack of the beneficial variant in other populations) there will also be an increase in population differentiation in the region; in other words, in this portion of the genome, human populations will tend to look more different to one another than they do in other areas.
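To make "population differentiation" a bit more concrete, here is a minimal sketch of how a differentiation score can be computed for a single SNP from its allele frequency in two populations. It uses the textbook heterozygosity-based form of Fst and assumes two equally weighted populations; the function name and the example frequencies are my own inventions, and this is not necessarily the estimator the browser itself uses.

```python
def fst_two_populations(p1, p2):
    """Heterozygosity-based Fst for one biallelic SNP.

    p1 and p2 are the frequencies of the same allele in two populations,
    which are weighted equally here (an assumption of this sketch).
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled together.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The SNP does not vary at all, so there is nothing to differentiate.
        return 0.0
    # Average expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Made-up frequencies: one SNP that looks alike in both populations,
# and one that is rare in the first and common in the second.
similar = fst_two_populations(0.50, 0.55)
divergent = fst_two_populations(0.05, 0.90)
print(f"similar frequencies:   Fst = {similar:.3f}")
print(f"divergent frequencies: Fst = {divergent:.3f}")
```

The SNP with near-identical frequencies scores close to zero, while the one that differs sharply between populations scores much closer to one, which is the kind of outlier a population-specific sweep would be expected to produce.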
These classic signatures of a selective sweep have been used to hunt for genes subject to recent positive selection in humans by many groups, taking advantage of genetic variation data from the HapMap project and various other sources. The results of these analyses have been used to argue that recent human evolution has been characterised by pervasive, often population-specific positive selection, presumably resulting from the adaptation of modern humans to diverse, novel environments outside of Africa.
What’s been lacking in most of these studies is analysis of a broad range of human populations – most attempts, by necessity, have been restricted to the European, East Asian and West African populations sampled by the HapMap project. That’s changed now with the arrival earlier in the year of data on 650,000 genetic markers (SNPs) in 938 individuals from over 50 populations, using DNA samples from the Human Genome Diversity Panel.
These genome-wide data have now been analysed for various signatures of recent selection by a team at the University of Chicago – and while the publication isn’t out yet, the team has generously made their data available through a nifty online browser. It’s worth noting that the same group previously produced the Haplotter browser, which allows you to examine signatures of selection in the four HapMap populations (for an introduction to Haplotter and other online resources, see the So you want to be a population geneticist? post on GNXP).
Using the HGDP Selection Browser
If you want to see if your favourite gene shows a signature of population-specific positive selection, here’s how the browser works:
Enter the name of a gene or a SNP into the window at the top left of the browser and it will take you to that region of the genome. You can then use the “Scroll/Zoom” function to move around or view a specific window size ranging from 100,000 to 5 million bases. Here I’ve taken a snapshot of a 1 Mb (1 million base) region around the EDAR gene – I find that a 1 Mb window size is a pretty good default for getting an idea of how the region looks:
EDAR is a classic example of recent population-specific selection in humans, emerging as a major East Asian-specific outlier in several large-scale studies of the human genome. (The reason for the selection is still unclear, but a common protein-altering variant in EDAR is associated with variation in hair morphology and affects cellular signalling.)
The picture from the browser is thoroughly consistent with a selective sweep restricted largely to East Asia: many of the markers within or close to the EDAR gene show unusually high Fst (a measure of population differentiation), and the region as a whole displays elevated iHS and XP-EHH scores (both measures of extended linkage disequilibrium, with XP-EHH being particularly powerful for detecting population-specific selection) that are most dramatic in the East Asian population (green). (For all three measures the value plotted is the -log10 of the empirical P value for each SNP – so the higher the score, the more extreme an outlier that region is relative to the rest of the genome.)
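If the "-log10 of the empirical P value" is unfamiliar, the following sketch shows one simple way such a score can be derived: rank each SNP's raw statistic against every other SNP in the genome, take the fraction that score at least as high, and apply -log10. The function name and the toy scores are assumptions for illustration; the browser's own calculation may differ in its details (for example in how ties or windows of SNPs are handled).

```python
import bisect
import math


def empirical_neg_log10_p(scores):
    """Return -log10(empirical P) for each raw score in `scores`.

    The empirical P value of a SNP is taken here to be the fraction of
    all SNPs whose raw score is at least as large as its own.
    """
    n = len(scores)
    ordered = sorted(scores)
    result = []
    for s in scores:
        # Count how many scores are >= s using the sorted copy.
        count_at_least = n - bisect.bisect_left(ordered, s)
        result.append(-math.log10(count_at_least / n))
    return result


# Toy "genome" of 1000 SNPs with raw scores 1..1000 (invented data).
raw = list(range(1, 1001))
transformed = empirical_neg_log10_p(raw)

# The most extreme SNP is in the top 0.1%, the 10th most extreme is in
# the top 1%, and the least extreme SNP has an empirical P of 1.
print(round(transformed[-1], 3))   # highest raw score
print(round(transformed[-10], 3))  # 10th highest raw score
print(round(transformed[0], 3))    # lowest raw score
```

On this scale a SNP in the top 1% of the genome gets a value of about 2 and one in the top 0.1% gets about 3, so taller peaks in the browser mean more extreme outliers.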
By clicking on individual SNPs in the “Genotyped SNPs” row of the browser you can generate a map of the distribution of frequencies in the various surveyed populations – here’s the map for one of the more interesting-looking variants in EDAR:
You can see immediately that the more evolutionarily recent or “derived” version of this SNP is restricted mainly to East Asia and the Americas (where the native populations are relatively recent migrants from East Asia). That makes sense for a region where selection appears to have been almost entirely restricted to East Asian populations.
To drill down a bit further, click on the buttons next to “Haplotype plots” (towards the bottom of the page, just above the display settings) – you can either look at populations individually, or grouped together into 7 continental clusters. After some calculations (based on an algorithm from this paper) you end up with some rather complicated plots that look like this:
These plots take a bit of getting used to, but here’s the gist: each row in the plot represents an individual chromosome in the sample (for most genes that means two rows per individual), and sections of the row are coloured depending on whether they share the same combination of variants (also known as a haplotype). A variant that has been a target of recent positive selection should show a long, largely uninterrupted haplotype at high frequency in the population – and that is precisely what you see for the green haplotype in East Asians (the top panel). In the Middle East and South Asia (bottom two panels), where there has been no selection on the EDAR variant, there is a chaotic arrangement with no sign of a dominant, long-range haplotype – in other words, the sort of pattern you would expect under neutral evolution.
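The core idea behind these plots can be sketched in a few lines: treat each chromosome as a string of alleles across the region, count how often each distinct string (haplotype) occurs, and see whether one of them dominates. The haplotype strings below are invented toy data, and this simple counting is only an illustration of the concept, not the algorithm the browser uses to build and colour its plots.

```python
from collections import Counter


def dominant_haplotype(chromosomes):
    """Return the most common haplotype and its frequency in the sample."""
    counts = Counter(chromosomes)
    haplotype, n = counts.most_common(1)[0]
    return haplotype, n / len(chromosomes)


# Toy sample resembling a sweep: one haplotype carried by most chromosomes.
swept = ["AACGT"] * 8 + ["AACGA", "TACGT"]

# Toy sample resembling neutral evolution: every chromosome is different.
neutral = [
    "AACGT", "TACGA", "ATCGT", "AAGGT", "TACGT",
    "AACGA", "ATGGT", "TAGGA", "AAGGA", "ATCGA",
]

for label, sample in [("swept", swept), ("neutral", neutral)]:
    haplotype, freq = dominant_haplotype(sample)
    print(f"{label}: most common haplotype {haplotype} at frequency {freq:.0%}")
```

In the swept sample a single haplotype accounts for most chromosomes, which corresponds to the big block of one colour in the East Asian panel; in the neutral sample no haplotype stands out, which corresponds to the patchwork seen in the other panels.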
I’ve included massively expanded versions of the East Asian and Middle Eastern haplotype plots at the end of this post – you can obtain these expanded plots simply by clicking on the relevant graph.
Has your favourite gene been recently selected?
I asked a member of the team who put the browser together what he saw as the basic rules of thumb for deciding whether or not a region was a target of recent selection. He noted that the process is still more of an art than a science (which is understandable given the limitations and complexities of the data, and the fact that weakly selected variants will look essentially identical to unselected regions). However, he has this advice:
I’m most convinced if there’s a strong (ie. empirical p in the top 1%, which is a 2 on the scales used in the browser) iHS or XP-EHH score and the haplotype plots look “good”–there’s a long haplotype at high frequency in the populations with the significant score and not elsewhere. A big Fst also contributes to making the evidence stronger.
There are no absolute criteria distinguishing selected from neutral sites with perfect accuracy, so interpret with caution – but if you already have reason to suspect that a genetic variant might have been a recent target of selection (e.g. that most nebulous form of evidence, “biological plausibility”), this browser can give you a quick sense of whether this might be a possibility worth pursuing further.
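As a rough illustration of the numerical part of that rule of thumb, the sketch below flags populations in which either the iHS or the XP-EHH score reaches 2 on the browser's scale (an empirical P value in the top 1%). The function name, population labels and score values are all hypothetical, and the step of checking that the haplotype plots look "good" still has to be done by eye.

```python
def populations_worth_a_look(ihs, xpehh, threshold=2.0):
    """Return populations whose iHS or XP-EHH score reaches the threshold.

    Both arguments map a population label to a score on the browser's
    -log10(empirical P) scale, where 2.0 corresponds to the top 1%.
    """
    flagged = []
    for population in sorted(set(ihs) | set(xpehh)):
        best = max(ihs.get(population, 0.0), xpehh.get(population, 0.0))
        if best >= threshold:
            flagged.append(population)
    return flagged


# Invented scores for a single region, read off by eye from the plots.
ihs_scores = {"East Asia": 2.6, "Europe": 0.4, "Middle East": 0.7}
xpehh_scores = {"East Asia": 3.1, "Europe": 0.9, "Middle East": 0.3}

print(populations_worth_a_look(ihs_scores, xpehh_scores))
```

Passing the threshold is only a reason to look more closely at the haplotype plots and Fst for that population, not evidence of selection on its own.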
A manuscript based on the data presented in this browser is currently in the works, with findings that I think will present a challenge to many in the human evolutionary genetics community – more on that later.
Here are those expanded haplotype plots for EDAR, showing strong evidence for selection on the green haplotype in East Asians, while individuals from the Middle East show no trace of recent selection:
East Asia (green haplotype under selection)
Middle East (no strong recent selection)
Face images modified from thumbnails at Face Research.