Where to look for regulatory variants

One of the major challenges of the personal genomic era will be knowing exactly which (if any) of the millions of genetic variants present in your genome are likely to actually have an impact on your health. Such predictions are particularly problematic for regulatory variants - genetic changes that alter the expression levels of genes, rather than the sequence of the protein they encode. A paper out in PLoS Genetics this week goes some way towards solving this problem by giving researchers a much better idea of exactly where they need to look for these variants.

The paper
The paper draws on a previously published data-set consisting of the expression levels of over 14,000 genes in 210 human cell lines used for the HapMap project. The use of HapMap cell lines, which have publicly available information on over 3 million variable sites throughout their genomes, has made this data-set an exceptionally powerful resource for finding genetic variants that influence gene expression levels.

In this study the authors set out to determine exactly where these expression-altering variants mapped relative to the genes they affected. For simplicity they focused on expression-altering variants found within 500,000 bases of the gene itself (so-called cis variants); gene expression can also be altered by variants in much more distant regions, but these are much more difficult to identify in practice and are thought to be substantially less common.
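To make the cis definition concrete, here is a minimal sketch of how a variant might be classified as cis for a given gene. The `Gene` class, the function names and the choice to treat the 500,000-base cutoff as inclusive are my own assumptions for illustration; this is not the authors' code.

```python
from dataclasses import dataclass
from typing import Optional

CIS_WINDOW = 500_000  # bases; the cutoff described in the post


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record (not from the paper)."""
    chrom: str
    start: int  # leftmost genomic coordinate of the gene
    end: int    # rightmost genomic coordinate of the gene


def distance_to_gene(gene: Gene, chrom: str, pos: int) -> Optional[int]:
    """Distance in bases from a variant to the gene.

    Returns 0 for a variant inside the gene and None for a variant
    on a different chromosome.
    """
    if chrom != gene.chrom:
        return None
    if pos < gene.start:
        return gene.start - pos
    if pos > gene.end:
        return pos - gene.end
    return 0


def is_cis(gene: Gene, chrom: str, pos: int, window: int = CIS_WINDOW) -> bool:
    """True if the variant lies within `window` bases of the gene.

    Assumption: the cutoff is inclusive.
    """
    d = distance_to_gene(gene, chrom, pos)
    return d is not None and d <= window
```

Anything that fails this check (a variant further away, or on another chromosome) would fall into the harder-to-detect distant category mentioned above.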

The study involves some fairly detailed analysis, which you can read about yourself through the magic of open access - but here's the figure I think is the most interesting:


[Figure: relabelled version of Figure 4 from Veyrieras et al.]

I've relabelled it a little bit for clarity, but it still needs some explanation. Firstly, TSS and TES stand for "transcriptional start site" and "transcriptional end site" respectively - loosely, the beginning and the end of the gene. In this figure the authors are summarising data from the start and end sites of 11,446 genes, mapped onto a single gene model (summarised at the very top of the image). In all of the panels the areas inside the gene are shown in green, while the areas outside the gene are black.
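If the idea of mapping thousands of genes onto a single gene model seems abstract, the sketch below shows one simple way to do it: express each variant's position relative to the TSS and TES of its gene, measured in the direction of transcription, so that genes on either strand line up. The `Placement` type and `place_on_gene_model` function are hypothetical names of my own, and the paper's actual procedure may well differ in its details.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    region: str    # "upstream", "inside" or "downstream"
    from_tss: int  # signed offset from the TSS, in the direction of transcription
    from_tes: int  # signed offset from the TES, in the direction of transcription


def place_on_gene_model(tss: int, tes: int, strand: str, pos: int) -> Placement:
    """Locate a variant relative to a gene's start and end sites.

    For a gene on the minus strand the TSS has the larger genomic
    coordinate, so offsets are flipped to follow transcription.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    sign = 1 if strand == "+" else -1
    from_tss = (pos - tss) * sign
    from_tes = (pos - tes) * sign
    if from_tss < 0:
        region = "upstream"      # before the start of the gene
    elif from_tes > 0:
        region = "downstream"    # beyond the end of the gene
    else:
        region = "inside"        # between TSS and TES (the green areas)
    return Placement(region, from_tss, from_tes)
```

Once every variant has offsets like these, variants from different genes can be pooled into bins around the TSS and the TES, which is roughly what each panel of the figure summarises.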

Part A of the figure shows the distribution of the genetic variants found to influence gene expression (formally, this graph plots the probability that a variant in a particular region will affect gene expression). These variants were typically found either within or close to the gene itself, with less than 7% found more than 20,000 bases away from the gene they influence. But most importantly, the variants cluster strongly within particular areas: there is a strong and symmetrical region of enrichment around the TSS, and a strikingly asymmetrical enrichment around the TES with many more variants inside the gene than outside it.

Importantly, these two regions of genes also tend to be highly conserved across evolutionary time-scales. Part B of the figure shows the average number of base changes observed at each site across seven mammalian species, and you can see marked dips in substitution rates that match remarkably well with the peaks in the distribution of expression-altering variants. In other words, the most evolutionarily conserved regions are also the most likely to harbour variants that influence gene expression levels.

The association between effects on expression and evolutionary conservation is not a coincidence, of course - presumably these regions have been tightly constrained across evolutionary time precisely because changes in these areas can have a marked effect on gene expression (which will usually be deleterious, and thus rapidly purged by natural selection).

The authors go on to explore possible mechanisms for the observed enrichment. The peak around the TSS is readily explicable since it corresponds to a peak in the binding of many important transcription factors (proteins that regulate gene expression). The dramatic, asymmetrical spike at the TES is somewhat harder to explain, but the rapid drop-off beyond the end of the gene suggests that this corresponds to effects on the RNA molecules made from the gene rather than processes acting at the DNA level. The authors argue that variants in this region probably act through effects on the stability of RNA, a process that is much less well-characterised than the regulation of RNA production.

(As an aside: the strong signal at the TES is certainly the most surprising finding from the study for me, but I'm not that familiar with the area - I'd be interested to hear if any RNA biologists in the audience would have predicted the magnitude of this finding in advance.)

One of the important caveats noted by the authors is that the genetic variation data here are not complete, but rather represent the biased sub-set of genetic variants assayed by the HapMap project (with the primary bias being towards common rather than rare variants). That means that in many cases the actual variant responsible for the expression change hasn't been examined yet, reducing the power of this study - and indicating that analyses of high-coverage sequence data will yield more powerful insights into the genetic control of gene expression. Such an analysis can't be far away given that rough whole-genome sequence data for all of these individuals and high-coverage sequence of some of the regions will soon be generated as part of the 1000 Genomes Project.

Implications for personal genomics
The era of cheap whole-genome sequencing is now rushing at us with astonishing speed, and a non-trivial proportion of those reading this post will probably have at least a rough draft of their own genome sequence within five years. However, turning those sequences into useful medical information - in other words, figuring out which of the genetic differences between people explain differences in disease susceptibility - will take a lot longer than that.

For common variants the problem of assigning function is relatively trivial, at least in theory: these can be picked up by current genome-wide association studies, and if researchers consistently see a variant more frequently in disease patients than in controls it's likely to be a risk variant. Unfortunately that approach starts to break down with risk variants that are individually rare, being present in less than 1% of the population. The power of current methods to find rare variants is exceptionally low, and even with whole-genome sequencing around the corner the challenges remain profound.
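The case-versus-control comparison described above can be sketched as a test on a 2x2 table of allele counts. This is a deliberately bare-bones illustration using a Pearson chi-square test with one degree of freedom; the function name and its arguments are my own, and real association studies apply considerably more careful statistics and quality control than this.

```python
import math
from typing import Tuple


def allele_association(case_alt: int, case_ref: int,
                       control_alt: int, control_ref: int) -> Tuple[float, float, float]:
    """Compare allele counts in cases and controls.

    Returns (odds_ratio, chi_square, p_value) for a 2x2 table, using a
    Pearson chi-square test with 1 degree of freedom.
    Assumption: every cell is non-zero (a sketch, not a robust tool).
    """
    cells = [case_alt, case_ref, control_alt, control_ref]
    if min(cells) <= 0:
        raise ValueError("all four counts must be positive in this sketch")

    total = sum(cells)
    row_totals = [case_alt + case_ref, control_alt + control_ref]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    observed = [[case_alt, case_ref], [control_alt, control_ref]]

    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi_square += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi_square / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return odds_ratio, chi_square, p_value
```

Trying the function with the same relative effect but a much rarer allele shows the problem directly: with only a handful of copies of the variant in each group, the chi-square statistic shrinks and the evidence for association becomes far weaker.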

That means that one of the primary tasks now facing the field of personal genomics is figuring out which of the tens of thousands of rare variants in a person's genome actually do anything. In practice that will require algorithms to predict function de novo. This is problematic enough for variants found in protein-coding regions, but at least the problem here is relatively well-defined. For variants within the 98% of the genome that doesn't directly code for protein the challenge is even more daunting: we have only the sketchiest idea of which of these regions are even functional, let alone what they actually do. Yet non-coding variants that alter gene expression levels could influence disease risk just as easily as protein-altering variants, so it will be crucial to come up with ways of assigning them a probability of being functionally relevant.

This paper is a small but important step towards this goal. Although the study doesn't help researchers determine precisely which variants alter gene expression, it does help to constrain the areas where they should be looking the hardest - both by highlighting the importance of location relative to gene structure, and also by confirming the association between evolutionary conservation levels and the probability of altering expression. When you're hunting for risk variants in a genome as big as ours, anything that narrows the search area is extremely helpful.

Exactly how we can transform constraints on the search space into information on new genes for common diseases is a topic I'll hopefully be covering in detail over the next couple of weeks.

Jean-Baptiste Veyrieras, Sridhar Kudaravalli, Su Yeon Kim, Emmanouil T. Dermitzakis, Yoav Gilad, Matthew Stephens, Jonathan K. Pritchard (2008). High-Resolution Mapping of Expression-QTLs Yields Insight into Human Gene Regulation. PLoS Genetics, 4 (10). DOI: 10.1371/journal.pgen.1000214
