The successes of genome-wide association studies (GWAS) in identifying genetic risk factors for common diseases have been heavily publicised in the mainstream media – barely a week goes by these days that we don’t hear about another genome scan that has identified new risk genes for diabetes, lupus, cardiac disease, or any of the other common ailments of Western civilisation.
Some of this publicity is well-founded: for the first time in human history, we have the power to identify the precise genetic differences between human beings that contribute to variation in disease susceptibility. If we can document all of the factors, both genetic and environmental, that result in common disease, we will be able to target early interventions to the individuals who are most susceptible. Every GWAS success brings us closer to the long-awaited era of personalised medicine.
But while the media trumpet the successes of genome scans, little attention is paid to their failures. The fact remains that despite the hundreds of millions of dollars spent on genome-wide association studies, most of the genetic variance in risk for most common diseases remains undiscovered. Indeed, some common diseases with a strong heritable component, such as bipolar disorder, have remained almost completely resistant to GWAS.
Where is this heritable risk hiding? It now seems likely that it’s lurking in a number of different places, with the fraction of the risk in each category varying from disease to disease. This post serves as a generic list of the dark regions of the genome currently inaccessible to GWAS, with some discussion of the techniques that will likely prove useful in mapping risk variants in these areas.
Alleles with small effect sizes
The problem: The ability to simultaneously examine hundreds of thousands of variants throughout the genome is both the strength and the weakness of the GWAS approach. The power of GWAS is that they provide a relatively unbiased examination of the entire genome for common risk variants; their weakness is that in doing so, they swamp the signal from true risk variants with statistical noise from the vast numbers of markers that aren’t associated with disease. To separate true signals from noise, researchers have to set an exceptionally high threshold that a marker needs to exceed before it is accepted as a likely disease-causing candidate. That reduces the problem of false positives, but it also means that any true disease markers with small effects are lost in the background noise.
The solution: This seems to be one problem that will need to be solved, at least to some extent, with sheer brute force. By increasing the numbers of samples in their disease and control groups, researchers will steadily dial down the statistical noise from non-associated markers until even disease genes with small effects stand out above the crowd. As the cost of genotyping (and sequencing) tumbles ever downward, such an approach will become more and more feasible; however, the logistical challenge of collecting large numbers of carefully-ascertained patients will always be a serious obstacle.
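To make the trade-off between threshold and sample size concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simple Bonferroni correction and a normal approximation to a two-sided test comparing allele frequencies between cases and controls. The function names, the 500,000-marker chip and the example odds ratio of 1.1 are illustrative assumptions rather than figures from any particular study.

```python
from statistics import NormalDist

def bonferroni_threshold(n_markers, alpha=0.05):
    """Per-marker p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_markers

def approx_power(n_cases, n_controls, control_freq, odds_ratio, p_threshold):
    """Approximate power to detect a risk allele with a two-sided allelic test.

    Uses a normal approximation to the difference in allele frequency between
    cases and controls, and ignores the negligible wrong-direction tail.
    """
    # Allele frequency in cases implied by the allelic odds ratio.
    case_odds = odds_ratio * control_freq / (1 - control_freq)
    case_freq = case_odds / (1 + case_odds)
    # Each person contributes two alleles, hence the factor of 2.
    se = (case_freq * (1 - case_freq) / (2 * n_cases)
          + control_freq * (1 - control_freq) / (2 * n_controls)) ** 0.5
    z_effect = abs(case_freq - control_freq) / se
    z_critical = NormalDist().inv_cdf(1 - p_threshold / 2)
    return NormalDist().cdf(z_effect - z_critical)

threshold = bonferroni_threshold(500_000)  # hypothetical chip with 500,000 markers
for n in (2_000, 10_000, 50_000):          # number of cases, and of controls
    power = approx_power(n, n, control_freq=0.3, odds_ratio=1.1,
                         p_threshold=threshold)
    print(f"{n:>6} cases + {n:>6} controls: power ~ {power:.3f}")
```

Under these assumptions, a variant with an odds ratio of 1.1 has well under a 1% chance of being detected with 2,000 cases and 2,000 controls, but is almost certain to be picked up with 50,000 of each.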
Rare variants
The problem: Current genome scan technology relies heavily on the “common disease, common variant” (CDCV) assumption, which states that the genetic risk for common disease is mostly attributable to a relatively small number of common genetic variants. This is largely an assumption of convenience: firstly, our catalogue of human genetic variation (built up by efforts such as the HapMap project) is largely restricted to common variants, since rare variants are much harder to identify; and secondly, chip-makers have restrictions on how many different SNPs they can analyse on a single chip, so the natural tendency has been to cram in the high-frequency variants that capture the largest proportion of genetic variation per probe. There is also some theoretical justification for this assumption based on models of human demographic history, but these models are themselves based on numerous assumptions, and the argument may not apply equally to all common human diseases.
In any case, everyone agrees that some non-trivial fraction of the genetic risk of common diseases will be the result of rare variants, and the latest results from GWAS in a variety of diseases have failed to provide unambiguous support for the CDCV hypothesis. Whatever the proportion of variance that turns out to be explained by rare variants, current GWAS technologies are essentially powerless to unravel it.
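One way to see why chips built from common SNPs struggle here: a marker can only stand in for a causal variant to the extent that the two are correlated, and that correlation (r²) has a hard ceiling when their allele frequencies differ. The sketch below computes this ceiling using the standard bound for two biallelic sites. The function name and the example frequencies are illustrative assumptions.

```python
def max_r_squared(freq_a, freq_b):
    """Upper bound on r^2 between two biallelic variants.

    freq_a and freq_b are the frequencies of the two alleles being paired
    (e.g. a rare risk allele and the chip SNP allele it travels with).
    The bound is reached when every copy of the rarer allele sits on a
    haplotype that also carries the more common one.
    """
    low, high = sorted((freq_a, freq_b))
    return (low * (1 - high)) / (high * (1 - low))

rare_variant_freq = 0.005  # hypothetical rare risk allele
for tag_freq in (0.005, 0.05, 0.20, 0.40):
    ceiling = max_r_squared(rare_variant_freq, tag_freq)
    print(f"tag allele frequency {tag_freq:.3f}: r^2 can be at most {ceiling:.3f}")
```

Since the sample size needed to detect an association through a tag scales roughly with 1/r², a ceiling of a few percent means that a rare variant is effectively invisible to a common tag SNP, however well placed that SNP is.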
The solution: Increasing sample sizes may help a little, but the fundamental problem is the inability of current chips to tag rare variation. Short-term, the solution will be higher-density SNP chips incorporating lower frequency variants identified by large-scale sequencing projects like the 1000 Genomes Project. However, such approaches will have diminishing returns: as chip-makers lower the frequency of the variants on their chips, the number of probes that will have to be added to capture a reasonable fraction of total genetic variation will increase exponentially, with each new probe adding only a minute increase in power.
Ultimately, the answer lies in large-scale sequencing, which will provide a complete catalogue of every variant in the genomes of both patients and controls. The problem here is not so much the sequencing itself – the costs of sequencing are currently plummeting due to massive investment in rapid sequencing technologies – but in the interpretation. Whole new analytical techniques will be required to convert these data into useful information.
Population differences
The problem: Over the last 50 to 100 thousand years modern humans have enthusiastically colonised much of the world’s landmass. Each wave of expansion has carried with it a fraction of the genetic variation of its ancestral population, along with a few novel variants acquired through mutation. In each new habitat encountered, natural selection has acted to increase the frequency of variants that provided an advantage, and cull those that were harmful, while the rest of the genome passively gained and lost genetic variation. The end result is a set of human populations that, while extremely similar across the genome as a whole, can carry quite different sets of genetic variants relevant to disease. In addition, the correlation between markers close together in the genome (known as linkage disequilibrium) can also differ between populations, so that a marker that is tightly correlated with a disease variant in one population may be only weakly associated in other groups.
These differences have profound implications for disease gene mapping efforts. As a result of this variation, markers that are associated with disease in one population can never be assumed to show the same associations in other human groups (this will be especially true for rare variants, of course). Current GWAS have been dominated by subjects of Western European ancestry, and our understanding of genetic risk variants in non-European populations is almost non-existent. In addition, these differences mean that mixing people with different ancestries together in a disease cohort can seriously confound the identification of causative genes – in certain situations, such mixing can greatly increase the risk of false positive findings.
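The false-positive problem is easy to reproduce with a little arithmetic. The sketch below takes a marker that has no effect on disease in either of two hypothetical populations, then pools them into a single cohort. The population sizes, prevalences and allele frequencies are made-up values chosen purely for illustration.

```python
def pooled_case_control_freqs(populations):
    """Allele frequency among cases and among controls in a pooled cohort.

    Each population is a dict with 'size', 'prevalence' and 'allele_freq'.
    Within every population the marker is unrelated to disease, so cases and
    controls from the same population share the same allele frequency.
    """
    n_cases = sum(p["size"] * p["prevalence"] for p in populations)
    n_controls = sum(p["size"] * (1 - p["prevalence"]) for p in populations)
    case_freq = sum(p["size"] * p["prevalence"] * p["allele_freq"]
                    for p in populations) / n_cases
    control_freq = sum(p["size"] * (1 - p["prevalence"]) * p["allele_freq"]
                       for p in populations) / n_controls
    return case_freq, control_freq

def allelic_odds_ratio(case_freq, control_freq):
    """Odds ratio for carrying the allele, cases versus controls."""
    return (case_freq / (1 - case_freq)) / (control_freq / (1 - control_freq))

# Two hypothetical populations differing in both prevalence and allele frequency.
pop_a = {"size": 10_000, "prevalence": 0.05, "allele_freq": 0.2}
pop_b = {"size": 10_000, "prevalence": 0.15, "allele_freq": 0.6}

for label, cohort in (("A alone", [pop_a]), ("B alone", [pop_b]),
                      ("A and B pooled", [pop_a, pop_b])):
    odds_ratio = allelic_odds_ratio(*pooled_case_control_freqs(cohort))
    print(f"{label}: odds ratio = {odds_ratio:.2f}")
```

Within each population the odds ratio is 1, as it should be for an unrelated marker. In the pooled cohort the same marker appears to raise risk substantially, simply because the population with higher prevalence also happens to carry the allele more often.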
The solution: For GWAS results to be universally applicable, they will need to be performed in cohorts from a wide range of populations. Data-sets such as the HapMap project, the Human Genome Diversity Panel and the powerful new 1000 Genomes Project will provide information about the patterns of genetic variation in diverse populations that is needed to design the assays for GWAS. A greater challenge will be collecting the large numbers of ancestry-homogeneous samples – both well-validated disease patients and healthy controls – required for GWAS approaches to be successful. This problem is likely to be particularly acute for African populations, where linkage disequilibrium is lower and genetic diversity much higher than in other regions (thus requiring larger numbers of markers and individuals to identify disease variants); and of course, in Africa and much of the rest of the world, local governments typically have much more pressing issues than genome scans to spend their limited health budgets on.
Epistatic interactions
The problem: Most current genetic approaches assume that genetic risk is additive – in other words, that the presence of two risk factors in an individual will increase risk by the sum of the two factors by themselves. However, there’s no reason to expect that this will always be the case. Epistatic interactions, in which combined risk is greater (or less) than the sum of the risk from individual genes, are difficult to identify with genome scans and even harder to untangle. If epistasis is strong, then just a few genes – each with a weak effect by itself, well below the threshold of a scan – could in concert explain a large chunk of genetic risk. Such a situation would be largely invisible to current approaches.
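A toy example may help to show how this could play out. The sketch below compares the risk expected for carriers of two variants under simple additivity with a hypothetical epistatic scenario, and then works out the relative risk that a single-marker scan would see for one variant when averaging over the other. All of the risk values and the carrier frequency are invented for illustration, and the two variants are assumed to be inherited independently.

```python
def additive_expectation(baseline, risk_a_only, risk_b_only):
    """Risk expected for carriers of both variants if the two effects simply add."""
    return baseline + (risk_a_only - baseline) + (risk_b_only - baseline)

def marginal_relative_risk(risks, partner_carrier_freq):
    """Relative risk for variant A as seen when variant B is ignored.

    risks maps (carries_a, carries_b) -> disease risk. Carriers of B are
    assumed to make up partner_carrier_freq of the population, independently
    of whether they carry A.
    """
    q = partner_carrier_freq
    risk_with_a = (1 - q) * risks[(True, False)] + q * risks[(True, True)]
    risk_without_a = (1 - q) * risks[(False, False)] + q * risks[(False, True)]
    return risk_with_a / risk_without_a

# Hypothetical risks: each variant alone barely matters, both together matter a lot.
risks = {
    (False, False): 0.010,
    (True, False): 0.012,
    (False, True): 0.012,
    (True, True): 0.100,
}

expected = additive_expectation(risks[(False, False)], risks[(True, False)],
                                risks[(False, True)])
print(f"risk for double carriers if additive: {expected:.3f}")
print(f"risk for double carriers in this scenario: {risks[(True, True)]:.3f}")
print(f"relative risk seen for A alone: {marginal_relative_risk(risks, 0.02):.2f}")
```

In this made-up scenario, double carriers have ten times the baseline risk, yet a scan that tests variant A on its own sees only the averaged, far more modest relative risk.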
The solution: Large sample sizes, and clever analytical techniques. I’m not going to attempt a more detailed answer as this area is well outside my knowledge zone – but fortunately, it’s an active area of research (see, for instance, the Epistasis Blog). I’d welcome comments on the likely scope of this problem, and on the methods that will be used to resolve it, from people who know more about epistasis than I do.
Copy number variation
The problem: One of the great surprises of the last five years has been the discovery of widespread, large-scale insertions and deletions of DNA, known as copy number variations (CNVs), in even healthy genomes. CNVs are now known to account for a substantial fraction of human genetic variation, and have been shown to play a role in variation in human gene expression and in human evolution. It seems highly likely that CNVs will be responsible for a non-trivial proportion of common disease risk.
However, our understanding of these variants is still in its infancy. The chips currently used in GWAS, which interrogate single base-pair variations between individuals known as SNPs, can be used to detect a small proportion of CNVs indirectly (by looking for distortions of signal intensity or inheritance patterns), and may effectively “tag” a fraction of the remainder (by using SNPs that are very close to the CNV, and therefore tend to be inherited along with it). However, the vast majority of copy number variation remains invisible to current GWAS technology.
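As a toy illustration of the signal-intensity approach, the sketch below compares each probe's intensity with a reference value and flags runs of consecutive probes whose log2 ratio drops well below zero, as would be expected if one of two copies were deleted. The function names, the threshold of -0.5, the three-probe minimum and the intensity values are all hypothetical; real CNV callers use considerably more sophisticated statistical models.

```python
import math

def log2_ratios(sample, reference):
    """Log2 ratio of sample to reference intensity for each probe."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

def find_deletion_runs(ratios, threshold=-0.5, min_probes=3):
    """Return (start, end) probe index pairs, end exclusive, for candidate deletions.

    A candidate is a run of at least min_probes consecutive probes whose
    log2 ratio falls below threshold. Shorter dips are treated as noise.
    """
    runs = []
    start = None
    for i, ratio in enumerate(ratios):
        if ratio < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_probes:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last probe.
    if start is not None and len(ratios) - start >= min_probes:
        runs.append((start, len(ratios)))
    return runs

# Made-up intensities for ten neighbouring probes along a chromosome.
reference = [1000] * 10
sample = [1000, 980, 510, 490, 505, 495, 1010, 990, 480, 1000]

ratios = log2_ratios(sample, reference)
print(find_deletion_runs(ratios))
```

Here probes 2 to 5 are flagged as a candidate single-copy deletion, while the isolated dip at probe 8 is ignored because a single low probe is more likely to be noise than a genuine CNV.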
The solution: High-resolution tiling arrays – chips containing millions of probes, each of which binds to a small region of the genome – can be used to explore CNVs in some areas of the genome, but they break down for the large fraction of the genome containing repetitive elements. Ultimately, the complete detection of CNVs from patients and controls will require whole-genome sequencing, preferably using methods with much longer read lengths than the current crop of rapid sequencing technologies.
Epigenetic inheritance
The problem: Not all inherited information is carried in the DNA sequence of the genome; a child also receives “epigenetic” information from its parents in the form of chemical modifications of DNA that can alter the expression of genes – and thus physical traits – without changing the sequence. Although epigenetic inheritance is known to occur, the degree to which it influences human physical variation and disease risk is almost entirely unknown.
All existing technologies used in GWAS are based on DNA sequence, and thus don’t detect epigenetic variation. Such variation is invisible even to full-genome sequencing.
The solution: It first needs to be established that epigenetically inherited variations do actually contribute a non-trivial fraction of human disease risk. If so, techniques currently being developed to identify these variants in a high-throughput fashion could be used to perform EWAS (epigenome-wide association studies).
Disease heterogeneity
The problem: Some “diseases” are actually simply collections of symptoms, which may stem from multiple, distinct genetic causes. Lumping patients with fundamentally different conditions into a single patient cohort for a GWAS is a recipe for failure: even if there are strong genetic risk factors for each one of the separate conditions, each of these will be drowned out by the noise from the other, unrelated diseases. The problem is that for some diseases – particularly mental illnesses, where causation lurks deep within the complex and poorly-understood human brain – the knowledge and tools required to separate patients into distinct sub-categories simply may not exist yet.
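The dilution effect can be put into rough numbers. The sketch below assumes a variant that raises risk for only one sub-type of a lumped-together "disease" and has no effect on the others, then computes the odds ratio a study would observe if that sub-type made up only a fraction of the case cohort. The function name and all of the example values are illustrative assumptions.

```python
def diluted_odds_ratio(control_freq, subtype_odds_ratio, subtype_fraction):
    """Odds ratio observed when only some cases have the sub-type the variant affects.

    control_freq is the allele frequency in controls (and in cases with
    unrelated sub-types); subtype_fraction is the share of cases that have
    the affected sub-type.
    """
    # Allele frequency among cases with the affected sub-type.
    subtype_odds = subtype_odds_ratio * control_freq / (1 - control_freq)
    subtype_freq = subtype_odds / (1 + subtype_odds)
    # Cases with other sub-types look just like controls at this variant.
    case_freq = (subtype_fraction * subtype_freq
                 + (1 - subtype_fraction) * control_freq)
    return (case_freq / (1 - case_freq)) / (control_freq / (1 - control_freq))

for fraction in (1.0, 0.5, 0.25, 0.1):
    observed = diluted_odds_ratio(control_freq=0.2, subtype_odds_ratio=2.0,
                                  subtype_fraction=fraction)
    print(f"sub-type is {fraction:.0%} of cases: observed odds ratio = {observed:.2f}")
```

With these numbers, a strong risk factor with an odds ratio of 2 in its own sub-type looks like a weak one with an odds ratio of about 1.2 when that sub-type accounts for a quarter of the cohort, pushing it back into the territory of small effects that need very large samples to detect.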
The solution: The geneticists can’t fix this one – it will take a combined effort from clinicians and medical researchers to break down complex diseases into useful diagnostic categories, which can then each be subjected to separate genetic analysis. In the cancer arena, conditions previously lumped together as one entity have now been separated using new technologies such as gene expression arrays; similar approaches will no doubt prove fruitful in a range of other diseases, although the inaccessibility of brain tissue will make it more difficult to apply such approaches to mental illness.
The future of genetic association studies
Current chip-based technologies for genome-wide analysis, while having some success in identifying the lowest-hanging genetic fruit for many common diseases, seem to have already started to run up against barriers that are unlikely to be overcome by simply increasing sample sizes. These technologies should really be regarded as little more than a place-holder for whole-genome sequencing, which should become affordable enough to use for large-scale association studies within 3-5 years.
The application of cheap, rapid sequencing technology is likely to generate a harvest of new disease genes that far exceeds the yield of current GWAS, by providing simultaneous access to both the rare variants and the copy number variations that are inaccessible to current chip-based approaches. However, building a more complete catalogue of the heritable variants that drive common disease risk will require more than just cheap sequencing: it will also take advances in clinical diagnostics to better sub-categorise patients into homogeneous groups, as well as new and powerful analytical approaches to cope with the torrent of sequence data, and to efficiently identify epistatic interactions between disease variants. To have any chance of picking out variants of small effect from whole-genome sequencing data, sample sizes will have to be enormous – massive cohorts currently being assembled, such as the 500,000-person UK Biobank and a similar NIH-funded study currently in the works, will provide essential raw material for the selection of participants. Naturally, to be applicable to humanity as a whole, cohorts will need to be gathered separately from many different human populations.
Finally, epigenetic variation remains a wild-card of uncertain significance, which will need to be tackled with a different set of high-throughput technologies (although it’s likely that many of these will feed on advances in high-throughput sequencing).
Although I probably sound pretty negative about GWAS, I want to emphasise that the current problems are the result of technological limitations that will soon disappear. Barring global catastrophe, within the lifetimes of most of those reading this post we will have a near-complete catalogue of the genetic variants influencing the risk of most of the common diseases that plague the industrialised world (and, hopefully, many of those that plague the rest of humanity). Together with parallel advances in medical science, this catalogue will provide an unprecedented ability to predict, treat and potentially completely eliminate a host of common diseases. It will also bring social and ethical challenges of unprecedented magnitude – but that’s a topic for another post…