Why disease associations outside of genes are not a bad thing

This critique of genome-wide association studies by Jon McClellan and Mary-Claire King in Cell is the latest salvo in a prolonged backlash against genome-wide association studies (GWAS).

I hope to have more on the McClellan and King paper shortly, but in the meantime I will point you to a positive take on the paper by Stephen Turner (read the comments section), and an excellent response to one of M&K's more bizarre criticisms by p-ter at Gene Expression. The claim in question is that the tendency of GWAS to find disease associations outside of protein-coding genes is somehow a problem; but, as p-ter notes, there are perfectly plausible reasons for disease risk variants to be found in non-coding regions.
Indeed, I think most of us working in genomics have seen the proliferation of non-coding hits in GWAS as a positive, in that it seems to be teaching us something new and unexpected about the underlying biology of human variation.
Anyway, there's plenty more to be said about the M&K paper - and hopefully I'll have a guest post up in the next day or so inflicting some well-deserved shredding.

Good - thanks for pointing to the other blogs as well. The MCK paper, more of an opinion piece, seemed shallow in its evidence. It's an important subject and you can't blow away GWAS with just a couple of lines about a couple of examples (e.g. autism and stratification) - that's journalism and sensationalism.

This piece seems to commit three classic logical fallacies:

- The argument from incredulity. I'm pretty sure mutations in BRCA1 had "no demonstrated biological significance" until they were found to contribute to BC risk and followed up.

- The false generalization. Population structure is a problem, therefore all GWAS results are incorrect. This of course ignores replication data, which presumably is biased in an identical way, and the TDT results which agree with case/control findings.

[The authors propose a testable hypothesis to deal with this though: they suggest that all associated common variants are hyper-variable, ergo cryptic stratification. It is a relatively simple matter to look at such SNPs' frequency variation across populations (and measures derived from this, such as F_st) to test the hypervariable hypothesis.]
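The check suggested above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the simple Wright definition of F_st with equally weighted populations (not necessarily the estimator behind any published F_st track), and the function names and any input frequencies are hypothetical.

```python
def wright_fst(freqs):
    """Wright's F_st for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    Assumption: populations are weighted equally (a simplification).
    """
    if not freqs:
        raise ValueError("need at least one population frequency")
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # allele fixed everywhere: no differentiation to measure
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def fraction_more_divergent(target_fst, genome_fsts):
    """Fraction of SNPs in a genome-wide set with F_st above the target."""
    if not genome_fsts:
        raise ValueError("need a non-empty genome-wide set")
    return sum(1 for f in genome_fsts if f > target_fst) / len(genome_fsts)
```

If the hypervariable hypothesis were right, associated SNPs should sit in the upper tail of the genome-wide distribution, so `fraction_more_divergent` should return small values for them.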

- Special pleading. Distant regulators, sequencing will show rare variants driving phenotypic heterogeneity. "Whole-genome sequencing strategies detect hundreds of thousands of rare variants per individual": of course, since they are unique to this individual, we don't have to worry about error rates, now do we? IIRC, non-proofreading polymerases have an error rate of ~1e-4.

I'm struggling with this:

First I was annoyed - don't these people know how much effort has been put into trying to minimise biases caused by population structure in GWAS?

The answer seems to be yes, but so what:

Investigators devote a great deal of effort to the problem of population stratification. Subjects who are deemed outliers based on substructure analysis are generally removed from GWAS. However, hypervariable polymorphisms remain vulnerable to stratification even after this adjustment. Strategies to address this problem include using family designs to compare genotypes of cases to their healthy relatives and removing hypervariable SNPs from analyses.

OK, so maybe they haven't noticed (as Chris says) that replication of GWAS findings is often done in family collections.

The main argument, though, is a straw man - if we say a SNP is associated with a disease, do we mean that it is functional, or simply that there is a statistical relationship?

The authors assume that because some GWAS authors mean the former, we all do, and we are (therefore) all wrong.

However, most of us have been taught to say the latter. It shouldn't matter if a risk SNP is a marker of the disease, or of a genuinely hidden sub-population with a raised incidence of the disease. Either way, we've found the people - in this group of samples - with raised risk, and that is what an association is.

But what to do next? According to the authors, with the advent of "cheap" widespread exome/whole-genome sequencing, you don't really need a hypothesis to plan an experiment:

Genome-wide screening for mutations remains the most effective and unbiased way to discover genes involved in complex illnesses.

However, previously the only sensible hypothesis has been - let's assume this associated risk variant is also connected to something functional, but of unknown function. Strangely - as others have pointed out on Stephen Turner's site - this follow-up strategy has sometimes worked ...

I am a long-term reader of Genetic Future (which is even set up on my iGoogle homepage) and I enjoy reading the blog and the corresponding comments, but I have never posted comments here. Quite a few people have now mentioned the McClellan et al paper to me, along with the related Internet posts about it (including those at Genetic Future). The discussion of at least three diseases in the paper (hearing loss, SCA and autism) cited some of my published papers, and I therefore decided to post my comments on the Internet to set the record straight. Although I whole-heartedly agree that rare variants play a substantial role in human diseases, I also think that the section on GWAS reflects misunderstandings of the concept of GWAS, ignorance of standard practices in GWAS, and misinterpretation of published primary research data, and as a result is misinforming the general readership of Cell. These issues need to be rectified for the good of the scientific community, and for the healthy development of the methodology and practice of human genetic research.

For impatient readers, these are the bullet points: (1) GWAS interrogate disease loci through linkage disequilibrium, so the lack of known biological function for GWAS SNPs does not justify the attack against GWAS by McClellan et al; (2) Methods for adjusting for population stratification are well established in the GWAS community; it is not valid to explain most GWAS signals (with odds ratios less than 2) by stratification, especially if a family-based study design is used (as in the autism GWAS); (3) McClellan et al used rs4307059 (from the autism GWAS) as a "particularly dramatic" example of stratification because its frequency varies across Europe and it is monoallelic in Africa, which is not scientifically or statistically justified. In fact, it is the nature of SNPs to have differing allele frequencies across populations, and almost half of the SNPs on the Illumina array have higher Fst population divergence values than rs4307059 (that is, almost half of the SNPs are more variable than rs4307059 across human populations). Below I elaborate on these points for interested readers.

(1) McClellan et al use the fact that most SNPs detected in GWAS are in intergenic regions to question the utility and reliability of GWAS, and raise a serious question: "How did genome-wide association studies come to be populated by risk variants with no known function?". In fact, GWAS do not attempt to identify functional SNPs, but rather to identify the approximate location of loci that harbor disease variants. This is possible due to the extensive linkage disequilibrium (LD) between segregating sites in a given human population. Most SNPs on SNP arrays have unknown biological function, simply because most SNPs in HapMap are outside of coding regions and because manufacturers of SNP arrays usually do not select SNPs by known function. Unfortunately, this fact may not be well known outside of the GWAS community, for example among most readers of the journal Cell. McClellan et al did mention LD, but they did not recognize that GWAS do not attempt to interrogate causal variants in the first place. More interestingly, they discussed the SCA GWAS and hearing loss GWAS that I published; the hits in both GWAS are actually outside of but close to the causal genes (HBB and GJB2), yet they tag exonic variants in the causal genes, representing two particularly vivid and classic examples of how GWAS work through LD. It is unclear how McClellan et al can discuss these two examples extensively while ignoring the basic fact that both non-coding hits faithfully tag the causal variants in the causal genes through the magic of LD. For readers not familiar with GWAS, I also need to emphasize that GWAS variants are typically referred to as "risk variants" only because of the convention of the published literature, not because they are the actual functional variants that confer risk.
Contrary to what some readers may think based on McClellan et al, the fact that 100% of Africans carry a risk allele does not suggest that all subjects of African descent are predisposed to risk; it merely suggests that LD patterns in European and African populations at a locus are different. One cannot interpret GWAS results without acknowledging these basic facts.
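To make the tagging idea concrete, here is a minimal sketch of the usual r^2 measure of LD between a genotyped marker and an ungenotyped causal variant, computed from haplotype frequencies. The function name and arguments are illustrative assumptions, not taken from any of the papers discussed.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation (r^2) between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (marker)
          and allele B (causal variant) together.
    p_a, p_b: frequencies of allele A and allele B.
    """
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d * d / denom
```

An r^2 near 1 means the marker is an almost perfect proxy for the causal variant in that population; the same pair of sites can have a very different r^2 in another population, which is why a marker allele's frequency elsewhere says little about risk there.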

(2) McClellan et al erroneously attributed many published GWAS hits to population stratification, as if GWAS used the same strategies as candidate gene association studies. Without any scientific support, they even claimed that "an odds ratio of 3.0, or even of 2.0 depending on population allele frequencies" would be robust enough to be interrogated in GWAS. In fact, the beauty of whole-genome SNP data is that inflation of test statistics due to population substructure can be identified and adjusted for. Populations do not differ at one or two SNPs; they differ at many loci, which explains why whole-genome data help identify stratification, and several recent studies have already shown how extremely fine-scale sub-populations in Europe can be separated using whole-genome data. The GWAS community has established methods to deal with population stratification, and it is uncontroversial in the field that these methods are fairly effective for common variants. There are certainly some challenges in analyzing rare variants or recently admixed populations, and these are research topics that we are actively studying. McClellan et al failed to inform readers of the standard practices of genomic control, EigenStrat, multi-dimensional scaling or the many dozens of other approaches for addressing stratification, which are now commonly used in case/control GWAS. Furthermore, a family-based study design in GWAS has the advantage of protecting against stratification, which should be emphasized to readers. For example, McClellan et al attack our autism paper as a false positive due to population stratification, but our paper is largely driven and replicated by family-based cohorts, not case/control cohorts. Therefore, their general claim lacks scientific support, ignores massive amounts of work by the statistical genetics community in developing stratification adjustment methods, and reflects unrealistic speculation and unfamiliarity with standard GWAS practices.
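As an illustration of the simplest of these methods, genomic control, the sketch below estimates the inflation factor lambda from genome-wide 1-df chi-square statistics and rescales them. It is a minimal version under stated assumptions (1-df tests, median-based lambda, no deflation when lambda is below 1); the function names are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Inflation factor: observed median statistic over the expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda; leave them alone if lambda <= 1."""
    lam = max(1.0, genomic_control_lambda(chi2_stats))
    return [x / lam for x in chi2_stats]
```

Because stratification shifts statistics at many loci at once, it shows up as a lambda above 1 across the genome rather than as a single isolated hit.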

(3) McClellan et al mistakenly treated GWAS hits as "false positives" if their allele frequencies vary across European populations or HapMap populations. Allele frequency variation for ANY (I mean it, ANY!) SNP across populations should not be surprising to researchers with substantial GWAS knowledge. It is the very nature of ANY SNP to have variable allele frequencies across human populations, so that Asians, Caucasians and Africans differ from each other. I have no idea what McClellan et al are surprised about, unless they thought that most SNPs should have similar allele frequencies in all populations. Specifically, they treated the SNP rs4307059, reported by us to be associated with autism, as a "particularly dramatic example of the perils of cryptic population stratification". Their reasoning on "stratification" is that the frequency of the proposed risk variant varies from 0.21 to 0.77 across European populations and that it is monomorphic in African populations. In reality, the allele frequency of rs4307059 is fairly consistent among large cohorts of European Americans (MAF=39%), WTCCC (MAF=38%), POPRES British (MAF=39%) and POPRES Spanish (MAF=37%). In the HGDP data, I did confirm that the allele frequency differs in Tuscany (MAF=75% in 7 samples, yes you read it right, SEVEN) and in Orcadians (MAF=25% in 15 samples), but readers should be aware that the precision of a frequency estimate depends on the sample size (seriously, mathematically, what would you expect from 7 or 15 samples, and how much do these two populations contribute to genes in European Americans?).
Furthermore, assuming that the allele frequency measures are indeed accurate, if we want to do science rigorously we need appropriate control experiments, so let us compare this SNP with others in the same genomic region: there is no evidence of increased population differentiation for this particular SNP in the 2Mb genomic region across human populations (chr5:25500000..26499999 in the HGDP browser http://hgdp.uchicago.edu/cgi-bin/gbrowse/HGDP/). Finally, if we examine the SNP in the context of the whole genome, based on the HGDP browser, we can see that 44% of SNPs (-log(0.44)/log(10)=0.35 for rs4307059 in the "Fst" track, raw data http://hgdp.uchicago.edu/Browser_tracks/FST/) on the Illumina array have more extreme Fst values than this SNP, so nearly half of the SNPs show stronger population divergence than this SNP. One cannot just take a random SNP from the MIDDLE of a ranked list and claim it as a "particularly striking" example of population stratification. Any such claim needs to be made in the context of a comparative analysis with other SNPs; otherwise it is not scientifically rigorous practice and serves solely to misinform readers outside of the field.
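The sample-size point can be made quantitative with a rough sketch. The code below computes a Wilson score interval for an allele frequency; the choice of the Wilson interval, the 95% level, and the assumption of two chromosomes per diploid sample are mine, for illustration only.

```python
import math


def allele_freq_interval(freq, n_samples, z=1.96):
    """Approximate 95% Wilson score interval for an allele frequency.

    freq: observed allele frequency (0..1).
    n_samples: number of diploid individuals (assumed 2 chromosomes each).
    """
    if n_samples <= 0 or not 0.0 <= freq <= 1.0:
        raise ValueError("need n_samples > 0 and 0 <= freq <= 1")
    n = 2 * n_samples  # number of chromosomes sampled
    z2 = z * z
    scale = 1 + z2 / n
    center = (freq + z2 / (2 * n)) / scale
    half = z * math.sqrt(freq * (1 - freq) / n + z2 / (4 * n * n)) / scale
    return center - half, center + half


# Seven individuals versus a cohort of a thousand.
small = allele_freq_interval(0.75, 7)
large = allele_freq_interval(0.39, 1000)
```

With 7 individuals the interval spans a large part of the 0 to 1 range, whereas with a thousand it is only a few percentage points wide, so an apparent difference between tiny population samples carries little weight on its own.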

(4) McClellan et al misinterpreted the hearing loss GWAS and SCA GWAS that we published in PLoS Biology. Interestingly, their interpretation of the primary research data presented in our paper is almost the opposite of ours: our original purpose was to demonstrate how rare variants may contribute to human diseases (and may show up in GWAS through LD with common SNPs on Illumina arrays), so our paper should really be read as supporting the arguments for studying rare variants in their paper. For readers, I need to clarify that SCA is a classic example of heterozygote advantage in any genetics textbook, and our study demonstrates how rare alleles under balancing selection can show up in GWAS. On the other hand, hearing loss is known to be caused by many genes but the major cause is GJB2 mutation, so the GWAS demonstrates that moderately rare alleles (MAF=1.2%) can be picked up by GWAS without balancing selection. I simply do not understand what they are trying to get at with "had inherited hearing loss been investigated in a region where it is more common (e.g., in the Middle East), ...", as any GWAS should be focused on a specific ethnic group; I cannot just combine Caucasians with Middle Eastern people, as this would of course dilute the signal in the GWAS. Why would I even bother to apply GWAS "in heterogeneous populations of common diseases" at all, as suggested by McClellan et al, when the very power of GWAS comes from the examination of LD? I do not understand how they can take exactly the same results and arrive at a drastically different interpretation of the data.

(5) McClellan et al's interpretation of the autism locus is wrong. McClellan et al used this as an example of a "false positive" without any valid scientific evidence (differences in allele frequencies between Tuscans and Africans do NOT suggest a false positive in European Americans!). Another study (Weiss et al) cited by McClellan et al was not able to garner evidence for this SNP, but that study has a very small non-overlapping sample size and therefore little power to "replicate" loci with moderate effect sizes. Furthermore, Weiss et al used a family-based association test (the TDT), so there is no comparison of case/control allele frequencies as mentioned by McClellan et al. I seriously doubt whether McClellan et al actually read either paper carefully; otherwise I do not see where such a gross misinterpretation of primary research data could come from. Due to power issues and sample comparability issues, Weiss and Arking (both are nice people whom I know) faithfully described their research results in the paper without comment, yet McClellan et al misinterpret these primary results without scientific support and attach a "false positive" label that completely misleads the scientific community. On the other hand, McClellan et al failed to mention another companion study identifying this same locus purely in family-based cohorts (Annals of Human Genetics). In addition, a paper in press shows that the SNP also functions as a quantitative trait locus for autistic traits in ~8000 children born in the same year in a single UK city, which pretty much blows away any concern about stratification in case/control studies. For me, this is compelling evidence that population stratification does not explain the signal, though I think that functional studies are certainly necessary to identify causal variants and to study their roles.
In summary, their criticism of the autism locus lacks any rigorous scientific support whatsoever, and can probably be better explained by non-scientific reasons.
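For readers unfamiliar with the TDT mentioned above, here is a minimal sketch of the basic statistic, which compares transmissions and non-transmissions of an allele from heterozygous parents to affected children. The counts passed in are hypothetical; the p-value uses the 1-df chi-square approximation.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Basic TDT (McNemar-type) chi-square with 1 degree of freedom.

    transmitted / untransmitted: number of heterozygous parents who did /
    did not pass the allele of interest to an affected child.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(stat / 2))
```

Because each parent's untransmitted allele acts as the matched control, the comparison is made within families, which is why this design is protected against population stratification and involves no case/control allele frequencies at all.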

I will send a shortened version of my comments to Cell. I cannot predict what the outcome of this appeal will be, but I would appreciate comments from readers of this post and I will try to address them. I wonder what the appropriate balance is between academic freedom and scientific responsibility when researchers comment on subjects outside of their expertise in the absence of rigorous scientific support; I also wonder what the appropriate standard of basic fact checking is for journals publishing especially strong claims, even in non-research articles (essays/commentary/reviews), and what the appropriate response is for well-respected journals to recognize and rectify these mistakes. Let us wait and see.