Guest post: Neil Walker on the curious case of the schizophrenia GWAS

Purcell et al. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder Nature DOI: 10.1038/nature08185



Neil Walker has been doing a spectacular job of serving up useful information in the comments recently, so I asked him to write the first ever guest post on Genetic Future - something that (as I will be announcing shortly) I intend to do fairly regularly over the next couple of months.

The topic is a paper that has created a rather perplexed buzz recently in the complex disease genetics community: the genome-wide association study (GWAS) for schizophrenia published in Nature last week. This paper takes a novel and (at first glance) rather alarming approach to exploring the genetic basis of this complex disease, so I asked Neil to provide some insight into what he thought about the approach used in this paper and what it means for complex disease genetics.

Without further comment, I present Neil's post:



Common polygenic variation contributes to risk of schizophrenia and bipolar disorder

If you've not read this recent Nature paper from the International Schizophrenia Consortium, a quick summary of findings is available (from The Mental Health Social Worker), as Schizophrenia and Bipolar Disorder Share Genetic Roots.


Perhaps it went something like this?

[Scene: Anonymous (poorly furnished, chaotic and paper-filled) academic offices round the world. A teleconference at some ungodly hour to accommodate the Swedes, British, Irish, Portuguese and Americans. Groups gathered around speaker phones, waiting for copies of slides to come in by email.]

Chair: Right, welcome to the umpteenth meeting of the International Schizophrenia Consortium Management Committee [1], and if we're all here, let's get straight to the standing agenda item: do we have anything to show for our $5 million GWAS [2], Shaun? [3]

Shaun Purcell: Well we got 2 hits at genome-wide significance [4] - one in some new gene I've never heard of on chromosome 22 - MYO18B - and a whole bunch in the MHC [5]

All: [groans]

Chair: But we did at least replicate the previously identified regions?

SP: Not entirely [6]

Chair: What about CNVs? Imputation?

SP: Nothing new [7]

Chair: OK. Well, the preliminary data suggested as much. You'll all recall that you asked me to get in touch with the Molecular Genetics of Schizophrenia and SGENE consortia [8] to see if they'd play ball in a meta-analysis ...

All: [groans]

Chair: ... and while we couldn't agree to share the raw data, we got hold of summary data for SNPs showing a trend - 10^-3. So Shaun, with 8,000 cases and 19,000 controls, there must be something, or we can kiss goodbye to the genetics of schizophrenia ...

SP: Nothing. The MHC SNP barely replicated. But let's not panic - I've got a cracking new idea ...



And that idea is this paper.

After a page of preliminaries (as crudely caricatured above), the paper kicks off:

Our second approach was to evaluate whether common variants have an important role en masse, directly testing the classic theory of polygenic inheritance, previously hypothesized to apply to schizophrenia. Although our GWAS analysis did not identify a large number of strongly associated loci, there could still be potentially thousands of very small individual effects that collectively account for a substantial proportion of variation in risk. We summarized variation across nominally associated loci into quantitative scores, and related the scores to disease state in independent samples.

The basic thesis is this: if you take the good quality independent SNPs in a GWAS, and divide them in 2 on the basis of association results in half of your GWAS sample, then if the more associated SNPs in that half are also the more associated SNPs in the other half of the sample, that means something.
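For anyone who wants to poke at the idea, here is a minimal sketch of a split-half score test on simulated data. It is not the paper's pipeline: the allele frequency of 0.5, the liability threshold, the effect sizes and every function name are assumptions made purely for illustration, and the weights are simple case-minus-control allele-count gaps rather than whatever the authors actually used.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(n_people, effects, rng):
    """Simulate allele counts (0, 1 or 2) per SNP and a case/control label per person."""
    genotypes = [[rng.getrandbits(1) + rng.getrandbits(1) for _ in effects]
                 for _ in range(n_people)]
    # Liability = sum of small per-SNP effects plus unit-variance noise (an assumption).
    liability = [sum(e * g for e, g in zip(effects, row)) + rng.gauss(0, 1)
                 for row in genotypes]
    cutoff = sorted(liability)[n_people // 2]
    is_case = [x >= cutoff for x in liability]  # upper half of liability = cases
    return genotypes, is_case


def allele_count_gap(genotypes, is_case, snp):
    """Mean allele count in cases minus mean allele count in controls for one SNP."""
    cases = [row[snp] for row, c in zip(genotypes, is_case) if c]
    controls = [row[snp] for row, c in zip(genotypes, is_case) if not c]
    return mean(cases) - mean(controls)


def split_half_score_gap(genotypes, is_case):
    """Pick and weight SNPs in one half of the sample, then score the other half."""
    half = len(genotypes) // 2
    disc_g, disc_c = genotypes[:half], is_case[:half]
    targ_g, targ_c = genotypes[half:], is_case[half:]
    n_snps = len(genotypes[0])
    weights = [allele_count_gap(disc_g, disc_c, j) for j in range(n_snps)]
    # Keep the "more associated" half of the SNPs, judged on the discovery half only.
    ranked = sorted(range(n_snps), key=lambda j: abs(weights[j]), reverse=True)
    top = ranked[:n_snps // 2]
    scores = [sum(weights[j] * row[j] for j in top) for row in targ_g]
    case_scores = [s for s, c in zip(scores, targ_c) if c]
    control_scores = [s for s, c in zip(scores, targ_c) if not c]
    return mean(case_scores) - mean(control_scores)


if __name__ == "__main__":
    rng = random.Random(0)
    polygenic = [0.3] * 25 + [0.0] * 75   # 25 small true effects among 100 SNPs
    print("score gap, polygenic:", split_half_score_gap(*simulate(1000, polygenic, rng)))
    print("score gap, null:     ", split_half_score_gap(*simulate(1000, [0.0] * 100, rng)))
```

If many small effects are real, the score built in one half should be higher in cases than in controls in the other half; with no true effects the gap should hover around zero. The sketch deliberately leaves out stratification and genotyping artefacts, which are exactly the things that could fake such a gap in real data.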

The rest of the paper, and the exemplary 46 pages of Supplementary Information - while reading a little like an extended advert for Shaun Purcell's PLINK - are an attempt to convince first the authors, then the Management Committee, then the reviewers, and now us, that this is not just a nasty fudge - making a virtue of necessity - but a genuine finding.

Other bloggers/journalists were (much) quicker off the mark:

  • Whalefall's A Schizophrenia Gene Debacle
    From a journalistic perspective, there are two possible stories here. First, the straight story: schizophrenia is extraordinarily complicated, and genetics can't now explain it in any useful way. And two, the contextual angle: for years, the public has expected, and scientists have sometimes promised, that genetics would illuminate this disease - and it failed, just as it has for nearly every disease.

    which quotes:

  • Nicholas Wade's Hoopla, and Disappointment, in Schizophrenia Research
    The journal Nature held a big press conference in London Wednesday, at the World Conference of Science Journalists, to unveil three large studies of the genetics of schizophrenia. Press releases from five American and European institutions celebrated the findings, one using epithets like "landmark," "major step forward," and "real scientific breakthrough." It was the kind of hoopla you'd expect for an actual scientific advance.

    It seems to me the reports represent more of a historic defeat, a Pearl Harbor of schizophrenia research.

    The defeat points solely to the daunting nature of the adversary, not to any failing on the part of the researchers, who were using the most advanced tools available. Still, who is helped by dressing up a severely disappointing setback as a "major step forward"?

    The principal news from the three studies is that schizophrenia is caused by a very large number of errant genes, not a manageable and meaningful handful.

    [...]

    In the last few years gene hunters in one common disease after another have turned up a few causative variant genes, after vast effort, but the variants generally account for a small percentage of the overall burden of illness. With most common diseases, it turns out, the disease is caused not by ten very common variant genes but by 10,000 relatively rare ones.

    (This last being David B. Goldstein's view)

    which is in turn quoted by:

  • Fists Full of Science's Missed opportunity - not debacle - in bogus schizophrenia genes coverage
    Given that researchers had been looking for meatier schizophrenia genes for years and years without finding anything substantial, this was to be expected, especially if you're the kind of person who questions whether there are very many genes "for" anything.

    It then drifts off into being another story about poor science coverage - we've had a lot of those lately - although it shows signs of life in the what's-it-all-for stakes, which in turn shows up in:

  • Wiring The Brain's Hot new in the genetics of schizophrenia

    which basically suggests geneticists are looking in the wrong place, and it's all about rare variants


Why add to this weight of verbiage?

Well, first, Daniel and I just didn't believe the result at all - and it's taken detailed reading to get beyond that.

Second, the paper ends, bullishly:

We identified fewer unambiguously associated variants than studies of some non-psychiatric diseases of comparable size. Nonetheless, for other diseases replicated variants typically account for only a modest fraction of risk. The nature of this "missing heritability" is a general problem now faced by complex disease geneticists. For schizophrenia, our data point to a genetic architecture that includes many common variants of small effect. The extent to which similar models characterize genetic variation within and across other complex diseases remains to be investigated.

which deserves some response - even if only "I don't think we'll bother!"

So, a few points:

1. "polygenic" doesn't mean anything interesting. Its usage here comes from the pre-history of molecular biology (1967), when researchers began to fail to find single gene causes of heritable diseases and wanted a new name for their disappointment. It is used in contrast with "monogenic" - but no-one believes other complex, multifactorial diseases are driven by single, highly penetrant genes, and not many people (Goldstein apart) think that complex disorders are driven by a series of rare highly penetrant variants. We'll know soon enough - see prediction 1.

2. News reports that wrote this up as "30,000 SNPs are associated with schizophrenia" are wide of the mark, and specifically rejected by the paper:

We use the term score, instead of risk, as we cannot differentiate the minority of true risk alleles from unassociated variants.

- this is just top half vs bottom half of associations.

3. the most obvious reason why 2 halves of a GWAS would look alike in their associations, is that the case and control samples, or case and control subjects differ in some way other than by the disease status of the subjects. In particular DNA quality and/or preparation could lead to the genotypes being scored differently - a particularly pernicious problem is non-random missingness - or the case and control subjects could be poorly matched - by geography or ethnicity.

With regard to quality control, the Supplementary Information (section S1) goes into elaborate detail as to what was done to weed out suspect samples and SNPs. The attempt to merge data across 3 generations of Affymetrix chip has led the SNP QC to be too complicated to be entirely convincing, but when it came to generating the master set of SNPs used in the big "score" analysis, extra cutoffs were used - genotyping call rate >= 99% (which meant the SNPs needed to be on all chip types), MAF >= 2% and low LD (r^2 <= 0.25 in a 200-SNP sliding window) - which is all fine.
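As a rough illustration of what those three SNP cutoffs do, here is a small sketch. It assumes genotypes are stored as allele counts (0, 1, 2, or None for a missing call), and the pruning step is a simple greedy pass over the previous 200 SNPs, which is a stand-in for, not a copy of, whatever sliding-window procedure the authors ran. All function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two allele-count vectors, skipping missing calls."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def passes_basic_qc(calls, min_call_rate=0.99, min_maf=0.02):
    """Apply the call-rate and minor-allele-frequency cutoffs to one SNP."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return False
    call_rate = len(observed) / len(calls)
    freq = sum(observed) / (2 * len(observed))
    maf = min(freq, 1 - freq)
    return call_rate >= min_call_rate and maf >= min_maf


def select_snps(snps, min_call_rate=0.99, min_maf=0.02, max_r2=0.25, window=200):
    """Return indices of SNPs that pass QC and are in low LD with already-kept neighbours."""
    kept = []
    for i, calls in enumerate(snps):
        if not passes_basic_qc(calls, min_call_rate, min_maf):
            continue
        # Only compare against kept SNPs within `window` positions upstream.
        neighbours = [j for j in kept if i - j < window]
        if all(r_squared(calls, snps[j]) <= max_r2 for j in neighbours):
            kept.append(i)
    return kept
```

The call-rate cutoff is the one that bites hardest in a multi-chip study: a SNP absent from one chip type is missing for every sample typed on that chip, so it cannot reach 99%.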

Samples were removed if they were either too similar to, or too different from, other samples - which should take care of both overrepresentation of extended family pedigrees, and subjects who do not match ethnically.

Some population stratification is expected, and an analysis was performed to control for the multiple sites used, leaving a no-bigger-than-expected "genomic inflation figure" of 1.09 - the WTCCC collections reported figures in the range of 1.03-1.11, with the comment that "overall effect of population structure on our association results seems to be small, once recent migrants from outside Europe are excluded." At the time, this was considered somewhat surprising.
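For readers who have not met the inflation figure before, the usual definition is the median of the observed 1-df chi-square association statistics divided by the median expected under the null (about 0.4549). A minimal sketch, assuming you already have one chi-square statistic per SNP; the function names and the simulated data are illustrative only:

```python
import random
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 degree of freedom


def genomic_inflation(chi2_stats):
    """Median observed test statistic relative to the null median (1.0 = no inflation)."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def simulated_null_stats(n_snps, rng):
    """Chi-square(1) statistics for SNPs with no association: squared standard normals."""
    return [rng.gauss(0, 1) ** 2 for _ in range(n_snps)]


if __name__ == "__main__":
    rng = random.Random(0)
    null = simulated_null_stats(20000, rng)
    print("null lambda:    ", genomic_inflation(null))
    # Uniformly inflating every statistic by 9% inflates lambda by the same factor.
    print("inflated lambda:", genomic_inflation([1.09 * s for s in null]))
```

A value near 1.0 says the bulk of SNPs behave as if unassociated; values creeping above that suggest structure or artefact is inflating test statistics across the board, or, as this paper would argue, a great many real small effects.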

4. without the replication in other disease collections - and with no evidence whatsoever - I'd still be suspicious that Shaun and co had come up with some new way of generating an artifactual result. It is plain others didn't believe it either, as there are 7 pages of Supplementary Information - S13, pp24-30 - on "Addressing population stratification and other possible confounders".

OK, so the result is real. What does it mean?

At heart, this is a stats paper. There is an excellent discussion, from data simulation experiments, of what magnitude of associations, and what allele frequencies, would be needed to produce such data - dismissing both the multiple-common-variants and all-rare-variants alternatives.

However, before the upbeat ending, the authors note:

A highly polygenic model suggests that genetically influenced individual differences across domains of brain development and function may form a diathesis for major psychiatric illness, perhaps as multiple growth and metabolic pathways influence human height. Our results may also reflect heterogeneity, such that some patients have aetiologically distinct diseases. The shared genetic liability between schizophrenia and bipolar disorder, previously suggested by clinical and genetic epidemiology, opens up the possibility of genetically based refinements in diagnosis. However, the scores derived here have little value for individual risk prediction, meaning that application to clinical genetic testing for schizophrenia would be unwarranted. In the future, measures of polygenic burden, along with known risk loci and non-genetic factors such as season of birth, life stress, obstetrical complications, viral infections and epigenetics, could open new avenues for studying gene-gene and gene-environment interactions.

So, with no real hope of picking gene pathways when your best association result is in half the SNPs tested, and (therefore) with no predictions, this is not the paper Nature thought it was selling us, and not the one some people thought they were buying.

But it is still a tour de force and deserves to be read and discussed.


Notes:

[1] yes, with weekly calls not uncommon, these would be turned into a bunch of acronyms;

[2] crudely estimated from cost of GWAS chip on 7,000 subjects plus staff and overheads;

[3] I'm assuming Shaun Purcell - more widely known for the development of PLINK: whole genome analysis toolset - presented the analysis: only those involved in the large Data Analysis group will know who did what;

[4] e.g. 5 * 10^-7 in WTCCC

[5] a wildly complicated region implicated in almost all auto-immune diseases, and with a bizarre haplotype naming scheme. Phylogenetic analysis suggests the earliest haplotype split was at least 40 million years ago

[6] 22q11.2 deletion region and ZNF804A replicated - the rest did not, see Supplementary Information section 5

[7] there is some disagreement about how to impute data - mostly around whether 60 CEU founders in the HapMap are enough for a brute force attempt. The 1000 Genomes project should see this off.

[8] most disease groups seem to have several competing international consortia rather than a single one, but who nonetheless can be persuaded to bury the hatchet when mutual interest requires it.


Neil Walker is "Head of Data Services" - a job title invented to avoid any pay and grading mishap, at a time when "data manager" was seen as synonymous with "data entry clerk" - at the Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, and had a major role in the project management and QC of the first WTCCC experiment. His stats are, frankly, weak but he is very good at not believing results until he has been shown the data.



3. the most obvious reason why 2 halves of a GWAS would look alike in their associations, is that the case and control samples, or case and control subjects differ in some way other than by the disease status of the subjects.[...]

Hmm. An ... interesting ... take on things, Neil, but you conveniently fail to mention that (i) this data is drawn from pretty much every sample available in the world; (ii) there is precious little to go on for the genetic basis of psychiatric disease, despite its high heritability; and (iii) wrt the quote above, the score makes a prediction from one study to another. If it were simply stratification or artefact, why would you expect the same SNPs to give signal? When using 50% of the markers in one study, you'd really have to have either exactly the same population structure or technical biases tracking across independent datasets, which doesn't really seem that plausible, somehow.

Interesting to see how much chatter this article is generating :-)

btw - full disclosure: I work in the same sphere as Shaun and know him personally, although I am not involved in the ISC.

I suspect that disease heterogeneity is a significant confounder in these big name gwas. But perhaps we're looking from the wrong dimension. If it's true that "this data is drawn from pretty much every sample available in the world" big whoop. Data only gets bigger. As is, there is a huge temporal bias. I'd like to study schizophrenia over every member of a 10 generation family tree. We don't have that yet. We will.

A "single-layer" perceptron can't implement XOR (src). I believe GWAS are similarly unable to see XOR-like multi-SNP effects. I can't prove it, but perhaps others could say?
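A toy enumeration makes the XOR worry concrete. The assumptions are loud ones: two binary loci, each variant carried by exactly half the population, independent of each other, and disease if and only if exactly one variant is carried. Nothing here comes from the paper; it is only a sketch of why a one-SNP-at-a-time test would see nothing under such a model.

```python
from itertools import product


def xor_disease(a, b):
    """Hypothetical model: affected only when exactly one of the two variants is carried."""
    return a != b


def marginal_risk(locus, value):
    """P(disease | carrier status at one locus), with the other locus at 50/50."""
    combos = [c for c in product([0, 1], repeat=2) if c[locus] == value]
    return sum(xor_disease(*c) for c in combos) / len(combos)


def joint_risk(a, b):
    """P(disease | carrier status at both loci): either 0.0 or 1.0 under this model."""
    return 1.0 if xor_disease(a, b) else 0.0


if __name__ == "__main__":
    for locus in (0, 1):
        print("locus", locus, "risk if absent:", marginal_risk(locus, 0),
              "risk if carried:", marginal_risk(locus, 1))
    for a, b in product([0, 1], repeat=2):
        print("genotype pair", (a, b), "risk:", joint_risk(a, b))
```

Under these assumptions each locus on its own carries no information about disease, while the pair is perfectly informative. The exact cancellation depends on the 50/50 frequencies; with unequal frequencies a weak marginal signal leaks through.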

Chris - this was meant to be a general point about where scepticism comes from, rather than a criticism of this study:

3. the most obvious reason why 2 halves of a GWAS would look alike in their associations, is that the case and control samples, or case and control subjects differ in some way other than by the disease status of the subjects.[...]

Hmm. An ... interesting ... take on things, Neil, but you conveniently fail to mention that (i) this data is drawn from pretty much every sample available in the world; (ii) there is precious little to go on for the genetic basis of psychiatric disease, despite its high heritability; and (iii) wrt the quote above, the score makes a prediction from one study to another.

But, to rise to the bait:

(i) it is just about plausible that sample prep choices on a consortium-wide level could ruin association results, even on a stupidly large set of samples. For example, trivially, we know that sample concentration varies enormously - by lab and protocol - and is a factor in how well SNPs are called in GWAS. This is why the QQ plots tend to improve with call rate. Less trivially, we're now seriously considering whether we can put our cell-lined cases against whole-blood controls in a next-seq experiment, given we know - thanks to a tip off from an indiscreet conference blog from Daniel, as it happens - that the former will throw up sequencing artifacts relative to the latter.

(ii) agreed, there is precious little to go on for the genetic basis of psychiatric disease, despite its high heritability. From this study, the statistical modelling chunk may lead to some testable hypotheses, but that is not psychiatry-specific - we've already begun to think whether we can model disease heterogeneity in diabetes using Shaun's techniques (remember, you read it here first) - and the couple of new hits are worth following up. But it's thin fare, rescued by heroic stats.

(iii) I don't see any meaningful prediction at a personal level. If I gave you results for a GWAS-ed sample from any big study you could trivially assign it to the country and/or ethnic group the subject came from. You couldn't place it into the disease group it came from, if the disease groups were represented in proportion to the prevalence of the disease. To be concrete: say I have 10,000 samples, and 100 of these came from schizophrenic cases - a realistic prevalence of 1%. If I give you the GWAS results of one subject, you wouldn't be able to tell whether it came from a schizophrenic case or not, whatever the genotypes. But 2,000 schizophrenics vs 2,000 not - then you can tell the difference, repeatedly, and across countries, between the two groups. While interesting, it doesn't necessarily get us very far.

Same holds true for most complex diseases, if not all - discuss.
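To put rough numbers on point (iii), here is a back-of-envelope sketch using Bayes' rule. The sensitivity and specificity of 0.6 are invented for illustration (the paper reports no such classifier); only the 1% prevalence and the balanced 2,000-vs-2,000 comparison come from the argument above.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(truly a case | score calls the subject a case), by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical weak score: right 60% of the time in both cases and controls.
    sens, spec = 0.6, 0.6
    print("PPV at 1% prevalence (10,000 samples, 100 cases):",
          positive_predictive_value(sens, spec, 0.01))
    print("PPV in a balanced 2,000 vs 2,000 collection:    ",
          positive_predictive_value(sens, spec, 0.5))
```

The same weak score that separates two balanced groups on average is close to useless for a single subject drawn from a population where 99 in 100 people are unaffected, because the false positives swamp the true ones.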

cariaso wrote:

I suspect that disease heterogeneity is a significant confounder in these big name gwas.

Absolutely.

But perhaps we're looking from the wrong dimension. If it's true that "this data is drawn from pretty much every sample available in the world" big whoop. Data only gets bigger.

It means you haven't got enough samples for a follow-up study - we've been there.

As is, there is a huge temporal bias. I'd like to study schizophrenia over every member of a 10 generation family tree. We don't have that yet. We will.

The highest estimate of parent-to-offspring transmission rates I've seen quoted for schizophrenia is 1 in 4. This is thought to be an over-estimate, because of "anticipation" bias - whereby the child of someone with a major psychiatric disease is diagnosed more readily, and earlier, than someone without such parents. The chance of being diagnosed if your sib is diagnosed is more like 8-14%.

Perhaps because of the loss-of-affect aspect of schizophrenia, rather than its more famous psychotic symptoms, sufferers also tend to have fewer children than the general population - perhaps 50% down?

So, you're not going to see a 10-generation family of schizophrenic cases - unless you've found some new (and probably very rare) monogenic disease with a similar outcome.

But that is all part of heterogeneity.

"As is, there is a huge temporal bias. I'd like to study schizophrenia over every member of a 10 generation family tree. We don't have that yet. We will."

I believe that's true, but I'm not certain.

The long and the short of it is that schizophrenia is highly heritable, like just about everything else, yet like just about everything else, we cannot find any genes for it.

Given that evolution will strongly remove strong direct genetic influences on schizophrenia, those genetic influences that remain will be shielded in some way from evolution - which will also make them hard to observe by gene scanning.

Any mutation that directly and straightforwardly caused a substantial risk of schizophrenia will be strongly selected against, therefore, will be very rare, hence only observable in full genome scans, which we are not yet doing.

Therefore, any common gene for schizophrenia that could potentially be observed by present methods will be of small direct effect.

These common, small effect genes, will each affect one of several quantities, pathways, probably a dozen or so, that cannot be directly observed, but if any one of these quantities is substantially perturbed from normal values, perhaps by having a hundred or so common genes of small effect that alter one of these dozen or so unobservable quantities all in the same direction, then the individual is at risk.

Suppose we have 10 quantities, pathways to madness, call them X1, X2, X3, X4, X5, X6, X7, X8, X9, X10

Suppose that each of these quantities is affected by a thousand common genetic variants of small effect. Some of these variants increase X3, some of them decrease X3. A variant that increases X3 does not cause madness except in combination with lots of other variants that increase X3.

Worse, suppose X3 has an optimum value, as it probably does - if X3 is too high, you go mad in one way, and if X3 is too low, you go mad in a subtly different way, but get the same diagnosis either way. In that case, common variants that have substantial effect still will not show up in such studies - nor will evolution very efficiently winnow them out, so they are likely to be the sort of genes that remain.

What we would therefore expect is that there will be near zero correlation between schizophrenia and any common variant, but substantial correlation between schizophrenia and combinations of many variants - because evolution would rapidly eliminate any variant where there was a direct correlation.
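The "optimum value" argument can be checked on a toy case by brute-force enumeration. The assumptions are entirely illustrative: six variants, each nudging one hidden quantity X up (+1) or down (-1) with equal probability and independently, an optimum of 0, and disease whenever X strays 4 or more from that optimum in either direction.

```python
from itertools import product


def is_affected(variants, threshold=4):
    """Disease if X (the sum of +1/-1 nudges) is far from its optimum of 0, either way."""
    return abs(sum(variants)) >= threshold


def risk_given(position, value, n_variants=6, threshold=4):
    """P(disease | direction of one variant), all other variants equally likely +1 or -1."""
    combos = [c for c in product((-1, 1), repeat=n_variants) if c[position] == value]
    return sum(is_affected(c, threshold) for c in combos) / len(combos)


if __name__ == "__main__":
    for position in range(6):
        print("variant", position,
              "risk if it raises X:", risk_given(position, 1),
              "risk if it lowers X:", risk_given(position, -1))
```

In this toy model the risk is identical whichever direction any single variant pushes, so a one-variant-at-a-time scan sees no association, even though the combination of all six determines disease completely. Real allele frequencies and effect sizes are not this symmetric, so the cancellation would be partial rather than exact.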

Great discussion. I'm inclined to agree with both Cariaso and Neil--heterogeneity has to be mucking things up in a major way. The discovery of rare single-gene causes of Alzheimer's and Parkinson's et al was greatly aided by large multigenerational pedigrees, including those in population isolates. It's hard to imagine having access to reagents like that in schizophrenia unless one collects phenotypically mild (and presumably homogeneous) versions.

Thanks for the terrific post!