Serious flaws revealed in longevity genes study

When an article was published in Science last week reporting that DNA samples from exceptionally long-lived individuals differed detectably from those of normal individuals, it got plenty of positive attention from the mainstream media. However, the buzz from experts was rapid and telling: my colleagues in the statistical genetics community weren't excited about the results; they were immediately and profoundly skeptical.
For people who've spent years doing genome-wide association studies (GWAS), several things stand out as unusual from this paper: the very large effect sizes of the identified SNPs (as noted by colleague Jeff Barrett in the Guardian), the extraordinary claim that the identified variants were able to correctly classify individuals as potential centenarians with 77% accuracy (a totally unprecedented level of accuracy for a complex trait), the fact that the associated variants haven't previously been associated with protection against any other common diseases (as you might expect them to be, given longevity is effectively a matter of avoiding or surviving every common disease), and several subtle technical issues, such as a rather strange-looking Manhattan plot (shown at the end of this post).
If the paper's claims were true they would be truly remarkable. However, the general feeling from the GWAS community right now seems to be that the identified associations are likely to be largely or even entirely artefactual, the result of failing to fully control for differences in the genotyping methods used in the cases and controls. The study used a mixture of two different genotyping platforms (albeit both made by Illumina) for their centenarians, while the control data was taken from an online database containing samples examined using multiple platforms. Disturbingly, similar potential genotyping bias also affects their replication cohort.
In a great article in Newsweek today Mary Carmichael has a series of damning quotes from big-name geneticists casting doubt on the study's findings. deCODE Genetics CEO Kári Stefánsson is (unsurprisingly) the most vociferous: he notes that there are consistent and previously known genotyping problems on the SNP chip used in the study for the two most strongly associated SNPs, and then goes further to argue that technical problems probably underlie nearly all of the reported associations in the paper:
Stefánsson says he is "convinced that the reported association between exceptional longevity and most of the 33" variants found in the Science study, including all the variants that other scientists hadn't already found, "is due to genotyping problems." He has one more piece of evidence. Given what he knows about the 610-Quad, he says he can reverse-engineer the math in the BU study and estimate what fraction of the centenarians were analyzed with that chip. His estimate is about 8 percent. The actual fraction, which wasn't initially provided in the Science paper, is 10 percent, the BU researchers tell NEWSWEEK. That's close, given that Stefánsson's calculations look at just two of the variants found in the study and there may be similar problems with others.
Carmichael goes on to note a major methodological flaw in the paper: the failure to even attempt to validate any of the associated SNPs on an independent platform. This is absolutely standard practice in normal GWAS, and should have been demanded by referees - especially given the extraordinary claims being made in the paper.
What needs to happen next? For a start, the authors should release the raw intensity data for their genotyping experiments, which would allow independent investigators to spot obvious problems. Doing so immediately on a public database would go a long way towards showing they're not trying to cover up any methodological flaws. Ideally, they should also validate their putative associated SNPs using an independent platform and release those raw data as well.
More broadly, this is an important lesson for the increasing number of investigators wandering into the GWAS arena: they need to be aware that the genotype data they're working with aren't just clean, digital data points, but best-guess estimates (typically very reliable, but sometimes badly flawed) based on a noisy fluorescent intensity signal. There's a reason why researchers working on GWAS spend so much of their time on a regimented series of upstream "data cleaning" steps and careful downstream validation of new associations - it's all too easy for noisy data to introduce bias that produces a false association signal. So, kids, don't end up in Newsweek for all the wrong reasons: talk to someone who really knows what they're doing when it comes to GWAS data.
Finally, major journals need to stop letting sexiness push aside scientific rigor. Carmichael says it nicely:

Still, one has to wonder how the paper wound up in Science, which, along with Nature, is the top basic-science journal in the world. Most laypeople would never catch a possible technical glitch like this--who reads the methods sections of papers this complicated, much less the supplemental material, where a lot of the clues to this mystery were?--but Science's reviewers should have. It's clear that the journal--which hasn't yet responded to the concerns raised here--was excited to publish the paper, because it held a press conference last week and sent a representative to say as much.

If the key results from this paper do turn out to be based on easily-detected experimental artefacts, Science deserves to be embarrassed.
Anyway, here's the image that really made me go "whoah" - the Manhattan plot from the paper, tucked away in the Supplementary Data, which shouts "artefact" to anyone who's seen even a few GWAS papers. For the uninitiated, each dot in the plot represents a different SNP, with the alternating bands of colour showing different chromosomes. The y axis indicates the strength of the association between that SNP and longevity. The plot is unusual for a GWAS in that all of the highest-ranked SNPs are hanging out there by themselves, rather than being flanked by a column of other associated variants - a pattern characteristic of genotyping error rather than true association.


In contrast, here are the Manhattan plots from a "good" GWAS - the Wellcome Trust Case Control Consortium's analysis of 7 different common diseases (I've trimmed out two uninteresting ones), with the statistically significant SNPs highlighted in green. You can see that basically all of the most strongly significant SNPs are found in a "tower", the result of nearby SNPs being correlated with one another and thus all marking the same association signal:
Spot the difference?
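
For readers who'd like the "lonely dot versus tower" distinction in a more concrete form, here is a rough sketch of how one might flag isolated hits automatically. Everything in it is an assumption made for illustration: the function name, the toy data, the 250 kb window and the two thresholds (7.3 is roughly -log10 of the conventional 5e-8 genome-wide cutoff). It is not a method used by the paper's authors or by the WTCCC, and a sensible window would depend on chip density and local linkage disequilibrium.

```python
from typing import List, Tuple

# Hypothetical record layout: (chromosome, base-pair position, -log10 p-value).
Snp = Tuple[str, int, float]


def isolated_hits(
    snps: List[Snp],
    hit_threshold: float = 7.3,      # assumed cutoff, about -log10(5e-8)
    support_threshold: float = 4.0,  # assumed "suggestive" level for neighbours
    window_bp: int = 250_000,        # assumed neighbourhood size
) -> List[Snp]:
    """Return significant SNPs with no supporting neighbour nearby.

    A true association usually drags correlated neighbouring SNPs up with
    it (a "tower"); a hit with no elevated neighbour deserves a closer look
    at its genotype calls.
    """
    isolated = []
    for chrom, pos, score in snps:
        if score < hit_threshold:
            continue
        supported = any(
            c == chrom
            and p != pos
            and abs(p - pos) <= window_bp
            and s >= support_threshold
            for c, p, s in snps
        )
        if not supported:
            isolated.append((chrom, pos, score))
    return isolated


# Toy data: chromosome 1 has a tower, chromosome 2 has a lonely dot.
TOY_SNPS = [
    ("1", 1_000_000, 8.5),
    ("1", 1_020_000, 6.1),
    ("1", 1_050_000, 5.2),
    ("1", 5_000_000, 0.4),
    ("2", 3_000_000, 9.0),
    ("2", 3_010_000, 0.8),
    ("2", 3_040_000, 1.1),
]

print(isolated_hits(TOY_SNPS))  # [('2', 3000000, 9.0)]
```

A flag from a check like this is only a prompt to inspect the raw intensity data for that SNP, not proof of an artefact: a sparsely covered region can leave a genuine hit standing alone.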
Edit 09/07/10: Check out Peter Kraft's comment below for some further issues with the paper.


Excellent post. I am afraid, What the public "wants" to hear

"We can find out if you will live to 100"

Trumps good science or good medicine. Further leaving everyone with a bad taste in their mouth about GWAS developed tests for clinical use.

I can't believe this wasn't picked up in Science.....It would have been in NEJM.


Daniel, thanks for the excellent commentary.

"If the key results from this paper do turn out to be based on easily-detected experimental artifacts, Science deserves to be embarrassed."

I think that Science should be embarrassed whatever the outcome. There are so many issues with this paper, and the latest is your revealing Manhattan plot, that it should never have been accepted without further verification. Especially by Science. Even if these results turn out to be correct, if they keep on accepting papers with this level of uncertainty, a large proportion of them will be false.

Then there's that 77% figure.

(1) Results from sophistimacated multi-locus models like those used by Sebastiani et al. will be even more susceptible to artefact caused by subtle (or not so subtle) measurement error at many loci. This is already a major worry with the rather simplistic "polygenic analyses" like those in last year's Nature paper on schizophrenia that simply sum up risk alleles. The model used here is far more flexible, which is a double edged sword--sure you might find some bizarre multi-locus combination that increased your chance of long life, on the off chance that there is one, but you're also much more likely to find odd artefacts, which surely do exist.

(2) Even taking their model as true, that 77% number is meaningless outside the context of this study. That number depends on how many people in your sample have the trait you are trying to predict. The replication sample was about 50% centenarians. So 77% sounds impressive, but the appropriate baseline is about 50%, which is how accurate Paul the octopus would be. Plus that number has no meaning for somebody thinking about their chance of living a long time. I was glad to see Kari and David jump on this point in the Newsweek article.

Yes, if true [a big if], the fact that a model using these markers does better than guessing at random shows that these markers are in combination associated with longevity--but that's all it shows. Folks have to be very careful with language about "predictive accuracy" because there are so many different metrics, most of which are not immediately applicable in the context of personalized medicine, which is very likely how they will get picked up in the popular press, as Sebastiani et al. have learned. I've heard them on local radio arguing against using this signature to predict one's probability of living to 100, but that may be trying to close the barn door after the horses are out.

By Peter Kraft (not verified) on 08 Jul 2010 #permalink

Excellent post. From the looks of things, any reviewer with the slightest familiarity with GWAS would have instantaneously seen the problems with the data even with a superficial look at the figures. This suggests to a high degree of likelihood that Science editorial staff failed miserably in getting appropriate reviewers for this paper.

The analysis proceeded using a Bayesian model which built on the most significant SNP, rs1036819, which was wrong. See the supplementary material, page 10 (page #9):

That's probably highly significant because of the genotyping artifact. And many of the other SNPs identified probably are as well.

It's a bit like setting out from Nashville on a trip to NYC by going West on I-40...with a broken GPS.

The plot you referenced: what platform was that WTCCC data generated on?

excellent article.

By genome grrl (not verified) on 08 Jul 2010 #permalink

The WTCCC used Affymetrix chips (the old GeneChip 500K Mapping Array Set, to be precise).

Hi Peter,

Excellent commentary, as always. I've edited the post to point people to your discussion.

Good post. I agree with Peter Kraft in that the "77% accuracy" of the paper is misleading, but I disagree somewhat with his explanation of why. The authors claim 77% specificity and sensitivity and hence 77% accuracy (probably as defined by ). One of the useful features of these statistics is that they are all independent of penetrance (here the probability of reaching 100), e.g. sensitivity is the fraction of cases correctly predicted.
The misleading bit is that when you have a low penetrance (probability of reaching 100 is probably between 1/1000 and 1/10,000) then having specificity of 77% is not particularly useful. Assuming a penetrance of 1/1000, my calculations say your chance of reaching 100 will grow to 3.3/1000 if you are predicted to be longevious (the positive predictive value) and fall to 0.3/1000 if you are not (1-the negative predictive value). Not many people, apart from statisticians, would call that 77% accurate.

By Daniel Gudbjartsson (not verified) on 09 Jul 2010 #permalink
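
The arithmetic in the comment above can be checked with a few lines of Bayes' rule. This is a minimal sketch using the figures quoted there (77% sensitivity, 77% specificity, an assumed penetrance of 1/1000); the function and variable names are made up for illustration.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, risk if test-negative) for a binary test.

    PPV is P(outcome | positive test); the second value is
    P(outcome | negative test), i.e. 1 minus the negative predictive value.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    risk_if_negative = false_neg / (false_neg + true_neg)
    return ppv, risk_if_negative


# Figures from the comment: 77% sensitivity and specificity,
# with an assumed 1-in-1000 chance of reaching 100.
ppv, risk_neg = predictive_values(0.77, 0.77, 1 / 1000)
print(f"predicted long-lived: {ppv * 1000:.1f} per 1000")           # 3.3 per 1000
print(f"predicted not long-lived: {risk_neg * 1000:.1f} per 1000")  # 0.3 per 1000
```

Setting the prevalence to 0.5, roughly the make-up of the replication sample, gives a positive predictive value of exactly 0.77 from the same function, which is why the headline number looks so much better inside the study than it would in the general population.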

Thanks Daniel G, I had missed the fact that what they call "accuracy" is the sensitivity and specificity of their test. I'm used to the convention where accuracy is defined rather intuitively as the marginal %age of predictions that are correct--i.e. if you randomly select somebody from the population (blind to their outcome and their genotype), accuracy is the probability their predicted outcome matches their true outcome. Sensitivity and specificity are typically referred to as measures of "discrimination"--a way of summarizing how test results differ between those with the outcome and those without.

As you point out, having seemingly high sensitivity and specificity does not guarantee a high probability of having the outcome if you test positive. This is a general property of conditional probabilities, discussed here:

To be fair to me ;-) it is rather odd to report sensitivity (the probability somebody with the outcome tests positive) and specificity (the chance somebody without tests negative) as one number. They need not be the same--and in fact they rarely are (they are different in the longevity training set, for example). It was rather serendipitous--and ultimately a little confusing--that they could be reported as a single number.

The other odd thing here is sensitivity and specificity make sense for binary tests, where one tests positive or negative. Fancy genetic risk algorithms like the one discussed here don't return a binary prediction: they return a continuous probability, somewhere between 0 and 1. Sebastiani et al. used an arbitrary threshold to define positive and negative tests (predicted risk higher than the proportion of cases in the sample was a positive test--using this threshold Paul the octopus' risk model would have 43% sensitivity and 57% specificity). But there are other thresholds (anywhere from 0% to 100%), and each will give a different sensitivity and specificity. This is why the discriminatory ability of a test is usually presented as a receiver operating characteristic or ROC curve (plot of sensitivity versus 1-specificity over the range of possible thresholds--available for the longevity model in the supplementary materials). The area under this curve (aka the C or concordance statistic) is often used as a one-number summary of the discriminatory ability of a test. This is the figure 23andMe could not replicate.

Although it is something of a workhorse, there has been a lot of discussion of the shortcomings of the ROC curve in the biostatistics literature lately as a way of summarizing a predictive medical test. (This discussion is not restricted to genetic tests.) For example, since the C-statistic averages over all possible thresholds, it weights what may be implausible decision rules like "treat everybody" or "treat nobody" equally with more plausible strategies (only treat folks where the benefits of treatment outweigh the risks). Of course what constitutes plausible depends on context--those risks, benefits and costs.

By Peter Kraft (not verified) on 12 Jul 2010 #permalink
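
To make the threshold point in the comment above concrete, here is a small sketch that sweeps every possible cutoff over a set of predicted risks, records the sensitivity and 1-specificity at each one (the ROC curve), and integrates the curve with the trapezoid rule to get the C statistic. The scores and labels are invented toy values, not data from the longevity study, and the function names are hypothetical.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    A sample tests "positive" at threshold t when its score is >= t.
    Labels are 1 for cases and 0 for controls.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nobody tests positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented predicted risks for four cases (1) and four controls (0).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(auc(curve))  # 0.8125

# A model that gives everyone the same risk cannot discriminate at all.
print(auc(roc_points([0.5] * 8, labels)))  # 0.5
```

Each point on `curve` is a different sensitivity/specificity pair from the same model, so a single "77%" corresponds to just one of them; the area summarizes the whole sweep, with the caveats about implausible thresholds noted above.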

To reach 100, after many years of looking and feeling younger, faster and stronger, do what many healthy centenarians do--eat a low-fat diet with lots of grains and legumes. All seeds contain the simple glucose isomer Inositol, which activates the same genes as long-term caloric restriction (J Barger, 2008). My father reached 99 on oat porridge. Check out anti-ageing among your friends and acquaintances--it is happening right now, in low-fat vegetarians, and also in omnivores who do whole-grain breakfasts. These folks look years younger than their age, and have unusual energy and endurance (Inositol activates the master gene for mitochondrial biogenesis and cellular energy production, PGC 1 alpha). It's as simple as that.

By Dr Robert Peers (not verified) on 20 Jul 2010 #permalink

Very instructive post, Daniel.

What struck me first was that single-locus results were downplayed, and the paper appeared to zoom in on aggregate analysis. I'd always thought that single-locus SNPs must be replicated exactly before any form of multi-variate / multi-SNP score is performed.

By Chiea C Khor (not verified) on 21 Jul 2010 #permalink

A Manhattan plot from an ILMN GWAS chip would be more convincing. Although it is a classical platform, we know the Affy 500K chipset, used by WTCCC, has highly correlated SNPs in some regions while missing some other regions. Assays on the ILMN chips are expected to be more evenly distributed and less correlated.

Did anyone notice the new notice added to the beginning of this paper?

"Following publication of their paper in Science Express, Sebastiani et al. were made aware of an inherent defect in the 610-Quad chip that they used to genotype 7% of their discovery set (60 of 801 samples) and 17% of their replication set (44 of 254 samples). This defect may have led to incorrect genotyping of some of the SNPs that Sebastiani et al. used to build their genetic classification model for exceptional longevity. The authors are reanalyzing their data to determine the extent to which the genotyping errors affect their classification model.
The Abstract has been edited for clarity. Sentence 3 in the original Abstract has been replaced with “Using these data, we built a genetic classification model that is based on 150 single-nucleotide polymorphisms (SNPs). When we applied this model to an independent set of centenarians and control individuals of average longevity (AL), we found that it correctly classified individuals into the EL or AL group 77% of the time.”"

Hi Adam,

Yes, that's been there since just after this controversy erupted (but hasn't been commented on publicly, so thanks for adding it to the thread).

One can't help but wonder, after three months, just how much longer the reanalysis will take - but rest assured there are several people who are actively pushing to get hold of the final results...