Limitations of the 1000 Genomes Project

A Nature News article discusses the ongoing 1000 Genomes Project, an international effort planning to sequence 1,200-1,500 human genomes. The discussion springs from project co-chair David Altshuler's update at last week's American Society of Human Genetics meeting on the progress of the project (in brief: 3.8 terabases down, 996.2 terabases to go).

The article provides a generally positive overview of the project's historical context, goals and progress. The one contrary note comes from Duke University's David Goldstein, who has previously publicly expressed skepticism regarding the value of much of the data currently emerging from genome-wide association studies looking for common variants underlying common disease risk. Goldstein is also wary about over-stating the importance of the 1000 Genomes data:


"1000 Genomes will be hugely useful for growing the technology to generate and analyse sequence data," says David Goldstein of Duke University in Durham, North Carolina, adding "But in terms of a catalogue of the variants most important to human biology and disease, it's less clear how important it will be." Goldstein advocates sequencing people with extreme presentations of disease to understand more about common disease pathways.

This is an important point that isn't well explained in the article. The 1000 Genomes Project will provide a fairly comprehensive catalogue of genetic variation present at frequencies above 1% (at least within its target populations), but its coverage of rarer variants will necessarily be incomplete.

In one sense this is mind-numbingly obvious: no catalogue of variation would be truly "comprehensive" unless it sequenced every human being on the planet at birth. At the extreme end of the frequency spectrum, all of us carry perhaps a hundred or so variants that are unique to us (in terms of identity by descent, at least) since they occurred in the sperm and egg cells that gave rise to us - sequencing 1000 genomes won't find those variants unless we happen to be among the study participants. In between those unique variants and the lower end of the range covered by 1000 Genomes (perhaps somewhere between 0.1 and 1%) there is a large swath of human genetic variation that will be almost entirely missed by the project.
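The sampling argument above can be made concrete with a rough back-of-the-envelope sketch. It assumes random sampling of unrelated people from a single well-mixed population, treats each person as two independent chromosomes, and asks only whether a variant is present in the sample at all (actually detecting it from low-coverage sequence data is harder, which the `min_copies` parameter crudely stands in for). The function name and the round figure of 1,000 participants are illustrative choices, not anything taken from the Project itself.

```python
from math import comb


def prob_sampled(freq: float, n_people: int, min_copies: int = 1) -> float:
    """Chance that a variant at population frequency `freq` shows up at
    least `min_copies` times among the 2 * n_people chromosomes sampled.

    Simple binomial model; ignores population structure, relatedness
    and sequencing error (all assumptions of this sketch).
    """
    n_chromosomes = 2 * n_people
    # Probability of seeing fewer than `min_copies` copies of the variant.
    p_too_few = sum(
        comb(n_chromosomes, k) * freq**k * (1 - freq) ** (n_chromosomes - k)
        for k in range(min_copies)
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    # Frequencies from 1% down to 0.01%, for a hypothetical 1,000 participants.
    for freq in (0.01, 0.005, 0.001, 0.0001):
        once = prob_sampled(freq, 1000)
        twice = prob_sampled(freq, 1000, min_copies=2)
        print(f"freq {freq:.2%}: >=1 copy {once:.3f}, >=2 copies {twice:.3f}")
```

Under these assumptions a 1% variant is all but certain to be present among 1,000 people, while a variant at 0.01% has less than a one-in-five chance of appearing even once - and the odds get worse if a variant needs to be seen more than once before it can be called with confidence.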

Do these extremely rare variants actually matter, in terms of predictive personalised medicine? Goldstein obviously thinks so, and I agree with him. There are sound theoretical reasons and growing empirical evidence to suggest that the lower end of the frequency spectrum is enriched for large-effect disease risk variants, precisely those variants that will be most useful for making predictions about individual disease risk. These variants will only be identified by deep resequencing of large cohorts of disease patients and controls, with the "extreme cohort" approach advocated by Goldstein representing a particularly powerful strategy. I'll have a lot more to say about hunting rare variants over the next couple of weeks.

In addition to missing extremely rare variants, the short-read sequencing technology providing the backbone of the Project's pipeline will also struggle with the regions representing the real "dark matter" of the genome: the highly repetitive regions, constituting perhaps 10-15% of human DNA, which are largely untouchable by short-read platforms. Improved sequencing technology will dig further into these areas over the next twelve months, but it's likely that many unmappable regions will persist well beyond that. It's currently unclear how much functionally important variation exists in these regions - however, no catalogue of variation that excludes them can realistically be said to be "comprehensive".

That's not to say that the Project isn't an important step in the right direction - on the contrary, the data emerging from the Project over the next twelve months will be incredibly useful in many areas of human genetics (e.g. nailing down the causal variants in regions highlighted by genome-wide scans for disease risk, normal variation, and recent natural selection; and cataloguing the "pretty rare" variants between 0.5 and 5% frequency for the next generation of genome scans). But it is important to emphasise that the map of variation generated by the Project will still contain substantial dark areas - and that there's plenty of work left for human geneticists to do.

(As an aside, what's with the picture associated with the Nature News article? That's a pretty limited slice of human genetic diversity right there...)


the short-read sequencing technology providing the backbone of the Project's pipeline will also struggle with the regions representing the real "dark matter" of the genome: the highly repetitive regions

Agreed. I think we're going to find that structural variation plays more of a role than anyone imagined 4 or 5 years ago. Besides the obvious deleted, truncated, or fused genes, there are all sorts of more subtle dosage effects to consider. No one technology is going to address all this.