It’s difficult to distill a meeting as data-rich as the Cold Spring Harbor Biology of Genomes meeting, but here’s a first-pass attempt.
We’re sequencing lots of people
One of the highlights of the meeting was the update on progress from the 1000 Genomes (1KG) Project. I was fortunate enough to have been given a sneak peek at the data (which you can download yourself if you’re so inclined) at the 1KG satellite meeting earlier in the week, but it was still impressive to see it all put together in the presentation today by Goncalo Abecasis.
Abecasis reported on the data emerging from the three pilot projects of 1KG: a very high-resolution analysis of six individuals (three individuals each from a European and a West African family); a much lower-resolution scan across the genomes of 180 individuals (60 Europeans, 60 West Africans, and 60 East Asians); and a targeted analysis of 1000 randomly selected genes in several hundred individuals from multiple populations.
The data emerging from the pilot projects are still pretty raw, but the numbers are impressive: the project has already identified over 20 million single-base variants (SNPs), over 11 million of which are completely novel; 40,000 short insertion/deletion polymorphisms; and over 4,000 larger structural rearrangements of DNA.
There’s much more to come: the project is scaling up to generate low-coverage sequence data for 1,200 individuals by the end of 2009, and may expand this set of samples to incorporate additional populations in 2010. As befits a project seeking to create a resource for the broader genetics community, the data will be made publicly available as it is generated.
Lots of sequence data is useful
The catalogue of human genetic variants created by the 1KG project will be of immediate benefit to researchers working on the genetic basis of complex diseases. Gil McVean spelled out how this will work by applying early 1KG data to results from the Wellcome Trust Case Control Consortium using the process of genotype imputation.
Genotype imputation starts by using a reference panel with very high-resolution genetic data to define the patterns of association between nearby variants. That information can then be applied to a set of disease cases and healthy controls that have been genotyped at only a small subset of those variants; using the association data from the reference panel it is possible to impute the genotypes of these individuals at many other sites. Like magic, genotype imputation allows you to “see” genotypes at millions of sites throughout the genome that were never directly typed experimentally.
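To make the idea a little more concrete, here is a deliberately tiny sketch of imputation in Python. It is not how McVean or any real imputation tool does it: real methods typically use statistical models of diploid genotypes, whereas this toy version works on single haplotypes, picks the reference haplotypes that best match the typed sites, and takes a majority vote at each untyped site. The panel values, the `REFERENCE_PANEL` name, and the `impute` function are all invented for illustration.

```python
from collections import Counter

# Toy reference panel (invented values): each row is one fully typed
# haplotype, each column a variant site (0 = reference, 1 = alternate).
REFERENCE_PANEL = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
]


def impute(observed, panel):
    """Fill in untyped sites (None) of a haplotype using a reference panel.

    Simplified nearest-haplotype approach: find the panel haplotypes that
    agree with the observed alleles at the most typed sites, then take a
    majority vote among them at every untyped site.
    """
    typed_sites = [i for i, allele in enumerate(observed) if allele is not None]

    def matches(haplotype):
        # Number of typed sites where this panel haplotype agrees.
        return sum(haplotype[i] == observed[i] for i in typed_sites)

    best_score = max(matches(h) for h in panel)
    nearest = [h for h in panel if matches(h) == best_score]

    imputed = []
    for i, allele in enumerate(observed):
        if allele is not None:
            imputed.append(allele)  # directly genotyped: keep as-is
        else:
            votes = Counter(h[i] for h in nearest)
            imputed.append(votes.most_common(1)[0][0])
    return imputed


# An individual genotyped only at the first and last sites.
print(impute([1, None, None, None, 1], REFERENCE_PANEL))
```

The three middle sites were never typed in this individual, yet the patterns of association in the panel are enough to make a reasonable guess at them. A larger panel means more haplotypes to match against, which is the intuition behind imputation improving as more individuals are sequenced.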
I’ll talk more about the details of McVean’s results later; for now, suffice it to say that he showed that imputation using 1KG sequence data as a reference can add non-trivial value to the results of existing genome-wide association studies – value that will only increase as the number of individuals sequenced in the project increases.
Adding functional information to sequence data
Vast amounts of sequence data will have only limited value unless we can come up with ways of figuring out exactly which sites in the genome are actually functional, and of predicting what effects genetic variation can have on human physical variation and disease risk.
Several talks approached the functional annotation of the human genome from a variety of angles. Stephen Montgomery and Tony Kwan both discussed approaches to pin down the specific genetic variants that affect the levels of gene expression; such variants are excellent candidates for playing a role in other human traits. David Goode presented an analysis combining data on human genetic variation in regions that are conserved over deep evolutionary time, which suggested that the vast majority of genetic variants with functional effects reside outside the protein-coding regions of genes. Figuring out exactly which sites in the genome are actually under deep evolutionary constraint is non-trivial, but some extremely clever approaches to inferring this using low-coverage genome sequences from 29 mammals were presented by Adam Siepel yesterday.
All of these approaches are interesting to anyone thinking about taking full advantage of personal genome sequences: how can we figure out which of the millions of genetic variants present in an individual’s genome actually have an impact on function and disease risk? Combining information from multiple sources will be essential to answering this question.