Jones et al. (2009). Exomic Sequencing Identifies PALB2 as a Pancreatic Cancer Susceptibility Gene. Science DOI: 10.1126/science.1171202
A paper published online today in Science illustrates both the potential and the challenges of using large-scale DNA sequencing to identify rare genetic variants underlying disease risk.
Traditionally, geneticists have pinned down such variants using large family studies. By using these families to track which parts of the genome tend to be co-inherited with the disease, it's possible to zoom in on the region of DNA that harbours the disease-causing mutation. This step is then followed by laboriously sequencing each gene within the disease-linked region (guided by functional information, when available) in the hope of eventually finding an obvious disruptive change.
Although this approach has been successful in identifying thousands of genes associated with severe disease, it breaks down when the disease is sporadic (i.e. is not associated with a family history), is found in a small family, or when other family members are not available for testing. Without a large family to link the gene with disease risk there's no way to narrow down the list of genes responsible, making it impossible to find the underlying mutation.
Until now. Within the last year, the combination of DNA capture approaches with cheap, large-scale sequencing technology has made it technically feasible to simply sequence every known protein-coding gene in the genome (collectively known as the exome) to hunt for possible mutations. Although this is not a complete genome sequence - it leaves out the ~98% of the genome that doesn't code for protein - the approach offers the possibility of finding novel protein-altering mutations even in isolated disease cases.
In the Science paper, the authors made use of exome sequence from a female pancreatic cancer patient to look for possible susceptibility mutations that may have predisposed her to the disease. In this patient's case the existence of a susceptibility gene (rather than an environmentally-induced cancer) was suggested by the fact that her sister had also developed the same type of cancer.
In the first sentence of the paper the authors frame this study as an explicit test of the utility of personal genome sequencing - and indeed it provides some useful insight into both the value of large-scale genetic data and just how difficult it will be to find disease-causing variants, even with the complete sequence of every protein-coding gene.
First, the good news: as you might have guessed from the fact that the study is published in Science, the authors did in fact find the likely disease-susceptibility mutation. They were able to distinguish this mutation from the many other variants in the patient's exome (more on those in a second) by a particular quirk of cancer susceptibility variants: they are often found in only a single copy (along with a healthy version of the gene) in normal tissue from a patient, whereas in cancer cells from the same patient the normal copy is disrupted.
Because the researchers had access to exome sequence from both normal tissue and cancer cells from the same patient, they were able to find just three genes containing variants that fit this pattern, only one of which looked like a realistic disease-causing candidate (the other two genes are known to contain disruptive mutations in healthy individuals).
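The filtering logic described above can be sketched in a few lines of Python. This is only an illustration of the "one bad copy in normal tissue, two bad copies in cancer" rule, not the authors' actual pipeline: the `GeneCall` record, its field names and the placeholder genes `GENE_A` to `GENE_D` are all hypothetical, and the toy data simply mirror the outcome reported in the post (three genes fit the pattern, two of which are known to be disrupted in healthy people).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneCall:
    """Hypothetical per-gene summary for one patient (diploid gene assumed)."""
    gene: str
    disrupted_copies_normal: int   # copies with a disruptive variant in normal tissue (0-2)
    disrupted_copies_tumour: int   # copies disrupted in tumour cells (0-2)
    disrupted_in_healthy_people: bool  # gene known to carry disruptive mutations in healthy individuals


def two_hit_candidates(calls):
    """Return genes with one bad copy in normal tissue and both copies bad in the tumour,
    excluding genes that are also found disrupted in healthy people."""
    return [
        c.gene
        for c in calls
        if c.disrupted_copies_normal == 1
        and c.disrupted_copies_tumour == 2
        and not c.disrupted_in_healthy_people
    ]


# Toy data: only PALB2 is a real gene name from the post; the rest are placeholders.
calls = [
    GeneCall("PALB2", 1, 2, False),
    GeneCall("GENE_A", 1, 2, True),   # fits the pattern, but disrupted in healthy people
    GeneCall("GENE_B", 1, 2, True),   # same
    GeneCall("GENE_C", 1, 1, False),  # normal copy still intact in the tumour
    GeneCall("GENE_D", 0, 0, False),  # no disruptive variant at all
]

print(two_hit_candidates(calls))
```

The point of the sketch is how much work the tumour comparison does: without the second condition, every gene with a single disrupted copy in normal tissue would remain on the candidate list.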
In a follow-up study the researchers looked at 96 other pancreatic cancer patients and found 3 more carrying mutations in the same gene; over 1,000 healthy individuals were free of such mutations. That makes it seem pretty likely that this is a genuine disease susceptibility gene, the first ever published example found using whole-exome sequencing.
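To get a rough feel for how unlikely that split is under chance alone, here is a back-of-the-envelope one-sided Fisher's exact test. It is a sketch, not the authors' analysis: the post only says "over 1,000" healthy individuals, so exactly 1,000 controls is an assumption, and the function name `fisher_one_sided` is made up for this example.

```python
from math import comb


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact test: probability of seeing at least
    `case_carriers` of the carriers among the cases if carriers were
    spread at random across cases and controls."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    favourable = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return favourable / comb(total, carriers)


# 3 carriers among 96 patients; 0 carriers among an ASSUMED 1,000 controls.
p_value = fisher_one_sided(3, 96, 0, 1000)
print(f"one-sided p = {p_value:.1e}")
```

Under these assumptions the probability comes out below one in a thousand, which fits the "pretty likely" verdict, although a real analysis would also need to account for how the patients and controls were selected.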
Now, the bad news: the researchers also found a whole stack of red herrings. In total, the authors looked at sequence from 20,661 genes, and identified 15,461 genetic variants not found in the reference human genome. Of these, 7,721 changed the sequence of the encoded protein, 64 resulted in abnormal stop codons, 108 were predicted to alter RNA splicing of the gene, and 250 were small deletions or insertions (115 of which would be predicted to dramatically alter the encoded protein through a frameshift). The stop codons, splicing mutations and frameshift insertion/deletions, and many of the protein sequence-altering variants, would all have to be regarded as plausible candidates for a disease-causing mutation.
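A quick tally of the numbers in the previous paragraph shows the size of the haystack. The counts come straight from the text above; the variable names and the grouping into "high impact" are my own, and treating every missense variant as a candidate gives an upper bound, since only "many" of them would really be plausible.

```python
# Variant counts quoted in the post for this one patient's exome.
total_variants = 15461
counts = {
    "missense": 7721,  # change the sequence of the encoded protein
    "nonsense": 64,    # abnormal stop codons
    "splice": 108,     # predicted to alter RNA splicing
    "indel": 250,      # small insertions or deletions
}
frameshift_indels = 115  # subset of the 250 indels

# Variants that would almost always be taken seriously as candidates.
high_impact = counts["nonsense"] + counts["splice"] + frameshift_indels

# Upper bound on plausible candidates if every missense variant is included.
plausible_upper_bound = high_impact + counts["missense"]

# Variants not covered by any category listed in the post
# (presumably changes that do not alter the protein; that is an assumption).
uncategorised = total_variants - sum(counts.values())

print(high_impact, plausible_upper_bound, uncategorised)  # 287 8008 7318
```

So even the strictest reading leaves nearly 300 severe-looking variants in a single patient, and a looser one leaves about 8,000.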
Although it would probably be possible to exclude many of these variants using other sources of information (e.g. functional information about the genes, presence in healthy controls, patterns of evolutionary conservation), this is an enormous number of potential disease-causing variants to filter. The success of the authors in identifying PALB2 as the disease-causing gene relied heavily on the "one bad copy in normal tissue, two bad copies in cancer" rule, but most other severe diseases do not provide such convenient sign-posts.
The sheer scale of the noise variation in the human genome has only really become apparent in the last two years, following the publication of the Watson and Venter genomes. Both of these genomes contained a huge number of variants that could easily be interpreted as disease-causing, often with no clear way of distinguishing the villains from the innocent bystanders.
As such, researchers hunting for disease-causing mutations using genome-scale data will find their traditional problem is now turned on its head: instead of being unable to find plausible mutations, they will be faced with far too many possible candidates.
That problem will only get worse as we move from exome sequences - which at least comprise segments of protein-coding DNA for which we mostly understand the basic biological rules - to the vast, swampy, uncharted morass of non-coding DNA that makes up the other 98% of our genomes. It's clear from recent genome-wide association studies that the majority of disease risk variants are lurking in these regions, but we're currently almost entirely unable to filter out the functional disruptors from the millions of other polymorphisms littering non-coding DNA.
So the message from this paper is mixed. On the one hand, this is a genuine triumph for brute-force genomics, a case where generating staggering amounts of sequence data produced results with very clear clinical relevance. On the other hand, filtering out the true disease mutation from the background noise owed a hefty amount to the special properties of tumour suppressor genes, and more than a little luck; this approach will not be so easy in all cancer patients, and certainly not in patients suffering from other genetic diseases.
There's a dire warning here, as the age of clinical genomics approaches with blinding speed: if we want to be able to convert masses of sequence data into useful clinical information we need to get much better at assigning function to new sequence variants, and we need to learn how to do it fast.
Update: An estimate of the cost of generating an exome sequence:
"The investigators say that the cost to determine the sequence of all genes in an individual for this project was approximately $150,000, but that this cost will likely decrease considerably in the future."
That seems way too high to me (I'd estimate a cost under $30,000, including labour and reagents), but I guess we're talking about work done six months to a year ago - costs have plummeted in the meantime.
Seems like evidence to me that sequencing alone doesn't hold all the answers. I think it's clear that we need to look long and hard at other factors that influence the system, like copy number, epigenetic silencing, etc.
We're also going to need to develop better models of gene interactions and get better at integrating functional data. We're now able to tell where the mutated genes are - now we need to figure out what the heck they're doing.
I think most geneticists (including those currently involved in large-scale sequencing projects) would agree that we need to go beyond simply generating sequence data. There are already many groups doing high-throughput functional annotation of (for example) protein-protein interactions and regulatory elements. These types of data already allow us to build rough models of the biological pathways that underpin disease, and these models will become increasingly complete and accurate over time.
However, I'm also a strong advocate of the "more sequencing" approach. Human genetic diversity provides a great experimental model for pinning down gene function; the more whole genome sequences we have associated with data on phenotype variation (e.g. medical records), the better equipped we will be to link sequence variation to disease risk. The best model system for humans is other humans.
American "Big Science" (for that matter, "Big Anything") can occasionally be incredibly short-sighted. Just remember President Nixon, who tried to eradicate cancer just by throwing some tens of billions of dollars at it. Or consider that the US government spent about $3 billion (according to some guesstimates) to come up with the "Human DNA".
Not only did they not have much idea of what to do with the "Human Genome", they actually had the demonstrably wrong ideas that 140,000 genes would be found and that the other 98.7% could be safely disregarded as "Junk DNA".
Neither was correct.
I fully agree with Daniel (and everybody fixated on the issue) that a large body of affordable full human DNA sequencing is needed.
However, suppose we already have them (almost, since Complete Genomics is ready with the $5,000 full genome). Then what?
I respectfully submit that by focusing only on the exons of the genes (actually, on much less than 1.3% of the genome, since within genes the introns are in general much longer than the exons), we will probably continue to miss much of the so-called "genome regulation diseases" (formerly "Junk DNA diseases").
In exons, at least, we know what to search for: for instance, point mutations that turn amino-acid codons into stop codons.
To effectively search in the vast seas of the intergenic regions, we must first understand at least the basics (principles) of genome regulation.
My attempt towards this is The Principle of Recursive Genome Function.