A paper just published online in Nature Genetics describes a brute-force approach to finding the genes underlying serious diseases in cases where traditional methods fall flat. While somewhat successful, the study also illustrates the paradoxical challenge of working with large-scale sequencing data: there are often too many possible disease variants, and it can be extremely difficult to work out which are actually causing the disease in question.
The authors looked at 208 families where multiple members suffered from mental retardation and where the family history was consistent with the underlying gene being carried on the X chromosome. In most cases the families weren't large enough to use linkage analysis to narrow down the location of the gene - in other words, the disease-causing mutation could be almost anywhere among the more than 800 genes scattered along this chromosome.
In these cases the traditional approaches of genetics break down - apart from screening the known genes involved in mental retardation and hoping for a lucky break, there's little that can be done to find the gene responsible. The researchers thus took advantage of automated large-scale DNA sequencing to simply analyse the protein-coding regions of nearly every gene on the X chromosome.
The researchers found many genetic variants that would be expected to disrupt gene function: almost 1000 changed the predicted protein encoded by a gene, 22 introduced unusual "stop" signals, 15 changed the reading frame and 13 were found in strongly evolutionarily conserved regions associated with RNA processing.
Of the 42 variants most likely to cause disease (so-called "truncating" variants) 38 were found in only one family, and these tended to cluster together in specific genes - for instance, one gene contained 5 different rare, damaging mutations. However, many of these variants were found in both patients and their healthy male siblings, suggesting that they are not causative in mental retardation. These genes could represent subtle predisposing factors for mental retardation, but it's likely that most of them are simply genes that can be inactivated with little or no deleterious consequences for humans.
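The sibling check described above can be made concrete with a small sketch. This is only an illustration of the reasoning, not the authors' actual pipeline: the `Variant` record, the gene labels and the carrier counts are all invented for the example. The logic rests on the fact that males carry a single X chromosome, so a healthy brother who carries a variant is strong evidence against that variant being a fully penetrant cause of disease.

```python
from dataclasses import dataclass

# Hypothetical record for one X-chromosome variant seen in one family.
# All field names and values are illustrative assumptions, not study data.
@dataclass(frozen=True)
class Variant:
    gene: str                      # made-up gene label
    kind: str                      # "nonsense", "frameshift", "splice", "missense"
    affected_carriers: int         # affected males in the family carrying it
    unaffected_male_carriers: int  # healthy brothers carrying it

# Variant classes treated here as likely to truncate the protein.
TRUNCATING = {"nonsense", "frameshift", "splice"}

def segregates_with_disease(v: Variant) -> bool:
    """True if the variant is seen in patients but in no healthy male sibling."""
    return v.affected_carriers > 0 and v.unaffected_male_carriers == 0

def candidate_variants(variants):
    """Keep truncating variants that also segregate with the disease."""
    return [v for v in variants
            if v.kind in TRUNCATING and segregates_with_disease(v)]

variants = [
    Variant("GENE_A", "nonsense", 2, 0),    # truncating, patients only: kept
    Variant("GENE_B", "frameshift", 2, 1),  # also in a healthy brother: dropped
    Variant("GENE_C", "missense", 2, 0),    # not truncating: dropped by this filter
]

print([v.gene for v in candidate_variants(variants)])
```

Note that the missense variant in the toy data is discarded even though it segregates perfectly with disease. A filter this strict cannot tell whether such a variant is harmful or benign, so it sets it aside, and some real disease-causing mutations will inevitably be set aside with it.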
Overall, only nine genes showed strong evidence for disease-causing mutations. The researchers went on to sequence these genes in a further 914 mental retardation patients and over a thousand controls, but found only a handful of likely disease-causing mutations in these genes in other patients.
Although the technical achievement is impressive, the picture from this survey is somewhat depressing (if not really surprising) for researchers interested in using large-scale sequencing to discover disease-causing variants. It's a clear demonstration that even examining the majority of protein-coding sequence will be insufficient to capture most of our nasty genetic secrets - many of these lurk deep in non-coding DNA, while a fair chunk of the remainder simply hides in the biological noise resulting from all of the other non-disease-causing variants in the genome. In this study it's likely that the researchers have actually uncovered a fair number of disease-causing mutations (for instance, among the almost 1000 protein-altering variants) but are currently simply unable to distinguish them from benign polymorphisms.
What's the solution? More sequencing, for a start - digging deep into the non-coding portions of the genome, and also ensuring very accurate coverage of the protein-coding portions (in this study an average of just 75% of the targeted regions were actually successfully sequenced in any given individual). This is already entirely feasible due to the emergence of second-generation sequencing, and will become rapidly more affordable as sequencing costs drop. Already there are research groups around the world planning massive sequencing studies to identify rare mutations underlying severe diseases.
But sequencing won't be enough: we need much better methods for sifting out the truly function-altering genetic variants from the biological noise. This is already difficult enough for protein-coding regions (as this study demonstrates); we currently have virtually no way of picking out disease-causing variants in the remaining 98% of the genome. There's a clear need for developing highly accurate and comprehensive maps of the functional importance of each and every base in the human genome, using all of the tools at our disposal - something that will keep us geneticists busy long after we've run out of genomes to sequence.


Comments
"A program developed by Cornell researchers deduced the natural laws without a shred of knowledge about physics or geometry. The research is being heralded as a potential breakthrough for science in the Petabyte Age, where computers try to find regularities in massive datasets that are too big and complex for the human mind."
http://blog.wired.com/wiredscience/2009/04/newtonai.html
Programs like these are in their infancy but they are enjoying some successes. It isn't just genetics that is drowning in data.
Posted by: Kevin | April 21, 2009 9:04 AM
Did you read about the Incidentalome?
http://jama.ama-assn.org/cgi/content/extract/296/2/212
Noise is the big problem here. And will continue to persist for decades.
-Steve
Posted by: Steven Murphy MD | April 21, 2009 9:15 AM