Can't find your disease gene? Just sequence them all...

A paper just published online in Nature Genetics describes a brute force approach to finding the genes underlying serious diseases in cases where traditional methods fall flat. While somewhat successful, the study also illustrates the paradoxical challenge of working with large-scale sequencing data: there are often too many possible disease variants, and it can be extremely difficult to work out which are actually causing the disease in question.

The authors looked at 208 families where multiple members suffered from mental retardation and where the family history was consistent with the underlying gene being carried on the X chromosome. In most cases the families weren't large enough to use linkage analysis to narrow down the location of the gene - in other words, the disease-causing mutation could be almost anywhere among the more than 800 genes scattered along this chromosome.

In these cases the traditional approaches of genetics break down - apart from screening the known genes involved in mental retardation and hoping for a lucky break, there's little that can be done to find the gene responsible. The researchers thus took advantage of automated large-scale DNA sequencing to simply analyse the protein-coding regions of nearly every gene on the X chromosome.

That's a total of one million DNA bases per patient - a particularly impressive figure given it was generated using traditional Sanger sequencing rather than one of the massively high-throughput second-generation sequencing platforms now available.
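That headline number is easy to sanity-check with a back-of-envelope calculation (my own round numbers, not figures from the paper): roughly 800 genes times an assumed average of about 1.25 kb of coding sequence per gene lands right around a million bases.

```python
# Back-of-envelope check of the sequencing target size per patient.
# Both numbers are assumed round figures, not values taken from the paper.
n_genes = 800          # approximate number of X-chromosome protein-coding genes targeted
avg_coding_bp = 1_250  # assumed average coding sequence per gene, in base pairs
print(f"~{n_genes * avg_coding_bp / 1e6:.1f} Mb of coding sequence per patient")  # ~1.0 Mb
```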

The researchers found many genetic variants that would be expected to disrupt gene function: almost 1,000 changed the predicted protein encoded by a gene, 22 introduced premature "stop" signals, 15 changed the reading frame and 13 were found in strongly evolutionarily conserved regions associated with RNA processing.
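To make those categories concrete, here's a minimal sketch (very much not the authors' annotation pipeline) of how a coding change gets binned as synonymous, missense, nonsense or frameshift. The codon table is truncated to just the entries used in the examples; a real classifier would use the full table and a proper transcript model.

```python
# Minimal, illustrative classifier for coding changes; truncated codon table.
CODON_TABLE = {
    "TGG": "W",  # tryptophan
    "TGA": "*",  # stop
    "CGA": "R",  # arginine
    "CAA": "Q",  # glutamine
}

def classify_coding_change(ref_codon, alt_codon):
    """Classify a coding change by its predicted effect on the protein."""
    if len(alt_codon) != len(ref_codon):
        shift = (len(alt_codon) - len(ref_codon)) % 3
        return "frameshift" if shift else "in-frame indel"
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense (premature stop)"
    return "missense (protein-altering)"

print(classify_coding_change("TGG", "TGA"))  # Trp -> stop: nonsense
print(classify_coding_change("CGA", "CAA"))  # Arg -> Gln: missense
print(classify_coding_change("TGG", "TG"))   # 1 bp deletion: frameshift
```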

Of the 42 variants most likely to cause disease (so-called "truncating" variants), 38 were found in only one family, and these tended to cluster together in specific genes - for instance, one gene contained five different rare, damaging mutations. However, many of these variants were found in both patients and their healthy male siblings, suggesting that they are not the cause of the mental retardation. These genes could represent subtle predisposing factors for mental retardation, but it's likely that most of them are simply genes that can be inactivated with little or no deleterious consequences for humans.
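That filtering step - dropping candidate variants that also turn up in unaffected male relatives - is simple to express in code. The sketch below is my own illustration, with made-up variant labels and individual IDs, not anything taken from the paper.

```python
# Sketch of a segregation filter for an X-linked condition: a rare candidate
# variant also carried by an unaffected male relative is unlikely to be causal.
# Data structure, variant names and individual IDs are invented for illustration.

def filter_by_segregation(family_variants):
    """Keep variants seen in affected males but absent from healthy male relatives.

    family_variants maps a variant label to a dict with two sets of carriers:
    "affected" and "unaffected_males".
    """
    return {
        label: carriers
        for label, carriers in family_variants.items()
        if carriers["affected"] and not carriers["unaffected_males"]
    }

example = {
    "GENE_A:p.Arg123X": {"affected": {"III-1", "III-3"}, "unaffected_males": set()},
    "GENE_B:p.Gln45X":  {"affected": {"III-1"},          "unaffected_males": {"III-2"}},
}
print(list(filter_by_segregation(example)))  # only GENE_A:p.Arg123X survives
```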

Overall, only nine genes showed strong evidence for disease-causing mutations. The researchers went on to sequence these genes in a further 914 mental retardation patients and over a thousand controls, but found only a handful of likely disease-causing mutations in the additional patients.
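The natural way to weigh that kind of follow-up result is to compare the frequency of rare, predicted-damaging variants in a candidate gene between cases and controls. Here's a hedged sketch of such a comparison using a simple 2x2 Fisher's exact test; the carrier counts are invented for illustration and are not figures from the study.

```python
# Illustrative gene-level case-control comparison: count carriers of rare,
# predicted-damaging variants and test the 2x2 table with Fisher's exact test.
from scipy.stats import fisher_exact

def carrier_burden_test(case_carriers, n_cases, control_carriers, n_controls):
    """Fisher's exact test on carriers vs non-carriers in cases and controls."""
    table = [
        [case_carriers, n_cases - case_carriers],
        [control_carriers, n_controls - control_carriers],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Hypothetical counts: 3 carriers among 914 follow-up patients, 0 among 1,000 controls.
odds_ratio, p_value = carrier_burden_test(3, 914, 0, 1000)
print(f"odds ratio = {odds_ratio}, one-sided p = {p_value:.3f}")
```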

Although the technical achievement is impressive, the picture from this survey is somewhat depressing (if not really surprising) for researchers interested in using large-scale sequencing to discover disease-causing variants. It's a clear demonstration that even examining the majority of protein-coding sequence will be insufficient to capture most of our nasty genetic secrets - many of these lurk deep in non-coding DNA, while a fair chunk of the remainder simply hides in the biological noise resulting from all of the other non-disease-causing variants in the genome. In this study it's likely that the researchers have actually uncovered a fair number of disease-causing mutations (for instance, among the almost 1,000 protein-altering variants) but are currently simply unable to distinguish them from benign polymorphisms.

What's the solution? More sequencing, for a start - digging deep into the non-coding portions of the genome, and also ensuring very accurate coverage of the protein-coding portions (in this study, on average just 75% of the targeted regions were actually successfully sequenced in any given individual). This is already entirely feasible due to the emergence of second-generation sequencing, and will become rapidly more affordable as sequencing costs drop. Already there are research groups around the world planning massive sequencing studies to identify rare mutations underlying severe diseases.
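The coverage bookkeeping behind that 75% figure is straightforward: for each individual, count how many targeted bases were covered at a usable read depth. The sketch below is purely illustrative - the depths and threshold are made up, and it comes out at 75% only because I built the toy data that way.

```python
# Illustrative coverage bookkeeping: given per-base read depths across the
# targeted regions for one individual, what fraction of target bases were
# actually sequenced to a usable depth? Depths and threshold are invented.

def callable_fraction(depths, min_depth=1):
    """Fraction of targeted bases covered at or above min_depth."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d >= min_depth) / len(depths)

toy_depths = [12, 8, 0, 5, 9, 0, 7, 11, 3, 0, 6, 10, 0, 4, 8, 13, 0, 9, 2, 7]
print(f"{callable_fraction(toy_depths):.0%} of target bases callable")  # 75%
```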

But sequencing won't be enough: we need much better methods for sifting out the truly function-altering genetic variants from the biological noise. This is already difficult enough for protein-coding regions (as this study demonstrates); we currently have virtually no way of picking out disease-causing variants in the remaining 98% of the genome. There's a clear need for developing highly accurate and comprehensive maps of the functional importance of each and every base in the human genome, using all of the tools at our disposal - something that will keep us geneticists busy long after we've run out of genomes to sequence.
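To make that last idea concrete: if such a base-by-base map of functional importance existed, variant prioritisation would largely reduce to looking up each candidate's position and ranking by score (evolutionary conservation is the crude stand-in we have today). The sketch below is speculative - the scores, positions and threshold are all invented, and no real annotation resource is being queried.

```python
# Speculative sketch of prioritising variants against a per-base
# functional-importance map. All scores, positions and the threshold are
# made up for illustration.

def prioritise_variants(variants, importance_map, min_score=0.8):
    """Return variants whose position scores at least min_score, highest first."""
    scored = [(importance_map.get((v["chrom"], v["pos"]), 0.0), v) for v in variants]
    return [v for score, v in sorted(scored, key=lambda pair: pair[0], reverse=True)
            if score >= min_score]

importance_map = {("X", 1000): 0.95, ("X", 2000): 0.10, ("X", 3000): 0.85}
variants = [{"chrom": "X", "pos": p, "id": f"var_{p}"} for p in (1000, 2000, 3000)]
print([v["id"] for v in prioritise_variants(variants, importance_map)])
# -> ['var_1000', 'var_3000']
```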

