There was a time not that long ago when sequencing a single gene would be hailed as a scientific milestone. But then came a series of breakthroughs that sped up the process: clever ideas for how to cut up genes and rapidly identify the fragments, the design of robots that could do this work twenty-four hours a day, and powerful computers programmed to make sense of the results. Instead of single genes, entire genomes began to be sequenced. This year marks the tenth anniversary of the publication of the first complete draft of the entire genome of a free-living species (a nasty little microbe called Haemophilus influenzae). Since then, hundreds of genomes have emerged, from flies, mice, humans, and many more, each made up of thousands of genes. More individual genes have been sequenced from the DNA of thousands of other species. In August, an international consortium of databases announced that they now had 100 billion "letters" from the genes of 165,000 different species.
But this data glut has created a new problem. Scientists don't know what many of the genes are for.
The classic method for figuring out what a gene is for is good old benchwork. Scientists use the gene's code to generate a protein and then figure out what sort of chemical tricks the protein can perform. Perhaps it's good at slicing some other particular protein in half, or sticking two other proteins together. It's not easy to tackle this question with brute force, since a mystery protein may interact with any one of the thousands of other proteins in an organism. One way scientists can narrow down their search is by seeing what happens to organisms if they take out the particular gene. The organisms may suddenly become unable to digest their favorite food or withstand heat, or show some other change that can serve as a clue.
Even today, though, these experiments still demand a lot of time, in large part because they're still too complex for robots and computers. Even when it comes to E. coli, a bacterium that thousands of scientists have studied for decades, the functions of a thousand of its genes remain unknown.
This dilemma has helped give rise to a new kind of science called bioinformatics. It's an exciting field, despite its woefully dull name. Its mission is to use computers to help make sense of molecular biology--in this case, by traveling through vast oceans of online information in search of clues to how genes work.
One of the most reliable ways to find out what a gene is for is to find another gene with a very similar sequence. The human genes for hemoglobin and the chimpanzee genes for hemoglobin are a case in point. Since our ancestors diverged about six million years ago, the genes in each lineage have mutated a little, but not much. The proteins they produce still have a similar structure, which allows them to do the same thing: ferry oxygen through the bloodstream. So if you happen to be trolling through the genome of a gorilla--another close ape relative--and discover a gene that's very similar to chimpanzee and human hemoglobins, you've got good reason to think that you've found a gorilla hemoglobin gene.
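For the curious, here is a minimal sketch of what "very similar" can mean in practice: the percentage of positions at which two sequences carry the same letter. It assumes the sequences have already been lined up to the same length, which real programs have to work out for themselves, and the two fragments are invented for illustration rather than taken from any real gene. The function name `percent_identity` is my own.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Invented toy fragments (assumption: not real gene sequence).
known_gene = "GATTACAGATTACA"
mystery_gene = "GATTACAGCTTACA"  # differs from known_gene at one position

score = percent_identity(known_gene, mystery_gene)
print(f"{score:.1f}% identical")
```

A high score between a mystery gene and a well-studied one is the kind of clue that suggests the two do the same job.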
Scientists sometimes use this same method to match different genes in the same genome. There isn't just one hemoglobin gene in humans but seven. They carry out slightly different functions, some carrying oxygen in the fetus, for example, and others in the adult. This gene family, as it's known, is the result of ancient mistakes. From time to time, the cellular machinery for copying genes accidentally creates a second copy of a gene. Scientists have several lines of evidence for this. Some people carry around extra copies of genes not found in other people. Scientists have also tracked gene duplication in laboratory experiments with bacteria and other organisms.
In many cases, these extra genes offer no benefit and disappear over the generations. But in some cases, extra genes appear to provide an evolutionary advantage. They may mutate until they take on new functions, and gradually spread through an entire species. Round after round of gene duplication can turn a single gene into an entire family of genes. Knowing that genes come in families means that if you find a human gene that looks like hemoglobin genes, it's a fair guess that it does much the same thing as they do.
This method works pretty well, and bioinformaticists (please! find a better name!) have written a number of programs to search databases for good matches between genes. But these programs tend to pick the low-hanging fruit: they are good at recognizing relatively easy matches and not so good at identifying more distant cousins. Over time, related genes can mutate at different rates, which can make it difficult to recognize their relationship simply by eyeballing them side by side. Another hazard is the way a gene can be "borrowed" for a new function. For example, snake venom genes turn out to have evolved from families of genes that carry out very different functions in the heart, liver, and other organs. These sorts of evolutionary events can make it hard for simple gene-matching to yield clues to what a new gene is for.
To improve their hunt for the function of new genes, bioinformaticists are building new programs. One of the newest, called SIFTER, was designed by a team of computer scientists and biologists at UC Berkeley. They outline some of their early results in the October issue of PLOS Computational Biology (the paper is open access). SIFTER is different from previous programs in that it relies on a detailed understanding of the evolutionary history of a gene. As a result, it offers significantly better results.
To demonstrate SIFTER's powers of prediction, the researchers tested it on well-studied families of genes--ones that contained a number of genes for which there was very good experimental evidence for their functions. They used SIFTER to come up with hypotheses about the function of the genes, and then turned to the results of experiments on those genes to see if the hypotheses were right.
Here's how a typical trial of SIFTER went. The researchers examined the family of (big breath) Adenosine-5'-Monophosphate/Adenosine Deaminase genes. Scientists have identified 128 genes in this family, in mammals, insects, fungi, protozoans, and bacteria. With careful experiments, scientists have figured out what 33 of these genes do. The genes produce proteins that generally hack off a particular part of various molecules. In some cases, they help produce nitrogen compounds we need for metabolism, while in other cases they help change the information encoded in genes as it is translated into proteins. In still other cases they have acquired an extra segment of DNA that allows them to help stimulate growth.
The SIFTER team first reconstructed the evolutionary tree of this gene family, calculating how all 128 genes are related to one another. The tree shows how an ancestral gene that existed in microbes billions of years ago was passed down to different lineages, duplicating and mutating along the way. The researchers then gave SIFTER the experimental results from just five of the 128 genes in the family. The program used this information to infer how the function of the genes evolved over time. That insight then allowed it to come up with hypotheses about what the other 123 genes in the family do.
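To give a flavor of how a tree can carry information from a few known genes to many unknown ones, here is a deliberately crude sketch. It is not SIFTER's method: SIFTER uses Bayesian statistics over a model of how function evolves, while this toy simply gives each unannotated gene the function of its nearest annotated relative on the tree. The tree, gene names, branch lengths, and function labels are all invented for illustration.

```python
# Hypothetical toy tree: child -> (parent, branch length). All names invented.
PARENT = {
    "geneA": ("node1", 0.1),
    "geneB": ("node1", 0.2),
    "geneC": ("node2", 0.3),
    "geneD": ("node2", 0.1),
    "node1": ("root", 0.5),
    "node2": ("root", 0.4),
}


def path_to_root(leaf):
    """Map the leaf and each of its ancestors to its distance from the leaf."""
    distances = {leaf: 0.0}
    node, total = leaf, 0.0
    while node in PARENT:
        parent, length = PARENT[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def tree_distance(a, b):
    """Sum of branch lengths along the path between two nodes."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # The lowest common ancestor gives the smallest combined distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def predict(gene, known):
    """Borrow the function of the closest annotated gene on the tree."""
    nearest = min(known, key=lambda k: tree_distance(gene, k))
    return known[nearest]


# Placeholder annotations standing in for experimental results.
known = {"geneA": "deaminase", "geneC": "growth factor"}
predictions = {g: predict(g, known) for g in ("geneB", "geneD")}
print(predictions)
```

In this toy tree, geneB sits next to geneA and geneD sits next to geneC, so each inherits its neighbor's label. A real method has to cope with the cases where the nearest relative has changed jobs, which is exactly where a model of evolution earns its keep.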
Aside from the 5 genes whose function the researchers had given SIFTER, there are 28 with good experimental evidence. The scientists compared the real functions of these genes to SIFTER's guesses. It got 27 out of 28 right.
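The arithmetic behind that score is simple enough to check by hand, but here it is spelled out as a short sketch. The numbers come straight from the trial described above; the variable names are my own.

```python
characterized = 33      # genes with good experimental evidence
given_to_program = 5    # genes whose functions SIFTER was told
correct = 27            # held-out genes SIFTER predicted correctly

held_out = characterized - given_to_program  # genes left to test against
accuracy = correct / held_out

print(f"{correct}/{held_out} = {accuracy:.1%}")
```

The result rounds to 96 percent.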
SIFTER's 96% accuracy rate is significantly better than other programs that don't take evolution so carefully into consideration. Still, the Berkeley team cautions that they have more work to do. The statistics that the program uses (Bayesian probability) get harder to use as the range of possible functions gets bigger. What's more, the model of evolution that it relies on is fairly simple compared to what biologists now understand about how evolution works. But these aren't insurmountable problems. They're the stuff to expect in SIFTER 2.0 or some other future upgrade.
Those who claim to have a legitimate alternative to evolution might want to try to match SIFTER. They could take the basic principles of whatever they advocate and use them to come up with a mathematical method for comparing genes. No stealing any SIFTER code allowed--this has to be original work that doesn't borrow from evolutionary theory.
They could then use their method to compare the 128 genes of the Adenosine-5'-Monophosphate/Adenosine Deaminase family. Next, they could take the functions of five of the genes, and use that information to predict how the other 123 genes work. And then they could see how good their predictions were by looking at the other 28 genes for which there's good experimental evidence about their function.
All the data to run this test is available for free online, so there's no excuse for these antievolutionists not to take the test. Would they match SIFTER's score of 96%? Would they do better than random? I doubt we'll ever find out. Those who attack evolution these days aren't much for specific predictions of the sort SIFTER makes, despite the mathematical jargon they like to use. Until they can meet the SIFTER challenge, don't expect most scientists to take them very seriously.
Identifying the functions of genes is important work. Scientists need to know how genes work to figure out the causes of diseases and figure out how to engineer microbes to produce insulin and other important molecules. The future of medicine and biotech, it appears, lies in life's distant past.
Update Monday 10:30 am: John Wilkins says that bioinformatician is the proper term, although no improvement. I then googled both terms and found tens of thousands of hits for both (although bioinformatician has twice as many as bioinformaticist). Is there an authority we can turn to? And can it try to come up with a better name? Gene voyagers? Matrix masters?
Thank you. Waaaay cool!
Very nice. FYI, the professional name is "bioinformaticians", which is absolutely no improvement.
This supports the use of phylogenies as inductive tools, rather than as hypotheses (which of course they also are) - if you know the properties of two branches of a clade, then you can infer with high probability for any deep property (i.e., not just a specific adaptation) what the properties of all the intervening branches are. Contrast this with the pre-phylogenetic approach to classification based on grades - there was little inference apart from the characters used to create the grade.
Carl, I desperately need someone to explain in nontech talk what a microarray is and how it works. Over to you...
RE: John Wilkins
Funny you should mention microarrays - i'm working on a graduate genetics lecture, and i'm doing the array slides right now. If you're interested, i can email them to you with a brief description when i'm done. You can email me at jrtimmer-at-gmail.com if you're interested.
Otherwise, sifter sounds great. I wonder how well it copes with non-biochemical families, like transcription factors or ECM components, which don't have such straightforward functions and don't exist in bacteria.
excellent peek into a bioinformatician's :) world!
Microarrays, simplistically, are chips that have a spot for each gene (the gene is treated with a fluorescent dye) and use a laser and light intensity to measure whether a particular gene is active in a test condition. Light intensity varies depending on whether a gene is overactive ("overexpressed"), underexpressed, or not affected. There are different chips made by different companies with different technologies, but the technology is based on this basic principle.
Hey, how about biominers? That's what they do--digging through gene sequences for nuggets of information.
It would have to be pronounced differently than bio-miners, to have a ring of authority...
How 'bout BiOminers? Accent on the "O", soft "i".
I think it'll catch on! ;)
Isn't "what is a gene for" really "what is this gene for in the context provided by all the other genes"? There is a similar context for the operation of a human gene and its chimpanzee analog, again explained by evolution. Or does the protein that a gene encodes really have a function independent of context?
At the risk of detracting from the actual content of this interesting article, I wonder what the ID/Creationist crowd would make of a post like this. It says nothing about the "debate" over evolution/creationism, and yet it is impossible to make sense of this post outside of an acceptance of evolutionary theory. When scientists say that evolution is the central concept underlying all modern biology, this is the sort of thing they are talking about.
Perhaps the good folks in Dover, PA might read a few more articles like this, and ask themselves if a creationist account of life would adequately prepare their students for this type of study. Can you possibly talk about the similarity between ape and human gene lineages if you believe this similarity is simply a coincidence, and not due to the fact of a common ape/human ancestor? How can you cure a disease if you believe the immune system was created by a miracle?
Reading these articles might also remind the Dover School Board that scientists are not trying to undermine their faith by insisting that evolution be taught in high school. It would be hard to read this article and think Carl was attempting to replace your religion. Science is about understanding genes and atoms, not the meaning of life.
I guess I would have gone with the unlovely "computational biology", on analogy with "computational linguistics". Then, of course, a practitioner is a computational biologist. At least this proposal clarifies that the field is part of biology, and that practitioners are a kind of biologist.