You Don't Miss Those 8,000 Genes, Do You?

Science moves forward by flow. One experiment leads to another. Observations accrue. What seem like side trips or even dead ends may bring a fuzzy picture further into focus. Yet science often seems as if it moves forward one bombshell at a time, marked by scientific papers and press conferences. I can't think of a bigger contrast between the bombshell illusion and the flowing reality of science than the day in 2000 when President Clinton announced the completion of the first draft of the human genome on the White House lawn. He declared it "an epoch-making triumph of science and reason." And yet, as James Shreeve described in his book The Genome War, the announcement did not mark any clear milestone, but represented more of a last-minute compromise between two rival genome-sequencing teams. After the press conference, everyone went back to perfecting the rough drafts, which turned out to be very rough indeed. In fact, they're still polishing it, years after many people may have gotten the impression they were actually finished.

When Craig Venter and his colleagues published their rough draft of the human genome in 2001, they identified 26,588 human genes. They then broke those genes down by their functions. Some were involved in building DNA, some in relaying signals, and so on. Remarkably, though, they classified 12,809 genes--almost half--as "molecular function unknown." Last week I wanted to know if those numbers still hold. I've been working on a book on Escherichia coli, and I wanted to contrast just how well scientists understand that microbe with just how poorly we understand ourselves (biologically, in this case). I wanted some numbers to make my case.

They weren't so easy to find. In 2003 some reports came out to the effect that the genome had shrunk down to 21,000 genes. But I couldn't turn up much news in the past four years. I wondered what sort of artificial milestone I would have to wait for in order to get some fresh numbers.

Fortunately there are now some rivals to the milestone model of science. There are web sites where you can observe works in progress, such as the human genome. One of those sites is called PANTHER. I contacted the top scientist behind it, Paul D. Thomas, with my question, and he sent me a link. When I clicked on the link, I got the pie chart I've posted here (click on the image to go to the original page if it's hard to read).

The pie shows that we're now down to just 18,308 genes. That's over 8,000 genes fewer than six years ago. Many sequences that once looked like full-fledged genes, capable of generating a protein, now don't make the grade. Some genes turned out to be pseudogenes--vestiges of genes that once worked but have been since wrecked by mutations. In other cases, DNA segments that appeared to be parts of separate genes have turned out to be part of the same gene.

Today scientists still don't know the function of 5,898 genes in the human genome. In other words, over the past six years about 7,000 genes either have been figured out or have vanished into the land of nevermind. That's progress, of a sort. But unknown genes still represent a major slice of the human genome, because the total number of genes has fallen as well. The blue slice in the pie above represents 32.2% of all our known genes. For all the work that has poured into the genome, for all the grand announcements, we still don't have the faintest idea of what about a third of our genes are for.
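
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It uses only the four counts quoted above (the 2001 draft and the current PANTHER chart); the variable names and the `share` helper are my own illustration, not anything PANTHER provides.

```python
# Gene counts quoted in this post; names are illustrative only.
genes_2001 = 26_588      # genes identified in the 2001 rough draft
unknown_2001 = 12_809    # of those, "molecular function unknown"
genes_2007 = 18_308      # genes in the current PANTHER chart
unknown_2007 = 5_898     # of those, function still unclassified


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# How many genes dropped out of the count between the two tallies.
genes_dropped = genes_2001 - genes_2007

# How many "unknown" genes were either explained or removed.
unknown_resolved = unknown_2001 - unknown_2007

# Fraction of the genome with unknown function, then and now.
share_2001 = share(unknown_2001, genes_2001)
share_2007 = share(unknown_2007, genes_2007)

print(f"Genes dropped from the count: {genes_dropped}")
print(f"Unknown genes resolved or removed: {unknown_resolved}")
print(f"Unknown share in 2001: {share_2001}%")
print(f"Unknown share now: {share_2007}%")
```

The point of laying it out this way is that the unknown slice shrinks much less in relative terms than in absolute terms, because the denominator has fallen too.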

Actually, we don't even know all that much about the "known" genes. A lot of the functions assigned to human genes actually come from research on other species. We share a common evolutionary history with mice and Drosophila flies and other organisms that scientists have studied carefully. We all descend from a common ancestor, but when our lineages diverged, those ancestral genes duplicated and diverged. Some disappeared and others took on new roles. It's possible now to group the genes from many species into families. Within those families scientists can group genes into sub-families. Genes from the same sub-family tend to do the same thing, even if they are found in different species. So PANTHER assigns human genes functions that have been established for genes from the same sub-family through careful experiments on other organisms. That's a good strategy, but the fact remains that few human genes have experimental evidence for their function in humans. In one study of 35,329 proteins, scientists estimated that only 2,784 met this gold standard.

That 35,329 figure may seem confusing, since we only have 18,308 genes. A single human gene can make more than one protein. Human genes come in pieces, separated by non-coding chunks of DNA, and those segments can be spliced together in different combinations. Scientists will discover many more splice variants. Each splice can have a significantly different function than other proteins produced by the same gene. This pie doesn't capture that dimension of our knowledge (or our ignorance).

And then there's the whole matter of all the other DNA that doesn't encode proteins (98.5% of the genome all told). A lot of it is most likely a mishmash of broken genes and viral DNA. It's possible to cut huge swaths of it out of a mouse's genome with no apparent ill effect. But there are also a lot of important players hiding in that wilderness--switches that proteins can use to turn genes on and off, sequences that do not give rise to proteins but rather RNA molecules that create their own control system for a cell. In all of these complications, scientists will probably find the answer to the question, "How do roughly the same number of genes encode such different kinds of animals?" Complexity isn't purely a matter of the number of genes you have. It's also how you use them.

Getting an update on the human genome was interesting in itself, but the way I got it was interesting as well. I did not have to follow the traditional procedure, waiting for a highly guarded paper to finally be published and reported on. The latest statistics on the human genome are out there now for anyone who cares to look at them. But in order to get at this information, you do need a fair amount of acumen. I would not have been able to create this pie chart without Thomas's help. Perhaps some science writers will become more like investigative political reporters who know how to sift through Federal election databases for the real news. If that gets us away from the illusion of the bombshell, it will be a good thing.

Perhaps some science writers will become more like investigative political reporters who know how to sift through Federal election databases for the real news. If that gets us away from the illusion of the bombshell, it will be a good thing.

Most interesting. I have some relevant training (1 science degree, 1 math degree), but I'm not a scientist nor a science writer. As an "informed lay reader," I've become skeptical (going on cynical) about the "bombshell" announcements and articles. If you ask yourself the question, "Is there less here than meets the eye?", it's usually not too hard to answer, "Yes."

Example - the ongoing Homo floresiensis (hobbit) story. Fascinating, but I can recall suspecting that the story when it first broke was far from as clear as was being implied, and that it would go on for some time before any consensus emerged. And I don't claim that such a conclusion took a lot of special insight.

Since that time, there's been the occasional smaller bombshell as disagreements emerged, but the most useful information I've come across has been from the minority group of "well-informed science writers/reporters" who can (as you suggest) dig out what's happening in between the cries of "Eureka!" and write the less flashy but more informative "update" articles.

Those people are not often the ones who write the initial "bombshell" articles, but they are the ones to read if you want to get behind the headline.

By Scott Belyea (not verified) on 19 Mar 2007 #permalink

Those people are not often the ones who write the initial "bombshell" articles, but they are the ones to read if you want to get behind the headline.

Oh, all right ... you are definitely one of them. :-)

By Scott Belyea (not verified) on 19 Mar 2007 #permalink

Are we seeing the lines blurring between scientists, science-writers, and lay-people? A few of us around the blogosphere have been arguing for more armchair bioinformaticians -- kind of like armchair astronomers, but gazing at genomic data on a computer screen rather than up at the sky through a telescope. Just like the armchair astronomers identify new celestial bodies, armchair bioinformaticians can dig through publicly available genome data and find interesting trends.

On a tangential note, I'm somewhat unimpressed with PANTHER. Just for giggles I searched for a gene I work on, the transcription factor MNT. It came up as "molecular function unclassified" even though you don't have to know anything more than its sequence to know that it's a DNA binding protein (it contains a basic helix-loop-helix leucine zipper motif). Not only that, but it's been recognized as a transcription factor for probably ten years, and the database apparently knows that it's related to a number of proteins of fairly well-defined function, Max and the Mxd group (which it calls "Mad"s even though the name changed a year or more ago). I say "apparently" because you need the latest whiz-bang flash or java or something just to make the site work properly; those of us who cannot afford to upgrade our computers every year find such design rather less than useful. (Pre-emptive: I know flash and java are not hardware; I also know that I cannot regularly upgrade to the latest versions of such things on this old Mac without causing horrible things to happen. I'm still using Firefox 1.0.something, because later versions won't run on this machine.)

Not impressed with PANTHER? I bet you don't know what's known about ALL human genes either. So, is PANTHER updateable with your input? If not, then clearly some wiki should hold it. You need only enter the stuff you know.

Don't feel like creating and maintaining your own wiki? Start a branch in wikipedia. Don't like their politics? Start a branch in conservapedia!

Disclaimer. I've never had an original idea. By the time I think of something new, someone else has already done it.

Perhaps some science writers will become more like investigative political reporters who know how to sift through Federal election databases for the real news. If that gets us away from the illusion of the bombshell, it will be a good thing.

Sounds like we need to do two complementary things:

(a) Make it much easier to do the sifting, and
(b) Hold the journalists who don't do their job accountable.

Implementing part (b) is left as an exercise for the interested reader.

Your link http://www.pantherdb.org/chart/summary/pantherChart.jsp?filterLevel=1&c… contains some filter (NP) that refers to RefSeq data.
From the summary of the RefSeq handbook (available on-line at NCBI)

The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa.

Thus, you may have missed genes for which no annotated counterparts are available. When you go to http://www.pantherdb.org/chart/summary/pantherChart.jsp?filterLevel=1&c… the displayed gene number is 25,431. This is close to ENSEMBL's human gene number of 21,662 and the number given by the VEGA consortium that manually annotates the human genome (32,235 genes).

Stephen, #5:

> Not impressed with PANTHER? I bet you don't
> know what's known about ALL human genes either.

Of course not -- all the more reason to expect the little I do know to be a subset of what PANTHER knows.

> So, is PANTHER updateable with your input?

It is, but I am not sure whether I can add a molecule without situating it in a defined pathway (which, for MNT, is not yet possible).

> If not, then clearly some wiki should hold it.

I don't think I agree -- PANTHER is a rather different sort of beast from Wikipedia, or any other wiki. My little squib of information about MNT is available to any human who cares to look through PubMed (for instance), and putting it on a wiki would probably not help anyone much. The point of PANTHER and similar efforts, as I understand them, is to create machine-readable databases so that computing power can be applied to the elucidation of patterns too complex and/or subtle for brain power to find in a reasonable time frame.

Much as I am enjoying the discussion about PANTHER (which I have used on and off) - I want to revert to another discussion thread from Zimmer's post. This is the idea of science writers becoming more like investigative reporters. There are of course some who do this already, but I agree that it is not overly common. It would truly be exciting if this were to become a bigger trend. In general, I think that it is becoming easier to get to the story behind the story in science, through blogs, through Open Access publications, through data archives, and through perhaps more detailed reporting.

I guess when I started in science my expectation was that all reporters would be "investigative" reporters. And then I saw that for a decent fraction of science news, the articles were basically reprints of press releases. Sure there are some reporters who look deeper, but not too often. I guess maybe this is why I love Science Friday (I listen to the podcasts on my bike commute) and the Science Times on Tuesdays.

Anyway - I look forward to 1000s of new Zimmer wanna-bes at all the major news outlets.

Well... it depends on what you mean by "gene" of course. Expanding beyond protein coding sequences, the number of genes will go up. But maybe "gene" is an outmoded concept anyway. If one wants to tie "gene" to "function" (as well as to stability, transmissibility, mutability), then things get dicey. For example, many polypeptides are only functional within the context of protein complexes or "molecular machines."

By sequencer lee (not verified) on 20 Mar 2007 #permalink

Very timely for me! I just showed the now-old "Cracking the Code" NOVA special in my freshman science/technology/society class. I'll be glad to have this new information for an update.

Thanks - this is just the kind of thing that keeps me reading Scienceblogs.

By John Monfries (not verified) on 21 Mar 2007 #permalink

Maybe "bombshell" is a bit hyperbolic (pardon the pun), but if you modeled the importance of a series of scientific ideas as an exponential decay function of when they occur ("diminishing returns"), I think that would still jibe with the insight that there do tend to be a few big ideas, although afterward there still remains lots of work of lesser importance (on a single-idea level -- cumulatively the importance of all the later work is nothing to sneeze at).

Interesting... so if sparc is right, the gene count actually hasn't changed much since 2003. Easy to see how CZ might have made the error, but it will be interesting to see if the 18,308 number now crops up elsewhere on the web or in the media!

It's another one of those problems that plagues science news -- the "Mythical Number," the oft-quoted number that someone came up with based on faulty info or reasoning. ("Eskimos have 100 words for snow," "we only use 5% of our brains," etc.) Once one starts to propagate, it's really hard to stamp out. And given the web's interconnectedness, I bet they get going more quickly than ever nowadays...

By Andrew W. (not verified) on 04 Mar 2008 #permalink