New Stuff from Genome Research

Looks like this season's lecture series has started.

Yesterday evening I saw a talk by Eric Lander, head of the Broad Institute. Now normally I do not blog about my results and I do not blog about what I hear at seminars. It just doesn't feel right. Scientists work very hard at obtaining results and I don't want to start telling the world about their preliminary data. But I can give you some "factoids" from the talk ... very interesting stuff. (In addition, it would be very hard to scoop anything done at the Broad.)

He started off his talk with the neatest analysis of the 20th century. Gregor Mendel discovered the laws of heredity and then was ignored and thus "dropped out" of science. His discovery, the nature of genetic information, was rediscovered in 1900 by three independent groups. The rest of the 20th century can be divided into four quarters. The first quarter was devoted to figuring out that genetic information lay in the chromosomes. The second quarter was devoted to the nature of the molecule that carried the information. This culminated in 1953 with the Watson-Crick DNA structure. The third quarter was devoted to how DNA information was converted into protein (DNA=>RNA=>protein). The last quarter was devoted to figuring out what all the major genes are and what their roles are, and culminated at the end of the 20th century with the completion of the human genome project. It's quite amazing to think about the development of biology in these terms.

Other interesting tidbits from his talk:
- The latest count for the human genome ... (drum roll) ... 19,300 genes. How far down can the count go? There is good evidence that more than 18,000 of these are translated into proteins, so we're getting close to the real number.
- About 5% of the genome consists of highly conserved nuclear sequences. But here is the kicker: only about one third of these are in protein-coding regions; the rest are of unknown function. Some may be DNA regulatory elements, some may encode RNAs of unknown function. In one example, many of these elements lie within the intronic sequences of a gene.
- Most of these highly conserved DNA regions cluster in gene-poor areas of the chromosome. Well, what genes happen to be there? Genes that control early development.
- By the end of the year, the genomes of 24 mammal species will be completed! (Yes, incredible, considering that at the start of the genome project, it was estimated that the human + mouse genomes would be done by 2010.)
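
For anyone who wants to see the conserved-sequence numbers spelled out, here is a quick back-of-the-envelope sketch. It only uses the two figures quoted above (about 5% conserved, about one third of that in protein-coding regions); the genome size of roughly 3.1 billion base pairs is my own assumption and was not part of the talk.

```python
# Figures quoted in the tidbits above (approximate, as heard at the talk)
CONSERVED_FRACTION = 0.05   # ~5% of the genome is highly conserved
CODING_SHARE = 1 / 3        # ~one third of that lies in protein-coding regions

# Assumption (not from the talk): haploid human genome of roughly 3.1 billion bp
GENOME_SIZE_BP = 3.1e9


def split_conserved(conserved_fraction, coding_share):
    """Split the conserved fraction of the genome into coding and non-coding parts."""
    coding = conserved_fraction * coding_share
    noncoding = conserved_fraction - coding
    return coding, noncoding


coding, noncoding = split_conserved(CONSERVED_FRACTION, CODING_SHARE)

print(f"conserved and protein-coding: {coding:.1%} of the genome")
print(f"conserved, function unknown:  {noncoding:.1%} of the genome")
print(f"roughly {noncoding * GENOME_SIZE_BP / 1e6:.0f} Mb of conserved non-coding sequence (assumed genome size)")
```

In other words, roughly 1.7% of the genome would be conserved coding sequence and roughly 3.3% conserved sequence of unknown function, about twice as much.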

(OK this next one is from a David Spector talk that I attended Monday, but very interesting.)
- Only 30% of transcriptionally active DNA contains open reading frames (i.e. encodes proteins). A full 70% of DNA is converted to non-translationally active RNA ... currently the function for most of these RNA species is unknown.

Getting back to Lander's talk, there was so much more (specifics were impressive but omitted ... you can read about them when they are published):
- Figuring out all the variation in the human population (variation or single nucleotide polymorphisms, SNPs).
- Correlating SNPs to find ancestral segments of DNA - or as his website states, "referred to as blocks of linkage disequilibrium, or LD". Also known as haplotypes - the map of these haplotypes is the much-talked-about HapMap.
- Genotyping individuals on a DNA chip that measures 500,000 SNPs/LDs.
- Correlating SNPs/LDs with disease.
- Correlating RNA expression levels between disease states and drug treatment.

The list of things going on at the Broad goes on ... In summary, a great talk.


Damn ... I missed it.

By Acme Scientist (not verified) on 15 Sep 2006 #permalink

24 species? Hmm, seems like you have forgotten everything besides mammalian cells. I imagine that Lander had the good sense to at least say 24 mammalian genomes. You cell biologists are all the same... http://cmr.tigr.org currently lists 305 Bacterial genomes, an untold number of yeast/fungi, parasites, plants, insects, the list goes on.

Alex says,

Only 30% of transcriptionally active DNA contains open reading frames (i.e. encodes proteins). A full 70% of DNA is converted to non-translationally active RNA ... currently the function for most of these RNA species is unknown.

I'm not sure what data was presented but I suspect it was based on so-called "expressed sequence tags" (ESTs). If so, the evidence shows that a good chunk of the genome is covered by rare ESTs that match regions that are well outside of the boundaries of identified genes.
The question we need to address is whether these ESTs actually represent transcriptionally active regions of the genome or whether they are artifacts. It's not a good idea to just assume that all ESTs identify real transcripts that have a function. There are too many other possibilities such as accidental transcription or even DNA contamination.
For people who are interested in transcription this has to be one of the most exciting controversies in the entire field. Alex, what's your opinion? Do you think all ESTs represent bona fide transcripts?

By Larry Moran (not verified) on 15 Sep 2006 #permalink

Sorry Bil, 24 mammals. (In fact he said "in two more days we'll have the horse completed.")

Larry,

I'm not sure where Spector got his data from. (Someone in the lab thought he even said 70% of the entire genome is transcribed!) As for the EST data, I tend to believe that most of it is derived from RNA (i.e. it's real), although I would trust David Spector's opinion more than my own and seeing that we are both cell biologists ...

What are your thoughts on the matter?

Alex, I don't think that's what Larry was getting at. It's possible that 70% of the genome is transcribed. It's also possible that 70% of transcribed sequences are untranslated. But what fraction of transcribed sequences are functional? In other words, what fraction is due to mistakes by the cell (not mistakes in identifying ESTs)? How much of the transcription that occurs in a given nucleus results in a transcript that has no function? This is a very interesting question.

RPM,

Yeah I know, just answering the "just DNA contamination" part. As for "are they translated", most of the 70% does not contain real ORFs or contains dead ORFs (as far as I know), but I would have to see the methodology to determine how Lander et al. are making the statement "these are not translated". But my understanding is that gene death is much more common than we once thought and that many ORFs are non-functional but we just haven't turned them off completely. Also, RNAs are doing more and more on their own (thus my comment "the function is unknown"). There seem to be all these "weird" RNAs popping up that have no real ORF but must play some role in cellular metabolism. My hunch is that we are more of an RNA beast than we currently realize. I have also confirmed from others that Spector did say 70% of the genome is transcribed. If I get around to it I'll email him to ask him where he got that.

If any of you know, I'd be glad to find out.

Hum, I've also heard that 70% of the whole genome figure, but can't remember where.

I liked the part about the key ideas guiding the different quarters of the 20th century... Sometimes we have the feeling that things are accelerating, and sure, the accumulation of data is. So these equivalent time spans for roughly comparable successive steps of understanding help to put the whole story in better perspective.

And a natural question then is: what about the next two decades? I guess some will say that now it'll be about applying the knowledge: meddling with molecular biology. Others may be more cautious and say that we still need a couple of decades just to understand the data we have.

I'd bet on another one (which doesn't exclude the last), along the lines of Alex's last comment. I guess we will find out that we have been working with the wrong paradigm, and that cells are best understood as RNA communities. That's how they started, and that's what they still are. RNAs use proteins as tools, and DNA as a library, of course. But they keep the wheel in their little multifunctional hands. Ribozymes, microRNAs? Just the tip of the iceberg.

Other guesses? What will be the comments in the blogs of 2025?

[I'm a bit sleepy; please excuse any Spanglish above]

apalazzo asks,

As for the EST data, I tend to believe that most of it is derived from RNA (i.e. it's real), although I would trust David Spector's opinion more than my own and seeing that we are both cell biologists ...

What are your thoughts on the matter?

I've been looking at the so-called "evidence" for alternative splicing. Most of it is based on EST data and it doesn't make any sense at all. In those cases where we're dealing with well-studied genes we can easily reject all of the predicted alternative splicing. It's all artifact based on EST data that is unreliable.
In this case most of the ESTs seem to be derived from real RNA that results from incorrect processing but there are also examples that suggest accidental transcription. The ESTs derived from non-coding strands could be due to DNA contamination.
If this EST data is unreliable then what about the data in genes that we don't know much about? And what about EST data in the rest of the genome? There's only one logical conclusion and that's to admit that the EST data is, at the very least, questionable. You can't just assume that an EST identifies a real transcript. You have to prove it.
Fifteen years ago we were told that ESTs would show us where the protein-coding genes were. That turned out to be spectacularly wrong by an order of magnitude. Now we're being told that species with large genomes have evolved entirely new mechanisms of gene regulation involving little bits of RNA that can only be detected in EST libraries. This sounds an awful lot like special pleading by people who have invested a great deal of effort in ESTs. It's time to step back and ask ourselves whether the data is good. It's time to start thinking about the difference between good science and not-so-good science.

By Larry Moran (not verified) on 16 Sep 2006 #permalink

In this case most of the ESTs seem to be derived from real RNA that results from incorrect processing but there are also examples that suggest accidental transcription.

I agree that there are probably lots of ESTs from dead genes, uncommon splicing events and other non-functional genomic areas. I wonder how often that happens in a typical cell. Has anyone ever tried to find out how often mistakes occur? Or in other words, how sloppy are cells when it comes to transcription? In a way sloppiness can promote experimentation (i.e. promote evolvability). There is also a big problem in trying to prove a negative, as in "this EST is obviously a mistake". Having said that, there are clearly many bits of RNA that have no ORF but play some biological role. How many of these "special" RNAs there are, I have no clue, but it could be quite a few.

Perhaps this is the nature of the next "step", due in 2025: determining that we are actually mostly an RNA beast. Many studies have indicated that poly-adenylated RNAs (i.e. mRNAs) represent only 10% of all RNA in a typical cell. Of course there is a ton of tRNA and rRNA, but now there are all these questions popping up about other RNAs. At least the small RNAs (miRNAs, piRNAs) I think are very real. We'll have to see if the rest is all junk or if many of these RNAs are vital to cellular metabolism. My guess is that it'll be somewhere in between. (Or perhaps these RNAs represent a backup genome, as studies in plant germline cells indicate???)