I finally read the huge Nature paper that everyone has been talking about, from the ENCODE project (the Encyclopedia of DNA Elements). ENCODE is a large-scale concerted effort whose goal is to understand how the genome is used, maintained and conserved. In other words, what parts of the genome get transcribed into RNA, what do all these transcripts code for, where are all the transcriptional start sites, what parts control gene expression (RNA production), how are histone modifications distributed across the genome, how are various DNA binding proteins distributed across the genome, what parts of the genome are conserved across mammals and finally how do these various properties correlate (for example, are all the conserved bits of the genome the same parts that are translated into protein?)
In the pilot study, the ENCODE group investigated 1% of the genome. The 1% was not a contiguous sequence but made up of many small (but in fact large by any scale) bits. Half of the genomic fragments were chosen at random, half chosen because of biological significance.
So what is the main result?
Well, although most in the science media (and in pseudoscientific circles) have claimed that the results are surprising, in actual fact most of the significant results are really validations of little bits of data that have been floating around scientific circles for a couple of years. So what's up with the genome?
Now I'll talk about the results in brief and in a different way than they were presented in the paper. I'll also skip over the DNA replication results.
1) 70-90% of the genome is transcribed into RNA. This observation basically validates previous findings. I've written about this issue before, here, here, here and here. Most of the genome is represented in primary transcripts, that is RNA coming right off of RNA polymerase.
2) When looking for transcriptional start sites (i.e. places where RNA polymerase engages the DNA and begins transcribing RNA) the ENCODE team found that there are about 10 times as many start sites as there are annotated genes. These transcriptional start sites are loaded with RNA polymerase, various transcription factors (such as TATA binding factor) and chromatin remodelling proteins (factors that move nucleosomes and allow the DNA to take an "open" conformation). These start sites correlate very nicely with histone modifications and DNase sensitivity. So the start sites are places that carry the correct histone modifications and bind the correct transcription initiation factors and RNA polymerase. That's very good. But why all the extra transcriptional start sites? (More on that later.)
3) When isolating long-lasting RNAs with the proper cap and poly-A tails, these transcripts originated from 5% of the representative portion of the genome that was analyzed. Remember that known genes represent ~2% of the genome. Indeed, of the different types of transcripts, 40% of the transcripts found came from parts of the genome that code for known proteins, another 40% came from intronic sequences (these are bits that are usually spliced out of pre-mRNAs) and another 20% came from portions of the genome that do not code for any known gene. So the question is, what are these extra RNA bits?
Now just to remind you, there are many types of functional RNAs in cells: some code for proteins, some act as enzymes (such as ribosomal RNA) and some regulate the translation and stability of mRNAs (miRNAs being a nice example). For a small list of RNA types including all these non-coding RNAs (ncRNAs) that have some function, see this post.
Well, of these extra RNAs, only 2% code for protein. Only a small fraction of the RNAs have some sort of secondary structure (indicating that the RNA itself may act as an enzyme or have some biological purpose). And the rest (i.e. the majority)? Now a small fraction of 1% of the genome can still be quite a bit, from the paper:
Many known ncRNAs are characterized by a well-defined RNA secondary structure. We applied two de novo ncRNA prediction algorithms--EvoFold and RNAz--to predict structured ncRNAs (as well as functional structures in mRNAs) using the multi-species sequence alignments (see below, Supplementary Information section 2.11 and ref. 26). Using a sensitivity threshold capable of detecting all known miRNAs and snoRNAs, we identified 4,986 and 3,707 candidate ncRNA loci with EvoFold and RNAz, respectively. Only 268 loci (5% and 7%, respectively) were found with both programs, representing a 1.6-fold enrichment over that expected by chance; the lack of more extensive overlap is due to the two programs having optimal sensitivity at different levels of GC content and conservation. We experimentally examined 50 of these targets using RACE/tiling-array analysis for brain and testis tissues (see Supplementary Information sections 2.11 and 2.9.3); the predictions were validated at a 56%, 65%, and 63% rate for Evofold, RNAz and dual predictions, respectively.
Wow, and this is in only 1% of the genome ... multiply all these numbers by 100 and that is quite a bit of potential functional ncRNA! We may be more of an RNA beast than is currently appreciated.
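To make the "multiply by 100" arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The three locus counts are the ones quoted from the paper above; the factor of 100 is only my naive assumption that the 1% surveyed is representative of the whole genome (it probably isn't, since half of the regions were hand picked), and the variable and function names are made up for illustration.

```python
# Candidate ncRNA loci reported in the quoted passage (pilot regions, ~1% of the genome).
evofold_loci = 4986
rnaz_loci = 3707
shared_loci = 268  # loci predicted by both programs

# Share of each program's predictions that the other program also found.
evofold_shared_pct = 100 * shared_loci / evofold_loci
rnaz_shared_pct = 100 * shared_loci / rnaz_loci

# Distinct loci predicted by at least one program (inclusion-exclusion).
distinct_loci = evofold_loci + rnaz_loci - shared_loci

# ASSUMPTION: naive linear scaling from 1% of the genome to 100%.
# This is an illustration only, not an estimate made in the paper.
SCALE = 100


def extrapolate(count, scale=SCALE):
    """Scale a count from the pilot regions to the whole genome, linearly."""
    return count * scale


print(f"EvoFold loci also found by RNAz: {evofold_shared_pct:.1f}%")
print(f"RNAz loci also found by EvoFold: {rnaz_shared_pct:.1f}%")
print(f"Distinct candidate loci in the pilot regions: {distinct_loci}")
print(f"Naive genome-wide count (either program): {extrapolate(distinct_loci)}")
print(f"Naive genome-wide count (both programs):  {extrapolate(shared_loci)}")
```

Keep in mind that these are predictions, and only a bit more than half of the 50 that were tested experimentally were validated, so the scaled-up numbers are an upper bound on candidates rather than a count of real functional ncRNAs.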
And the rest of the noise (i.e. the majority)? No idea. What is really puzzling is that these RNAs originate from portions of the genome that are not evolutionarily conserved in mammals. Here is a graph of the percent conservation from the paper:

CDs: Coding regions (genomic sequences that are annotated and encode proteins). These are highly conserved from human to mouse to dog. No surprise.
UTRs: Untranslated regions. These are parts of the genome that go into the mRNA at the beginning (5'UTRs) and at the end (3'UTRs) of the coding part of the mRNA. Again it is not surprising that these are also conserved but not as much as the CDs. In a nice twist 5'UTRs and 3'UTRs are equally conserved, although most research right now is focused on how 3'UTRs regulate the behavior of the mRNA.
RxFrag: These are primary transcripts (that 70-90% of the genome that is transcribed).
Un.TxFrag: These are the fraction of long lived transcripts that are of unknown function.
ARs: Ancient repeats. These are ancient DNA fragments that are thought to have been inserted in the genome at an early stage of mammalian evolution and serve no known biological role. They are used as a measure of random mutational drift.
From the data you can see that the portions of the genome that are transcribed but do not have any known biological role (RxFrag and Un.TxFrag) are poorly conserved. It is thus likely that these unannotated transcriptional products are representative of transcriptional noise. Thus it would seem like there is a lot of noise in how genes are transcribed, both at the level of transcriptional start sites and at the level of the mRNA.
Yes, there is a slight bit of conservation (compare RxFrag and Un.TxFrag vs ARs). Why might that be? There are several possibilities:
a) Hidden in these fragments are some important and conserved transcripts and this accounts for the little bit of conservation. There may be other functional ncRNAs that fold in ways that escape our predictive algorithms.
b) Although the transcriptional noise does not hurt the organism, there is some constraint as certain types of noise ARE harmful. So the noise that is there is a tad constrained in that it can't turn to harmful noise. The noise is created by the serendipitous binding of the right transcriptional initiators to these extra transcriptional start sites.
c) The transcripts do serve some role. As I've discussed before, organisms are very plastic and malleable. Cellular machines act as self-organizing entities whose core components give the cell both robustness and plasticity. Think of these cellular machines as having jello-like properties: they can be metaphorically pushed and shoved by peripheral entities and this helps the jello take on many shapes, yet jello has a capacity to hold together despite these pushes and shoves. And the bonus from this type of setup is that it does not take many peripheral components to change the look of your jello blob. The ability to be adaptive and to use this adaptability to be molded by peripheral players that are subject to evolutionary change has been called evolvability or the Baldwin effect. All these unknown transcripts may have some small effect on the core cellular machinery that is encoded by the conserved parts of the genome. Now having said that this "noise" may actually have a role, I personally find this possibility a little improbable, but still possible.
d) An interesting possibility is that the act of transcription may be important, but the transcript itself may not be so crucial. This option is a hybrid of (b) and (c). Many believe that transcription and histone modifications may go hand in hand.
e) Organisms have a lot of experimentation going on. From this noise, some useful RNAs sometimes appear. Perhaps certain non functional transcripts are worth conserving because they often give rise to important functional genes. Organisms that have this noise are selected because the noise often gives rise to real tools that can be used.
f) Some DNA regulatory motifs (annotated by ENCODE as regulatory factor binding regions - RFBRs in the bar graph above) may not have been picked up by the assays used in this study. Some of these regions may at times be transcribed into these unannotated transcripts. That would be a trivial explanation for all this. Now why would these DNA regulatory sequences be transcribed? Perhaps it is just chance. Or maybe the transcription of these regions is critical for their ability to regulate gene activation ... as I suggested in (d).
One last thing about this "noise" transcription (if it is really noise) ... it is highly variable between different tissues and cell types. Here is a graph from the paper. They investigated the stable transcripts from 11 sources. You'll notice that much of the transcription from intergenic sequences and intronic sequences appears in only one of the 11 cell lines (the blue and violet bars in the graph below). In contrast, many of the annotated transcripts that code for proteins (exonic or green bars) are found in multiple cell lines.

4) What is conserved? Now here I have a big bone to pick with all the science journalism out there. Yes, it would seem that of the highly conserved bits of the genome, 40% are unaccounted for IN THIS STUDY! It is likely that these bits of the genome are regulatory elements (i.e. promoters, enhancers, repressors ...) that control which genes are turned on when. Now seeing that there are thousands of DNA binding proteins and this study only analyzed where a handful of these proteins bound, it is likely that the rest of this conserved genomic sequence is made up of regions of DNA that bind to these other DNA binding proteins. From the article:
The large fraction of constrained sequence that does not match any experimentally identified elements is not surprising considering that only a limited set of transcription factors, cell lines and biological conditions have thus far been examined.
5) Transcription is sloppy. Genes have several transcriptional start sites, and transcripts begin and end at various genomic sites. mRNAs are very heterogeneous. And lots of transcripts match the template and not the sense strand. So transcription of any single gene leads to a wide array of mRNA products. No real surprise here. Just look at all the RNA bits that are pulled out of expressed sequence tag (EST) libraries. In a weird coincidence, this data comes out just as NCBI is trying to clean up its database of all these mRNAs.
Overall the ENCODE project has brought together all these little bits of information to provide a clearer picture of how the genome is being managed. Some might say that it is a lot messier than we think. But if you think about binding constants and the sloppiness of biology in general ... the ENCODE findings make a great deal of sense.
For more on the ENCODE project visit their site.
 
what parts of the genome get translated into RNA
Alex, c'mon!! "Translated" into RNA? I expect better out of you.
And how realistic is expression data from cell lines? I'm hesitant to accept any of the transcription data from ENCODE.
That's a good summary Alex. I read the paper about a week after it came out, with all the media fuss about it being full of paradigm-shifting results ringing in my head, and came to a similar conclusion to you. Most of the findings are confirmations of previous studies showing high levels of genomic transcription and regulatory sequences internal and distal to gene sequences as well as the customary upstream promoter. The paper by Barski et al in Cell a few weeks back regarding histone modification in T cells provides a good companion read to the ENCODE project paper and I'd advise you to go read that if you haven't seen it yet (it wasn't part of the ENCODE project but it provides a lot of nice detail about the histone status aspect of transcriptional regulation).
Sorry RPM, I do that all the time and it drives my PI nuts.
Yeah, I'm with MartinC,
All these guys claiming a big paradigm shift, who are they? Do they actually understand molecular biology?
MartinC,
I'll have to look at that paper, I saw it but haven't delved in. It's funny, I was on the phone with someone at Seed - we were talking about the ENCODE paper and I was correcting all the misconceptions that they had picked up from reading bad coverage ... there are a lot of false ideas about the findings in the ENCODE paper travelling around. Hopefully people who aren't experts will find better sources than the mainstream news.
RPM,
Some of the experiments were done in cell lines, some in tissues. The experiments were all over the place and this inconsistency in where each experiment was done could have been avoided; however, most of the main findings will probably apply to most cells. Of course this will explain certain aspects of the study, such as the 40% "unannotated" conserved sequences.
Just a dumb question from a non-biologist but what do we know about the life-sequence of (RNA) nucleotides? How are nucleotides made? Do they get reused multiple times by different RNA strands, and if so how does the recycling work? How long do they last?
Nucleotide metabolism is very complex. The building blocks of RNA and DNA (nucleotides and deoxynucleotides) are synthesized by every cell in your body. They can be destroyed through UV rays or simply during the course of normal cellular metabolism.
Nucleotides can be reused many times to form long RNA chains; theoretically they can be reused indefinitely. Once an RNA is degraded into its components, NMPs (nucleoside monophosphates), certain enzymes attach two extra phosphates to the end of each molecule to form NTPs (nucleoside triphosphates). These molecules are then the building blocks for new RNA molecules. When RNA polymerase wants to glue a nucleotide to the end of a growing RNA molecule, it takes an NTP, shaves off the two extra phosphates and attaches the NMP to the RNA chain.
Think of nucleotides as legos, and the RNA as a chain of legos constructed by your son or daughter. The life of the lego chain depends on your child's play. He/she will stick the legos together and pull them apart without affecting the functionality of each individual lego block. The life of a lego block depends on an entirely different set of factors, such as how sturdy the plastic is, and whether the parents of the child throw away the lego set the next time the family moves to a new house.
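If it helps, the recycling loop described above can be written as a toy bookkeeping model. This is only a sketch: the class and method names are invented for illustration, and it deliberately ignores new nucleotide synthesis, damage (the UV rays mentioned earlier), the intermediate diphosphate step and the energy cost of adding phosphates.

```python
from dataclasses import dataclass


@dataclass
class NucleotidePool:
    """Toy model: counts of free nucleotides in a cell (hypothetical names)."""
    nmp: int = 0  # monophosphates released when an RNA is degraded
    ntp: int = 0  # triphosphates, ready to be used by RNA polymerase

    def transcribe(self, length: int) -> int:
        """Build an RNA chain; each NTP used becomes one NMP unit in the chain.

        Returns the length of the RNA actually built (limited by the NTP supply).
        """
        built = min(length, self.ntp)
        self.ntp -= built
        return built

    def degrade(self, rna_length: int) -> None:
        """Break an RNA chain back down into free NMPs."""
        self.nmp += rna_length

    def rephosphorylate(self) -> int:
        """Enzymes add two phosphates to every free NMP, turning it into an NTP."""
        converted = self.nmp
        self.nmp = 0
        self.ntp += converted
        return converted


pool = NucleotidePool(ntp=100)

# Run the build / break down / recharge cycle a few times.
for _ in range(3):
    rna = pool.transcribe(60)   # 60 legos snapped into a chain
    pool.degrade(rna)           # chain pulled apart again
    pool.rephosphorylate()      # blocks made ready for reuse

print(pool)  # the same 100 nucleotides are still available
```

The point of the model is just that the total number of blocks never changes in the loop; in a real cell the pool is topped up by synthesis and drained by damage and breakdown.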