Let's slap ENCODE around some more

Since we still have someone arguing poorly for the virtues of the ENCODE project, I thought it might be worthwhile to go straight to the source and cite an ENCODE project paper, Defining functional DNA elements in the human genome. It is a bizarre thing: it actually lays out the case for rejecting the idea of high degrees of functionality, which is a good approach, since it demonstrates that the authors have at least seen the arguments against them. But then it sails blithely past those objections to declare, in effect, that we should just ignore the evolutionary evidence.

Here's the paragraph where they discuss the idea that most of the genome is non-functional.

Case for Abundant Junk DNA. The possibility that much of a complex genome could be nonfunctional was raised decades ago. The C-value paradox refers to the observation that genome size does not correlate with perceived organismal complexity and that even closely related species can have vastly different genome sizes. The estimated mutation rate in protein-coding genes suggested that only up to ∼20% of the nucleotides in the human genome can be selectively maintained, as the mutational burden would be otherwise too large. The term “junk DNA” was coined to refer to the majority of the rest of the genome, which represents segments of neutrally evolving DNA. More recent work in population genetics has further developed this idea by emphasizing how the low effective population size of large-bodied eukaryotes leads to less efficient natural selection, permitting proliferation of transposable elements and other neutrally evolving DNA. If repetitive DNA elements could be equated with nonfunctional DNA, then one would surmise that the human genome contains vast nonfunctional regions because nearly 50% of nucleotides in the human genome are readily recognizable as repeat elements, often of high degeneracy. Moreover, comparative genomics studies have found that only 5% of mammalian genomes are under strong evolutionary constraint across multiple species (e.g., human, mouse, and dog).

Yes, that's part of it: it is theoretically extremely difficult to justify high levels of function in the genome -- the genetic load would simply be too high. We also see that much of the genome is not conserved, suggesting that it isn't maintained by selection. Not mentioned, though, are other observations, such as the extreme variability in genome size between closely related species, which does not seem to correlate with complexity or function at all, or the fact that much "junk" DNA can be deleted without any apparent phenotypic effect. It's very clear to anyone with an appreciation of evolutionary constraints that the genome is largely non-functional, on both theoretical and empirical grounds.
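The load argument can be made concrete with a back-of-the-envelope calculation. This is only a sketch with assumed round numbers -- the mutation rate, genome size, and deleterious fraction below are my placeholder values, not figures taken from the paper:

```python
import math

# Rough, assumed figures (placeholders for illustration, not from ENCODE):
MU = 1.25e-8      # mutations per site per generation in humans
GENOME = 3.2e9    # genome size in base pairs
P_DEL = 0.4       # assumed fraction of mutations in functional DNA that are deleterious

def required_fecundity(functional_fraction):
    """New deleterious mutations per generation (U), and the offspring per
    individual needed to purge that load under the classic e^U approximation."""
    u = MU * GENOME * functional_fraction * P_DEL
    return u, math.exp(u)

for f in (0.05, 0.10, 0.80):
    u, fec = required_fecundity(f)
    print(f"functional fraction {f:4.0%}: U = {u:5.1f}, offspring needed ≈ {fec:,.1f}")
```

With ~10% of the genome functional the numbers are survivable; at the 80%+ implied by ENCODE's rhetoric, every couple would need hundreds of thousands of offspring just to break even against mutation. That is the genetic load problem in miniature.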

Their next paragraph summarizes their argument for nearly universal function. It's strange because it is so orthogonal to the previous paragraph: I'd expect at least some token effort would be made to address the constraints imposed by the evolutionary perspective, but no…the authors make no effort at all to reconcile what evolutionary biologists have said with what they claim to have discovered.

That's just weird.

Here's their argument: most of the genome gets biochemically modified to some degree and for some of the time.

Case for Abundant Functional Genomic Elements. Genome-wide biochemical studies, including recent reports from ENCODE, have revealed pervasive activity over an unexpectedly large fraction of the genome, including noncoding and nonconserved regions and repeat elements. Such results greatly increase upper bound estimates of candidate functional sequences. Many human genomic regions previously assumed to be nonfunctional have recently been found to be teeming with biochemical activity, including portions of repeat elements, which can be bound by transcription factors and transcribed, and are thought to sometimes be exapted into novel regulatory regions. Outside the 1.5% of the genome covered by protein-coding sequence, 11% of the genome is associated with motifs in transcription factor-bound regions or high-resolution DNase footprints in one or more cell types, indicative of direct contact by regulatory proteins. Transcription factor occupancy and nucleosome-resolution DNase hypersensitivity maps overlap greatly and each cover approximately 15% of the genome. In aggregate, histone modifications associated with promoters or enhancers mark ∼20% of the genome, whereas a third of the genome is marked by modifications associated with transcriptional elongation. Over half of the genome has at least one repressive histone mark. In agreement with prior findings of pervasive transcription, ENCODE maps of polyadenylated and total RNA cover in total more than 75% of the genome. These already large fractions may be underestimates, as only a subset of cell states have been assayed. However, for multiple reasons discussed below, it remains unclear what proportion of these biochemically annotated regions serve specific functions.

That's fine. Chunks of DNA get shut down to transcription by enzymatic modification; we've known that for a long time, but it's generally regarded as evidence that that bit of DNA does not have a useful function. To ENCODE, though, even DNA that is silenced counts as functional. Footprint studies find that lots of bits of DNA get weakly or transiently bound by transcription factors; no surprise, it's what you'd expect of the stochastic processes of biochemistry. Basically, they're describing as functional behavior that is more reasonably described as noise in the system, and declaring that it trumps all the evolutionary and genetic and developmental and phylogenetic observations of the genome.

No, I'm being too charitable. They aren't even trying to explain how that counters all the other evidence -- they're just plopping out their observations and hoping we don't notice that they are failing to account for everything else.
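One way to see how little the coverage figures in the quoted paragraph prove: even if every one of those marks were laid down independently at random -- pure noise -- their union would still blanket most of the genome. A toy calculation, taking the coverages from the quote but assuming independence (my assumption, purely for illustration):

```python
# Coverage fractions quoted in the ENCODE paragraph above; treating them as
# independent random annotations is my simplifying assumption, not theirs.
coverages = {
    "TF motifs / DNase footprints":     0.11,
    "TF occupancy":                     0.15,
    "DNase hypersensitivity":           0.15,
    "promoter/enhancer histone marks":  0.20,
    "elongation histone marks":         0.33,
    "repressive histone marks":         0.50,
}

unmarked = 1.0
for frac in coverages.values():
    unmarked *= (1.0 - frac)   # chance a given site escapes this one mark

print(f"fraction of genome with at least one mark, under independence: {1 - unmarked:.1%}")
```

Around 80% of the genome ends up "biochemically annotated" without any of the marks meaning anything at all. Union-of-coverages numbers simply cannot distinguish function from noise.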

I rather like Dan Graur's dismissal of their logic.

Actually, ENCODE should have included “DNA replication” in its list of “functions,” and turned the human genome into a perfect 100% functional machine. Then, any functional element would have had a 100% chance of being in the ENCODE list.


This needs some editorial work... but I think I get what you're saying.

By Marc Meneghini (not verified) on 31 Dec 2014 #permalink

The funny thing is that they *still* do not get the importance of defining "function". They treat differences in interpretation as a matter of using different approaches to measuring function-- biochemical, genetic and evolutionary-- while assuming that these approaches all measure the same thing, and merely differ in sensitivity somehow. It is an extraordinarily naive analysis, and an excellent illustration of the blind-men-and-the-elephant parable.

By Arlin Stoltzfus (not verified) on 31 Dec 2014 #permalink

I always understood that retroviruses co-opted host regulatory machinery, and vice versa, constituting the acme of molecular host-parasite coevolution.

Meanwhile, the different distributions of Alu and LINE1 elements in the genome would suggest that selection pressure may be involved. Do Alus direct methylation? Are Alus and LINE1s DNA symbionts?

By ShayGaetz (not verified) on 02 Jan 2015 #permalink

@ P Z Myers:

I wonder out loud if you are guilty of a non sequitur.

"If having junk DNA were a clear advantage for future evolution then the genomes of all extant lineages should have lots of junk DNA and should make lots of lncRNAs."

I hope this is not what you are claiming.

Meanwhile, I am reminded that 90%+ of identifiable genetic disease occurs due to mutation outside the protein coding region. Things that make you go hmmm....

ITMT - I am fascinated by Peter Fraser's (and others') work on chromosome architecture and wonder out loud whether much more "functionality" (OK, perhaps a lot of "redundant" functionality) still remains unrecognized, especially for lncRNAs.

I remain perplexed if not downright confused and welcome correction from any and all. So far, I suffer the distinct impression the jury is still out on this question.

By ShayGaetz (not verified) on 02 Jan 2015 #permalink

I use ENCODE data. As with most large datasets that have papers going along with them, I'm interested in your data - whatever fool things you had to say about your data are irrelevant.
I'm not saying Myers' points about this paper are wrong (and I liked Stoltzfus' comment). Just that the first sentence bothered me. Genome sequencing projects also result directly in papers - but we don't judge the value of the project by the paper; we judge it by the data. Was ENCODE worth the money? Not sure.

Tom Cavalier-Smith addressed the C-value paradox in the 1970s, I think, and pointed out that a large genome full of 'junk' DNA may be structural. (He thought it was related to the need for a large nucleus, itself related to the speed of early embryonic gene transcription and cell division; I cannot comment on whether this is more or less convincing than more modern versions linking overall chromosome and nuclear structure to transcriptional control.) So DNA can:

- have no phenotypic effect ("function") and yet still be transcribed, modified, etc., provided the transcripts are not actively disadvantageous;
- have a structural (i.e., relatively sequence non-specific) phenotypic effect, again regardless of its transcription;
- have a sequence-related role in organismal development or function;
- be classic 'selfish' DNA, which will be sequence-constrained by evolution, but not for reasons relating to organismal development, except again insofar as it does not kill the organism.

As I understand it, ENCODE mistakes 'transcription' for 'function', and so cannot distinguish any of these.

I am sure it is a useful data set, but the claims made for it are over-stated.

My understanding is that the PNAS paper was very much written by committee. One individual wrote the junk DNA paragraph, while others wrote the following portions of the paper.

By Alex Palazzo (not verified) on 07 Jan 2015 #permalink