The ENCODE delusion

I can take it no more. I wanted to dig deeper into the good stuff done by the ENCODE consortium, and have been working my way through some of the papers (not an easy thing, either: I have a very high workload this term), but then I saw this declaration from the Electronic Frontier Foundation.

On September 19, the Ninth Circuit is set to hear new arguments in Haskell v. Harris, a case challenging California’s warrantless DNA collection program. Today EFF asked the court to consider ground-breaking new research that confirms for the first time that over 80% of our DNA that was once thought to have no function, actually plays a critical role in controlling how our cells, tissue and organs behave.

I am sympathetic to the cause the EFF is fighting for: they are opposing casual DNA sampling from arrestees as a violation of privacy, and it is. The forensic DNA tests done by police forces, however, do not involve sequencing the DNA; they only examine known variable stretches of repetitive DNA, measuring just the lengths of fragments cut by site-specific enzymes. Those profiles can indicate familial and even, to some degree, ethnic relationships, but not, as the EFF further claims, "behavioral tendencies and sexual orientation". Furthermore, the claim that 80% of our genome has critical functional roles is outrageously bad science.
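To see how little information such a profile actually carries, here is a toy sketch of profile matching (the locus names and repeat counts are invented for illustration; real forensic panels use a standardized set of loci):

```python
# Toy sketch of forensic DNA profile matching. A profile is just a pair
# of repeat-length alleles at each of a handful of variable loci --
# fragment lengths, not gene sequence. Locus names and numbers invented.
profile_a = {"locus1": (12, 15), "locus2": (9, 9),  "locus3": (17, 21)}
profile_b = {"locus1": (12, 15), "locus2": (9, 11), "locus3": (17, 21)}

def matching_loci(p, q):
    """Count loci where both allele-length pairs agree."""
    return sum(sorted(p[locus]) == sorted(q[locus]) for locus in p)

print(f"{matching_loci(profile_a, profile_b)}/{len(profile_a)} loci match")
# Lengths at repetitive loci can establish identity and kinship, but
# they encode nothing about phenotype, behavior, or orientation.
```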

This hurts because I support the legal right to genetic privacy, and the EFF is trying to support it in court with hype and noise; their opposition should easily be able to find swarms of scientists who will demolish that argument, and any scientifically knowledgeable judge should see right through the exaggerations (maybe they're hoping for an ignorant judge?). That conclusion, that 80% of the genome is critical to function, is simply false, and it's the notoriously dishonest heart of ENCODE's conclusions.

And then there is this lovely little commercial for ENCODE, narrated by Tim Minchin, and portraying ENCODE as a giant cancer-fighting robot.

Oh, jebus…that was terrible and cringeworthy. Not just the ridiculous exaggerations (the Human Genome Project also claimed that it would provide the answers to all of human disease, as has, to a lesser degree, seemingly every biomedical grant proposal), but the fact that they invested in some top-notch voice talent and professional animation to promote some fundamentally esoteric science to the general public as a magic bullet…I mean, robot.

Scientists, don't do this. Do make the effort to communicate your work to the public, but don't do it by talking down to them and by portraying your work in a way that is fundamentally dishonest and misleading. If you watch that video, ask yourself afterward: if I hadn't read any of the background on that project, would I have the slightest idea what ENCODE was about from that cartoon? There was no usable information in there at all.

So what is ENCODE, actually? The name stands for Encyclopedia of DNA Elements, and it's the next step beyond the Human Genome Project. The HGP assembled a raw map of the genome, a stream of As and Gs and Cs and Ts, and dumped it in our lap and told us that now we have to figure out what it means. ENCODE attempts to break down that stream, reading it bit by bit, and identifying what each piece does; this part binds to a histone, for instance, or this chunk is acetylated in kidney cells, or this bit is a switch to turn expression of Gene X off or on. It tries to identify which genes are active or inactive in various cell types. It goes beyond the canonical sequence to look at variation between individuals and cell types. It identifies particular genetic sequences associated with Crohn's Disease or Multiple Sclerosis or that are modified in specific kinds of cancers.
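To make that concrete, here is a minimal sketch of the kind of element-by-element annotation such a project accumulates (all coordinates, marks, and cell types below are invented; real ENCODE data are distributed as genome-browser tracks):

```python
# Minimal sketch of ENCODE-style annotation: each record ties a genomic
# interval to one biochemical observation in one cell type. Everything
# here (coordinates, marks, cell types) is invented for illustration.
annotations = [
    # (chromosome, start, end, observation, cell_type)
    ("chr1", 10_000, 10_500, "H3K27ac histone mark",    "kidney"),
    ("chr1", 10_400, 10_900, "DNaseI hypersensitive",   "kidney"),
    ("chr1", 52_000, 52_300, "CTCF binding (ChIP-seq)", "liver"),
    ("chr1", 90_000, 95_000, "transcribed (RNA-seq)",   "neuron"),
]

def features_at(chrom, pos):
    """List every recorded observation overlapping a single position."""
    return [(obs, cell) for c, s, e, obs, cell in annotations
            if c == chrom and s <= pos < e]

print(features_at("chr1", 10_450))
# -> [('H3K27ac histone mark', 'kidney'), ('DNaseI hypersensitive', 'kidney')]
```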

ENCODE also looks at other species and does evolutionary comparisons. We can identify sequences that show signs of selection within the mammals, for instance, and ENCODE then maps those sequences onto proposed functions.

You know what? This is really cool and important stuff, and I'm genuinely glad it's being done. It's going to be incredibly useful information. But there are some unfortunate realities that have to be dealt with.

It's also drop-dead boring stuff.

I remember my father showing me a pile of maintenance manuals for some specific aircraft at a Boeing plant when I was a kid; these were terrifyingly detailed, massive books that broke down, bit by bit, exactly what parts were present in each sub-assembly, how to inspect, remove, replace, repair, and maintain a tire on the landing gear, for instance. It's all important and essential, but…you wouldn't read it for fun. When you had a chore to do, you'd pull up the relevant reference and be grateful for it.

That's ENCODE. It's a gigantic project to build a reference manual for the genome, and the papers describing it are godawful tedious exercises in straining to reduce a massive data set to a digestible message using statistics and arrays of multicolored data visualization techniques that will give you massive headaches just looking at them. That is the nature of the beast. It is, by necessity and definition, a huge reference work, not a story. It is the antithesis of that animated cartoon.

I'm uncomfortable with the inappropriate PR. The data density of the results makes reading the work a hard slog…but that's the price you pay for the volume of information delivered. But then…disaster: a misstep so severe that it makes me mistrust the entire data set. Not only are the papers dense, but I have no confidence in the authors' interpretations (which, I know, is terribly unfair, because there are hundreds of investigators behind this project, and it's the bizarre interpretations of the lead author that taint the whole).

I refer to the third sentence of the abstract of the initial overview paper published in Nature; the first big razzle-dazzle piece of information the leaders of the project want us to take home from the work. That 80%:

These data enabled us to assign biochemical functions for 80% of the genome.

Bullshit.

Read on into the text and you discover how they came to this startling conclusion:

The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type.

That isn't function. That isn't even close. And it's a million light years away from "a critical role in controlling how our cells, tissue and organs behave". All that says is that any one bit of DNA is going to have something bound to it at some point in some cell in the human body, or may even be transcribed. This isn't just a loose and liberal definition of "function", it's an utterly useless one.
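Notice what kind of arithmetic produces a number like 80.4%: take every interval that lit up in any assay, in any cell type, merge them all, and divide by genome length. A minimal sketch (toy intervals, invented numbers):

```python
# How "at least one event in at least one cell type" gets tallied:
# pool every interval from every assay and cell type, merge the pool
# into a union, and divide by genome length. All values here are toys.
GENOME_LENGTH = 1_000

tracks = {
    ("RNA",     "kidney"): [(0, 300), (400, 650)],
    ("H3K4me1", "liver"):  [(250, 500)],
    ("DNaseI",  "neuron"): [(600, 900)],
}

def union_coverage(interval_lists):
    """Total bases covered by at least one interval from any list."""
    covered, reach = 0, 0
    for start, end in sorted(iv for ivs in interval_lists for iv in ivs):
        covered += max(0, end - max(start, reach))
        reach = max(reach, end)
    return covered

frac = union_coverage(tracks.values()) / GENOME_LENGTH
print(f"'functional' fraction: {frac:.0%}")  # 90% for these toy tracks
# Each additional assay or cell type can only push the union upward,
# which is why this definition creeps toward 100% as data accumulate.
```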

Now this is all anyone talks about when describing this research: that it has found a 'function' for nearly all of human DNA (not true, and not supported by their data at all), and that it spells the demise of junk DNA (also not true). We know, for example, that over 50% of the human genome has a known origin as transposable elements, that those sequences are basically parasitic, and that they have no recognizable effect on the phenotype of the individual.
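For reference, the classic tally by repeat class from the draft human genome analysis runs roughly as follows (approximate figures from that paper; more sensitive detection of old, decayed copies pushes the total past half):

```python
# Approximate fractions of the human genome by transposable-element
# class, as reported in the draft-genome analysis (Lander et al. 2001).
# Figures are ballpark; detecting older, decayed copies raises the total.
te_fractions = {
    "LINE-1":           16.9,
    "LINE-2":            3.2,
    "Alu (SINE)":       10.6,
    "MIR (SINE)":        2.2,
    "LTR retroelement":  8.3,
    "DNA transposon":    2.8,
}

total = sum(te_fractions.values())
print(f"recognizable TE-derived DNA: ~{total:.0f}% of the genome")  # ~44%
```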

I don't understand at all what was going through the head of the author of that paper. Here's this awesome body of work he's trying to summarize, he's representing a massive consortium of people, and instead of focusing on the useful, if rather dry, data the work generated, he decides to hang it all on the sensationalist cross of opposing the junk DNA concept and making an extravagant and unwarranted claim of 80 going on 100% functionality for the entire genome.

Well, we can at least get a glimpse of what's going on in that head: Ewan Birney has a blog. It ended up confusing me worse than the paper.

For instance, he has a Q&A in which he discusses some of the controversy.

Q. Hmmm. Let’s move onto the science. I don’t buy that 80% of the genome is functional.
A. It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word “functional” so let’s try to tackle this first. Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, “broad” histone modifications, “narrow” histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.

Oh, jeez, straining over definitions. Ultimately, what he ends up doing is redefining "functional" to not mean functional at all, but to mean simply anything that their set of biochemical assays can measure. It would have been far more sensible to use a less semantically overloaded word or phrase (like "specific biochemical activity") than to court confusion by charging into a scientific debate about functionality that he barely seems to comprehend. It would also have conformed to the goals he claims to have wanted to achieve with public education.

ENCODE also had the chance of making our results comprehensible to the general public: those who fund the work (the taxpayers) and those who may benefit from these discoveries in the future. To do this we needed to reach out to journalists and help them create engaging stories for their readers and viewers, not for the readers of Nature or Science. For me, the driving concern was to avoid over-hyping the medical applications, and to emphasize that ENCODE is providing a foundational resource akin to the human genome.

Uh, "giant cancer-fighting robot", anyone? Ewan Birney's name is right there in the credits to that monument to over-hyping the medical applications.

I'll be blunt. I don't think Birney has a clue about the biology. So much of what he has said about this project sounds human-centered and biased towards gross misconceptions about our place in biology. "We are the most complex things we know about," he says, and seems to think that there is a hierarchy of complexity that correlates with the phylogenetic series leading to humans, where, for instance, fugu are irrelevant to the argument because they're not a mammal. This is all nonsense. I would not be at all surprised to learn that the complexity of the teleost genome is significantly greater than that of the tetrapod genome; and there's nothing more complex about our genetics than that of a mouse. I get the impression of an extremely skilled technologist with almost certainly some excellent organizational skills, who is completely out of his depth on the broader conceptual issues of modern biology. And also, someone who is a total media disaster.

But I'm just a guy with a blog.

There is a mountain of material on ENCODE on the web right now — I've come late to the table. Here are a few reading recommendations:

Larry Moran has been on top of it all from day one, and has been cataloging not just the scientific arguments against ENCODE's over-interpretation, but some of the ridiculous enthusiasm for bad science by creationists.

T. Ryan Gregory has also been regularly commenting on the controversy, and has been confronting those who claim junk DNA is dead with the evidence: if organisms use 100% of their genome, why do salamanders have 40 times as much DNA as we do, and fugu only an eighth as much?

Read Sean Eddy for one of the best summaries of junk DNA and how ENCODE hasn't put a dent in it. Telling point: a random DNA sequence inserted into the human genome would meet ENCODE's definition of "functional".

Seth Mnookin has a pithy but thoughtful summary, and John Timmer, as usual, marshals the key evidence and makes a comprehensible overview.

Mike White summarizes the ENCODE project's abject media failure. If one of Birney's goals was to make ENCODE "comprehensible to the general public", I can't imagine a better example of a colossal catastrophe. Not only do the public and the media fail to understand what ENCODE was about, but they've instead grasped only the completely erroneous misinterpretation that Birney put front and center in his summary.

You'll be hearing much more about ENCODE in the future, and unfortunately it will be less about the power of the work and more about the sensationalistic and misleading interpretation. The creationists are overjoyed, and regard Birney's bogus claims about the data as a vindication of their belief that every scrap of the genome is flawlessly designed.

Comments

I don't think DNA fingerprinting has been based on RFLP analysis for a looong time. It's based on multiplex PCR of 13 STR sites. Pretty straightforward to argue that, regardless of the 'activity' of 80%, these sites are not going to reveal personal phenotypes. Plus, the numbers are encoded and secret, for cryin' out loud.

Great post! But while we do know that over 50% of the human genome is derived from transposable elements, a lot of them do have recognizable effects on the phenotype. For decades now we have known that TEs can donate promoters and enhancers to activate nearby genes, and we have shown that some TEs have very specific roles in endometrial biology that are important for the establishment and maintenance of pregnancy.

I hated that 80% bit because it implied that ENCODE found this out for the first time, ignoring that it was suggested more than 40 years ago by McClintock, and by Britten and Davidson, and that there were piles of really nice studies showing TEs have clear gene regulatory functions.

By Vinny Lynch (not verified) on 23 Sep 2012 #permalink


@ Vinny Lynch
a lot of them do have recognizable effects on the phenotype

Please define "a lot of them".

There are 500,000-1,000,000 Alu elements, 400,000 MIRs, 200,000-500,000 LINE-1s, 300,000 LINE-2s, 200,000 DNA transposons, ~250,000 LTRs, etc. This is about 30% of the genome, according to http://www.ncbi.nlm.nih.gov/books/NBK7587/table/A738/?report=objectonly.

Precisely what percent is "a lot"?

By The Other Jim (not verified) on 24 Sep 2012 #permalink

Very nice point of view from a general biologist :-) Congratulations on having a strong opinion and wanting to share it so bluntly :D Just to say: people who work on genome-wide regulation of transcription have no problem with the 80%, as it was already known. What is interesting is that some of the data were generated by the same labs, with the same techniques and the same analysis. So now we have the data cataloged and analyzed (by the way, lots of the genome-wide data were already available, but because the analysis is so BORING, nobody had done it on such a scale. Instead everybody was waiting for ENCODE). By the way (again), I do not work for them. I am just happy that they did this job for me :-) On another note, lots of genome-wide data show real functionality, and that is why they use this term so often. And just to conclude, there are several points that are overstated in your blog (this post). I wish I had time to point them all out, but not now; I have to go back to my boring experiment :) I hope to be back!
Enjoying your blog!

@ The Other Jim,

See, for example, this study, which found "280,000 putative regulatory elements, totaling approximately 7 Mb" that are exapted from TEs:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0043…

Or this study, which found that at least 5.5% of conserved non-exonic elements in the human genome are derived from TEs:

http://www.pnas.org/content/104/19/8005.abstract

or my own study on a particular DNA transposon:

http://www.nature.com/ng/journal/v43/n11/full/ng.917.html

These are only a few of the studies. We are working now to correlate TEs with the ENCODE data, and will have a bigger picture soon.

By Vinny Lynch (not verified) on 24 Sep 2012 #permalink

@ Vinny Lynch,
So you are estimating in the 10% range at the moment?

By The Other Jim (not verified) on 25 Sep 2012 #permalink

@ The Other Jim,

Based on publications it's something like 10%; based on our analysis of ENCODE data, 30-60% of TFBSs and epigenetic enhancer/promoter marks are found within TEs. So it is likely a much more widespread pattern.

By Vinny Lynch (not verified) on 25 Sep 2012 #permalink

"I get the impression of an extremely skilled technologist with almost certainly some excellent organizational skills, who is completely out of his depth on the broader conceptual issues of modern biology. And also, someone who is a total media disaster."

Ah yes, the "it's not real science, you haven't touched the infinite like I have" argument. Must suck to see oneself become obsolete.

@ Vinny Lynch, sorry for the delay. Work kept me from visiting.

We have 3.50pg per haploid genome

We will be generous and do the math with a lot of generous rounding up:
Protein-coding genes - 2%
Regulatory RNAs - 2%
rRNA + tRNA - 0.5%
Centromeres, telomeres, origins of replication, etc. - 5%
Total - 9.5% = 0.33pg

Taking your high estimate and assuming 60% of TEs (which make up 50% of the genome) are required, that adds 30% of the genome = 1.05pg.

Grand total: 1.38pg = 39.5% of the genome.

At 30% functional TEs, this drops to 24.5% of the genome.

The 30% functional TE estimate is pretty much in line with Comings' 1972 estimate.
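That arithmetic is easy to check in a few lines (a minimal sketch using the same assumed percentages and the 3.50pg haploid genome):

```python
# Recomputing the back-of-envelope budget above: a 3.50 pg haploid
# genome, generous non-TE percentages, and TEs taken as 50% of the
# genome. All percentages are the assumptions stated in the comment.
GENOME_PG = 3.50
NON_TE_PCT = 2.0 + 2.0 + 0.5 + 5.0   # genes, regRNAs, rRNA+tRNA, structural
TE_PCT_OF_GENOME = 50.0

def functional_budget(te_functional_pct):
    """Return (pg, % of genome) required if te_functional_pct of TEs matter."""
    pct = NON_TE_PCT + TE_PCT_OF_GENOME * te_functional_pct / 100
    return GENOME_PG * pct / 100, pct

for te_pct in (60, 30):
    pg, pct = functional_budget(te_pct)
    print(f"{te_pct}% of TEs functional -> {pg:.2f}pg = {pct:.1f}% of genome")
# 60% -> 1.38pg = 39.5%;  30% -> 0.86pg = 24.5%
```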

By The Other Jim (not verified) on 27 Sep 2012 #permalink

I've met Ewan on several occasions. Nice guy who really does try hard to encourage collaboration and outreach programs. His net impact has been positive from my perspective (lots of free data for me to use).

He's screwed up on the 80% functional statement, though, and I'm not just saying that as a mindless parrot. I've been deep into sequencing data for 5 years now. Next-gen (Illumina) data at single-base resolution for various species, various cell types, various time points, various chemical treatments and various mutants; mRNA, smRNA, ChIP-seq, 5mC, 5hmC, and numerous histone mods. Alongside our own data has been everything relevant that I could pull down from ENCODE, all piled up in relational databases for side-by-side analysis. Over a petabyte of data under management.

From these data do I get the impression that 80% of the human genome is "functional"? Sure, I can pick just about any part of the human genome at random and between all the available datasets I can find a significant delta (after FDR correction), but that does *not* mean that the region has any relevant function.
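A toy simulation makes the point concrete (every number below is invented; the "datasets" are just random draws with a tiny true shift baked in):

```python
import numpy as np
from scipy import stats

# Toy model: 1,000 genomic regions, each measured in 50 independent
# datasets (cell types / marks / treatments), with only a tiny true
# shift between conditions -- real, but biologically trivial.
rng = np.random.default_rng(42)
n_regions, n_datasets, n = 1_000, 50, 100
effect = 0.3                                   # small standardized shift

a = rng.normal(0.0,    1.0, (n_regions, n_datasets, n))
b = rng.normal(effect, 1.0, (n_regions, n_datasets, n))
pvals = stats.ttest_ind(a, b, axis=2).pvalue   # shape (regions, datasets)

# Benjamini-Hochberg FDR across all tests at q = 0.05
flat = np.sort(pvals.ravel())
k = np.arange(1, flat.size + 1)
passing = flat[flat <= 0.05 * k / flat.size]
threshold = passing.max() if passing.size else 0.0
hit_anywhere = (pvals <= threshold).any(axis=1).mean()

print(f"regions with >=1 significant delta: {hit_anywhere:.0%}")  # ~100%
# A significant delta somewhere is almost guaranteed -- and it says
# nothing about whether the region has any relevant function.
```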

By all means, report on the observed differences, but don't go claiming "function" without an experiment to demonstrate a solid biochemical mechanism that supports the claim.

Epigenetic deltas are a dime a dozen, and it seems as if the ENCODE authors are a bit over-excited by them. Epigenetic states vary a lot between cell types, but that doesn't necessarily mean that the DNA sequences under, say, variable histone marks are actually involved in anything that's functionally important.

It seems from Ewan's attempt to clarify that the presence of RNA Pol II or the presence of H3K4me1 indicates function, but that stuff pops up EVERYWHERE. I've seen the ENCODE datasets in question, and they're quite shitty (although they were recently improved). Bad antibodies, and the controls are way, way too thin.

It's also highly misleading to imply (in the video) that the knowledge gleaned from ENCODE will have a huge impact on the understanding and treatment of disease. I remember all too well how over-excited people got about the human genome, but a decade on, what have we learned? Not even close to what was projected in the hype.

Having spent a long time looking at epigenetic data, I assure you, we're only just scratching the surface, and even if we do make progress in associating epigenetic states with diseases, it's still a very long way from there to understanding how to modify those states in any way that can lead to an effective treatment. We're talking about hugely complicated four-dimensional regulation of hormone cocktails (for example). There are many decades to go before we crack this egg, and the hype from ENCODE is going to end up looking like the hype from the human genome project, I'm sure of it.