Evolution of Non-Coding Elements in Vertebrates

(Disclaimer: this is not my field but the paper looked interesting so here goes ...)

Promoters, enhancers and other DNA regulatory elements that turn gene transcription on or off are important. We've known this for quite a while. Many would argue that metazoans all have the same major gene families. Getting closer to us, most vertebrates have the same types of cells and have very similar genes and gene counts. That is not surprising, as most genes encode the different tools that go into making each major cell type found in all vertebrates. To rephrase this idea in a different manner (so that you won't forget), all vertebrates have neurons, muscle cells and fibroblasts, and thus it is not surprising that vertebrates have almost the same collection of genes. What distinguishes one vertebrate from another is how these cells are specified and placed together. Activation of slightly different gene programs leads to modifications in cell migration, tissue patterning and thus body shape. Thus in vertebrates it is thought (by many) that evolution works to a large extent on gene regulation ... in other words, selection acts on

1) the DNA that turns on and off genes, and
2) the proteins that enact the turning on/off of these signals (transcription factors).

Now that doesn't mean that changes in protein-coding sequences have no role to play in vertebrate evolution. In fact, from recent debates (outside my field) it would seem that changes to protein-coding areas of the genome probably contribute significantly to evolution. But non-coding DNA regulatory elements are likely to play bigger roles.

From the ENCODE paper (see this entry) we learned that about half the conserved bits of DNA in mammals correspond to transcripts that go on to specify proteins. Much of the rest of this conserved DNA is controlling gene expression. Now in a new PLoS Genetics paper, we get a better look at how these Conserved Non-coding Elements (CNCs) vary across mammals and across vertebrates.

This analysis (from human, chimp, dog, mouse, rat, chicken, fugu and zebrafish genomes) demonstrates that quite a few of these CNCs are conserved, but also that quite a number of them are undergoing rapid evolution.

Here is a figure from the paper that indicates how these CNCs are changing in the mouse genome relative to the other mammals.

[Figure: evolution rates]

In panel A we see how the mouse CNCs stack up to their mammalian counterparts. A p-value near 0 indicates an increased rate of change, and a p-value near 1 indicates a decreased rate.

From the text:

At the significance level of 0.001, 1027 (1.2%) and 503 (0.6%) mammalian CNCs show speed-ups and slow-downs, respectively. Among amniotic CNCs, 228 (1.4%) and 106 (0.6%) show speed-ups and slow-downs, respectively on the mouse lineage.

In panel B we get a closer look at the fast and slowly evolving CNCs in mouse:

Fast- and slow-evolving CNCs are indicated in red and blue, respectively. The violet dashed horizontal line shows the genome-wide average substitution rate on the mouse lineage for unconstrained regions near the fast-evolving CNCs.

So what does this tell us?

We estimate that 68% (54,643/81,957) of the mammalian CNCs evolve at a single rate. The remaining nonneutral CNCs show rate changes on at least one lineage.
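As a quick sanity check on the two counts in that quote, here is the arithmetic in a few lines of Python. The variable names are mine; the numbers are taken at face value from the quoted sentence.

```python
# Counts as printed in the quoted sentence.
single_rate_cncs = 54_643
all_mammalian_cncs = 81_957

# Fraction of CNCs fitting a single rate, and the remainder.
fraction_single = single_rate_cncs / all_mammalian_cncs
fraction_changing = 1 - fraction_single

print(f"{fraction_single:.1%} single-rate, {fraction_changing:.1%} with a rate change")
```

The ratio of the printed counts comes out very close to two thirds.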

So about 2/3rds of the mammalian CNCs are changing at a constant rate in every lineage. Of the 1/3rd that is left, half are slowing down in some lineages (but are changing at the same rate in the other lineages) and half are speeding up, again in only a subset of the lineages. As for humans and chimps,

... there are 638 and 530 CNCs that show rate speed-ups on the human and chimpanzee lineages, respectively, far more than the four and eight CNCs, respectively, showing slow-downs.

Dogs also show more speed-ups than slow-downs. Since these speed-ups are faster than the expected rate of neutral change, these changes are likely due to an acceleration in evolution (i.e. positive selection).

So what is changing? From the paper:

We next looked at whether CNCs showing significant rate speed-ups are more likely to be in the proximity of particular kinds of genes [17], using the PANTHER GO database [32]. A significant difficulty in this sort of analysis is that even for those CNCs that act as cis-regulators, it is unknown which of the nearby genes is being regulated. However, as a rather imperfect proxy for this we simply used, for each CNC, the nearest gene (in either orientation). For each branch of the mammalian tree, we divided the CNCs into those with increased rate on that branch (by AIC) and used CNCs evolving under the null model as "neutral" controls. We looked at whether particular biological process categories were enriched among the nearest genes of the selected CNCs compared to the neutral CNCs.

For mammalian CNCs, there is significant enrichment of the process categories "amino acid activation" and "other coenzyme and prosthetic group metabolism" on the dog and the lineage leading to the common ancestor of mouse and rat (rodent lineage), respectively, at p

I looked at these tables (especially this one) and ... yeah it looks like a collection of random items, although two of the categories near rapidly changing human CNCs were "sensory perception" and "neuronal activities" ...

ref:
Su Yeon Kim, Jonathan K. Pritchard
Adaptive Evolution of Conserved Noncoding Elements in Mammals
PLoS Genetics Vol. 3, No. 9, e147 doi:10.1371/journal.pgen.0030147


Without having read the paper, I wonder if the data presentation in the figures is really appropriate. In particular the use (abuse?) of p-values in the panel A is quite unusual. Normally, a p-value indicates a statistical significance that an observation deviates from an underlying null model, with very small p-values being very significant. p-values close to 1 correspond to a (boring) expected behavior according to the null-model. According to your snippet, the authors seem to suggest that high p-vals are also significant (for an opposite deviation from randomness). This is highly unusual, to say the least.
One more caveat: good p-values do NOT indicate that an effect is particularly strong - they just indicate that we can be particularly sure that the effect is non-random.

Does anyone know which program might have been used to produce the second graph? :)

Normally, a p-value indicates a statistical significance that an observation deviates from an underlying null model, with very small p-values being very significant. p-values close to 1 correspond to a (boring) expected behavior according to the null-model

Under the null expectation, a series of p-values should follow a uniform distribution on (0,1). The distribution in the panel does not: there is a clear enrichment for small and large values. The figure is a way of communicating that clearly (and effectively, in my opinion). The authors could equally have tested two hypotheses (speed-up vs. slow-down), and the large p-values would have become small ones in the opposite test (and vice versa).
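To make the point about the uniform distribution concrete, here is a small simulation sketch in Python. It is only an illustration and has nothing to do with the authors' actual data or test: I assume a standard normal test statistic, a one-sided p-value P(Z > z), and made-up effect sizes and proportions for the "speed-up" and "slow-down" groups.

```python
import random
from statistics import NormalDist


def upper_tail_p(z):
    """One-sided p-value P(Z > z) for a standard normal test statistic."""
    return 1.0 - NormalDist().cdf(z)


def histogram(p_values, bins=10):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


rng = random.Random(0)

# Null case: every statistic is drawn from the null distribution.
null_p = [upper_tail_p(rng.gauss(0, 1)) for _ in range(10000)]

# Mixture: mostly null, plus shifted statistics (hypothetical effect sizes).
mixed_z = (
    [rng.gauss(0, 1) for _ in range(9000)]
    + [rng.gauss(4, 1) for _ in range(600)]   # stand-ins for "speed-ups"
    + [rng.gauss(-4, 1) for _ in range(400)]  # stand-ins for "slow-downs"
)
mixed_p = [upper_tail_p(z) for z in mixed_z]

null_counts = histogram(null_p)
mixed_counts = histogram(mixed_p)
print("null: ", null_counts)   # roughly flat
print("mixed:", mixed_counts)  # extra mass in the first and last bins
```

In the null case every bin holds roughly a tenth of the p-values. In the mixture the middle stays flat while the first bin (p near 0) and the last bin (p near 1) fill up, which is the shape described for panel A.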

Does anyone know which program might have been used to produce the second graph?

R

Question:

you write:

"in other words, selection acts on

1) the DNA that turns on and off genes, and
2) the proteins that enact the turning on/off of these signals (transcription factors). "

isn't this redundant, since the DNA (i.e. 1) can't turn genes on or off without coding for the proteins (i.e. 2)?

p-ter, thanks for that. I reread that section and this is what the authors write:

For each CNC, we calculated SRT_i [Shared Rates Test] for each of the seven branches of the mammalian tree to identify CNCs that have experienced a speed-up or slow-down on a particular branch. Figure 4A shows the histogram of p-values on the mouse lineage (SRT_m) for the mammalian CNCs. The p-values are defined as P(SRT_i > srt_i) where srt_i is the observed value. Hence, p-values near 0 indicate increased rates, and near 1 indicate decreased rates. The histogram is flat for intermediate p-values with peaks at both ends, suggesting that most CNCs fit the null distribution of SRT_m, but with a substantial number of outliers.

If anyone out there has a plain word explanation of what these statistics mean, please leave a comment or a link.

Raj,

Genes get turned on when the right transcription factors bind to the right DNA elements found upstream (or downstream) of the gene of interest. All I am saying is that to affect gene transcription (i.e. turning on or off genes) you can modify the DNA element (point 1) or the transcription factor (point 2). What the ENCODE project found was that about half of the conserved elements in the mammalian genome consist of DNA elements whose only role is to bind to transcription factors and regulate gene expression. Many would agree that the gradual change of these DNA elements is the prime force in directing vertebrate evolution.

The shared rates test is a statistic invented by the authors. Essentially, you can fit a single rate of sequence change across the entire phylogeny to a given region, then fit a model where there are two parameters: an overall rate plus a lineage-specific rate (i.e. a mouse-specific or human-specific rate). You then test the probability that the lineage-specific rate is slower than the overall rate.

If the lineage-specific rate is much slower, the probability will go to 1, while if the lineage-specific rate is much faster, the probability will go to zero. Again, if they had chosen to test whether the lineage-specific rate were faster than the overall rate, that would be reversed.
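For anyone who wants to play with the idea, here is a toy version in Python. To be clear, this is not the authors' shared rates test. It is a simplified stand-in that assumes substitution counts on each branch are Poisson with mean rate times exposure (branch length times number of sites), and it uses a signed likelihood-ratio statistic with a normal approximation. The function names and example numbers are made up.

```python
import math
from statistics import NormalDist


def _loglik_at_mle(count, exposure):
    """Poisson log-likelihood at the MLE rate count/exposure (constants dropped)."""
    if count == 0:
        return 0.0
    rate = count / exposure
    return count * math.log(rate) - rate * exposure


def toy_rate_test(counts, exposures, branch):
    """Toy p-value: near 0 if `branch` is faster than the rest, near 1 if slower."""
    k_i, e_i = counts[branch], exposures[branch]
    k_rest = sum(counts) - k_i
    e_rest = sum(exposures) - e_i

    # Null model: one shared rate across all branches.
    ll_null = _loglik_at_mle(k_i + k_rest, e_i + e_rest)
    # Alternative: the chosen branch gets its own rate.
    ll_alt = _loglik_at_mle(k_i, e_i) + _loglik_at_mle(k_rest, e_rest)

    lr = max(0.0, 2.0 * (ll_alt - ll_null))
    # Sign the statistic by the direction of the change on this branch.
    sign = 1.0 if k_i / e_i > k_rest / e_rest else -1.0
    z = sign * math.sqrt(lr)
    return 1.0 - NormalDist().cdf(z)  # P(Z > z)


exposures = [1.0] * 7  # seven branches, equal exposure (an assumption)
p_fast = toy_rate_test([20, 5, 5, 5, 5, 5, 5], exposures, branch=0)
p_same = toy_rate_test([5, 5, 5, 5, 5, 5, 5], exposures, branch=0)
p_slow = toy_rate_test([0, 5, 5, 5, 5, 5, 5], exposures, branch=0)
print(p_fast, p_same, p_slow)
```

A branch with many more substitutions than the others gives a p-value near 0, one with far fewer gives a p-value near 1, and one that matches the others lands at about 0.5. Flipping the sign of the statistic would reverse which end means "faster".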