On the Origin of New Exons

By evolgen on September 7, 2006.

Nobel Intent has an excellent summary of a paper in the PNAS pipeline on the origin of new exons in the human genome. The authors compared genes between humans and seven other vertebrates to identify newly arisen exons. They found that many new exons are composed of repeat sequences, such as transposable elements. Also, recently evolved exons are more likely to be alternatively spliced, suggesting there is a "trial period" for a new exon before it can be fully incorporated into the protein coding sequence of a gene.

This is an interesting paper for a number of reasons. One of the things that fascinate me is the reliability of alternative splicing predictions. These are almost entirely based on EST data and that data is known to be flawed in too many ways to list here.

When you look closely at genes that have been intensively studied, the alternative splicing predictions just don't make sense. This is especially true when the structure of the protein has been solved. That's why the most recent annotations of the genome ignore the EST data.

Here's an example from my favorite genes: the HSP70 gene family. The ECgene database for human BiP (HSPA5) lists 14 splice variants (H9C10987). Many of them result in deletions and insertions of amino acid stretches in the hydrophobic core of the protein. This is one of the most highly conserved proteins in all of biology (the 650 amino acid residues of most mammals are almost identical). Does it make any sense that Homo sapiens would evolve new exons for inserting amino acids into the middle of the protein when no other species has them?

Of course it doesn't. That's why when you go to the EntrezGene entry for this gene (3309) you will see that none of the so-called alternative splice variants has been accepted by the annotators in the latest release. It's very important that everyone understand what this decision means. It means that intelligent people (annotators) have correctly rejected all of the alternative splicing data for this gene. This so-called "data" is no different from the data for all other genes with so-called "alternative splice variants."

You see this same pattern in many well-studied genes. It leads to the conclusion that the alternative splicing databases are inaccurate for those genes that we know the most about. It strongly suggests that the entire database is flawed. The EST data is almost useless in predicting exons.

The Zhang & Chasin paper relies heavily on those databases to predict exons that have only "appeared" recently. If most of those exons were artifacts arising from the flawed EST data, then we would expect the following:

1. They would only be "included" in rare ESTs. That's what the authors find.

2. They contain a high percentage of highly repetitive DNA resembling most of the junk DNA in the genome. That's what the authors find.

3. The nucleotide sequence of the predicted exons resembles that of non-coding DNA and differs considerably from the sequences of the surrounding true exons. This is exactly what the authors find.

The authors do not question the validity of the EST databases in predicting alternative splicing and new exons but there are some papers that do. It's time we started to pay attention. If the databases are wrong then papers like this one are completely useless. They don't tell us a damn thing about the evolution of exons because those new exons are artifacts.

Great points. A friend of mine has been searching the databases to identify examples of TEs inserted into protein coding genes. He won't accept a gene into his data set unless the protein (not DNA or RNA) has been sequenced and shown to contain the TE cassette. Judging by the literature in this field, you'd think he'd be able to find tons and tons of these examples.

To date, I don't think he has identified a single one (or maybe one or two). Granted, protein sequence databases are far more sparse than genomic or EST sequence databases. But it appears that many (or most, or possibly all) examples of TEs inserted into protein coding genes are false positives.
