Next Generation Sequencing adds thousands of new genes

I had the good fortune on Thursday to hear a fascinating talk on deep transcriptome analysis by Chris Mason, Assistant Professor at the Institute for Computational Biomedicine at Cornell University.

Several intriguing observations were presented during the talk.  I'll present the key points first and then discuss the data.

These data concern the human transcriptome, and at least some of the results are supported by follow-on studies with data from the pigmy tailed macaque.

Some of the most interesting points from Mason's talk were:

  1. A large fraction of the existing genome annotation is wrong.
  2. We have far more than 30,000 genes, perhaps as many as 88,000. 
  3. About ten thousand genes use over 6 different sites for polyadenylation.
  4. 98% of all genes are alternatively spliced.
  5. Several thousand genes are transcribed from the "anti-sense" strand.
  6. Lots of genes don't code for proteins.  In fact, most genes don't code for proteins.

Mason also described the discovery of 26,187 new genes that were present in at least two different tissue types.  

This shakes things up a bit. 

What data support these claims? And what does this have to do with the definition of a gene?

The data and analyses come from work that Mason has been involved in during the past few years (1-5). Much of the data came from the SEQC consortium, a group established to evaluate the reproducibility of Next Generation Sequencing technologies.  The SEQC project was initiated by the same group (MAQC) that examined reproducibility in microarrays (4).  The transcriptomes in these experiments were characterized using NGS RNA-Seq data from Roche (454), Helicos, Illumina (formerly Solexa), and LifeTech (formerly ABI) from 16 different human tissues.  Some of the analyses came from a collaboration with Geospiza (3).

Background: What is a transcriptome?

A transcriptome is the complete collection of all the RNA molecules in a cell. Figure 1 shows many types of RNA that have been classified so far.  All of these molecules are called transcripts since they're produced by transcription. 

I think it's interesting that 11 different types of RNA are shown below and only one type codes for protein (mRNA). 

Fig. 1.  RNA drawing from FinchTalk used with permission from Todd Smith, Geospiza, Inc. (6).

What have we been measuring and how did we get so many things wrong?

Mason began the seminar by reminding us that until 2009, our knowledge of the human transcriptome was based on a small number of cDNA libraries of questionable quality.

To put this information in perspective, I'm including a table that summarizes the total number of sequences in dbEST in 2009.  At that time, about 8 million sequences were available from humans.  It should also be noted that many of these cDNA libraries came from tumors or other unusual tissue types, which may have altered the composition of their transcriptomes relative to normal tissues. 

Fig. 2.  Image from FinchTalk used with permission from Geospiza.

Eight million sequences sounds like quite a bit and it does represent 4-5 Gigabases of transcriptome sequence data.  Today, however, we have over 100 times more.  SEQC alone has obtained 600 Gb of RNA sequence data from sixteen human tissues and tens of billions of RNA molecules.  All this extra data has given us a much more comprehensive picture of the activities inside a cell and the ways the human genome gets put to use.
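As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. The only inputs are the numbers quoted above (8 million sequences, 4-5 Gigabases, 600 Gb); the variable and function names are mine, and the results are back-of-the-envelope estimates, not values presented in Mason's talk.

```python
# Back-of-the-envelope comparison of dbEST (2009) with the SEQC RNA-Seq data.
# All inputs are figures quoted in the post; the names are illustrative only.

DBEST_SEQUENCES = 8_000_000            # human sequences in dbEST in 2009
DBEST_BASES_LOW = 4_000_000_000        # lower estimate: 4 Gigabases
DBEST_BASES_HIGH = 5_000_000_000       # upper estimate: 5 Gigabases
SEQC_BASES = 600_000_000_000           # 600 Gb of RNA sequence from SEQC


def fold_increase(new_bases, old_bases):
    """Return how many times larger the new data set is than the old one."""
    return new_bases / old_bases


def mean_length(total_bases, n_sequences):
    """Return the average sequence length implied by the two totals."""
    return total_bases / n_sequences


# Smallest and largest fold increase, depending on the dbEST estimate used.
fold_low = fold_increase(SEQC_BASES, DBEST_BASES_HIGH)
fold_high = fold_increase(SEQC_BASES, DBEST_BASES_LOW)

# Average dbEST sequence length implied by 4-5 Gigabases over 8 million sequences.
len_low = mean_length(DBEST_BASES_LOW, DBEST_SEQUENCES)
len_high = mean_length(DBEST_BASES_HIGH, DBEST_SEQUENCES)

print(f"SEQC is {fold_low:.0f}- to {fold_high:.0f}-fold larger than dbEST")
print(f"Implied mean dbEST sequence length: {len_low:.0f}-{len_high:.0f} bases")
```

By this estimate, SEQC alone comes to 120 to 150 times the 2009 dbEST total, which is consistent with "over 100 times more."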

Collecting and analyzing more data has emphasized how little we knew before and how much has changed.

The larger volume of data has also led to the conclusion that many of the annotations in RefSeq and Ensembl are incorrect or at least incomplete.  Even in June 2009, comparing AceView with the data from MAQC and RefSeq indicated that many exons were missing (5).  Mason pointed out there are at least twice as many exons as were thought, and many more transcripts are spliced in new ways and polyadenylated at different locations.

How do we identify genes in RNA-Seq data?

There are several data analysis pipelines that researchers use. Each pipeline is specific for a particular type of analysis and there can be many steps depending on the research question.  The slide in the seminar had 20-50 little boxes of different operations. 

The pipelines that I've used and am most familiar with identify protein-coding genes by aligning RNA-Seq data to annotated data from sources like RefSeq.  After generating the alignments, the number of aligning sequences is counted for each position.  Since each alignment represents a transcript, the alignments allow us to count the number of RNA molecules produced from every gene.
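The counting step can be illustrated with a toy example. This is a minimal sketch, not the pipeline I used or the one Mason described: the gene names, coordinates, and the `count_reads_per_gene` function are invented for illustration, and real pipelines read alignment files and have to deal with strand, spliced reads, and reads that map to more than one place.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
ANNOTATION = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical alignments: (chromosome, start, end) for each aligned read.
ALIGNMENTS = [
    ("chr1", 120, 170),
    ("chr1", 450, 500),
    ("chr1", 480, 530),   # overlaps the end of geneA
    ("chr1", 900, 950),
    ("chr1", 600, 650),   # falls between the two genes, so it is not counted
    ("chr2", 120, 170),   # different chromosome, so it is not counted
]


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def count_reads_per_gene(alignments, annotation):
    """Count the alignments that overlap each annotated gene."""
    counts = Counter({gene: 0 for gene in annotation})
    for chrom, start, end in alignments:
        for gene, (g_chrom, g_start, g_end) in annotation.items():
            if chrom == g_chrom and overlaps(start, end, g_start, g_end):
                counts[gene] += 1
    return counts


counts = count_reads_per_gene(ALIGNMENTS, ANNOTATION)
print(dict(counts))
```

Reads that match nothing in the annotation simply drop out of the counts here, which is exactly why an annotation-based pipeline cannot find new genes on its own.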

If the sequences do not align to RefSeq, they might be identified through alignments to other databases, such as the databases for microRNAs, Ensembl, or AceView.

An alternative approach, described in Mason's seminar, is to take the RNA-Seq data and assemble it. From the assembled data, Mason's group found that thousands of repetitive elements are expressed in a tissue-specific manner.  In the non-repetitive DNA, they found about 26,000 new "genes."  Many of these new genes were expressed at low levels and transcribed from the opposite strand of known genes, in regions of introns. Further, these new genes do not code for proteins and their function is unknown.
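The two criteria mentioned so far (present in at least two tissue types, and not from repetitive sequence) can be sketched as a simple selection step. This is only an illustration under my own assumptions: the record layout, the `is_candidate_new_gene` function, and the example loci are hypothetical, and Mason listed several other filters that are not captured here.

```python
from dataclasses import dataclass, field

MIN_TISSUES = 2  # "present in at least two different tissue types"


@dataclass
class AssembledLocus:
    """A hypothetical record for one transcribed region from the assembly."""
    name: str
    tissues: set = field(default_factory=set)  # tissues where it was detected
    repetitive: bool = False                   # overlaps a repetitive element
    annotated: bool = False                    # already in an existing annotation


def is_candidate_new_gene(locus, min_tissues=MIN_TISSUES):
    """Apply the two filters from the text to an unannotated locus."""
    return (
        not locus.annotated
        and not locus.repetitive
        and len(locus.tissues) >= min_tissues
    )


# Invented example loci, one for each way of passing or failing the filters.
loci = [
    AssembledLocus("locus1", {"brain", "liver"}),
    AssembledLocus("locus2", {"brain"}),
    AssembledLocus("locus3", {"brain", "liver", "heart"}, repetitive=True),
    AssembledLocus("locus4", {"kidney", "heart"}, annotated=True),
]

candidates = [locus.name for locus in loci if is_candidate_new_gene(locus)]
print(candidates)
```

Note that nothing in a filter like this says anything about function; it only says that a transcript was seen reproducibly.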

How do all these new transcripts fit into our definition of a gene?

At one time we considered genes to be regions of DNA that coded for proteins.  That definition changed when we realized that ribosomes contained non-coding RNA and expanded with the realization that there are many types of enzymatically active RNA molecules inside a cell.  tRNAs, microRNAs, and assorted regulatory RNAs have given further insights into the many roles that RNAs play.  Now, we know from Mason's work and others (7) that most of the RNA in a cell doesn't code for proteins.

We also used to dismiss some transcripts as pseudogenes and other transcripts as "noise."  Maybe we were wrong, but it's still not clear which transcripts are encoded by genes and which represent "noise."

Maybe every DNA region that produces a transcript is a gene.

One thing is certain.  We won't be able to count the number of genes in a cell until we can agree on what a gene is.

References:

1.  Chris Mason's seminar 1-6-2011,  UW Systems Biology speaker series.

2.  Marioni, J., Mason, C., Mane, S., Stephens, M., & Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18 (9), 1509-1517. DOI: 10.1101/gr.079558.108

3.  Mason, C.E., Zumbo, P., Sanders, S., Folk, M., Robinson, D., Aydt, R., Gollery, M., Welsh, M., Olson, N.E., & Smith, T.M. (2010). Standardizing the next generation of bioinformatics software development with BioHDF (HDF5). Advances in Experimental Medicine and Biology, 680, 693-700. PMID: 20865556

4.  Shi, L., et al. (2010). The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology, 28 (8), 827-838. DOI: 10.1038/nbt.1665

5.  Mane, S., Evans, C., Cooper, K., Crasta, O., Folkerts, O., Hutchison, S., Harkins, T., Thierry-Mieg, D., Thierry-Mieg, J., & Jensen, R. (2009). Transcriptome sequencing of the Microarray Quality Control (MAQC) RNA reference samples using next generation sequencing. BMC Genomics, 10 (1). DOI: 10.1186/1471-2164-10-264

6.  Todd Smith.  May, 2009 FinchTalk.  Small RNAs get smaller.

7.  Kapranov, P., St. Laurent, G., Raz, T., Ozsolak, F., Reynolds, C., Sorensen, P., Reaman, G., Milos, P., Arceci, R., Thompson, J., & Triche, T. (2010). The majority of total nuclear-encoded non-ribosomal RNA in a human cell is 'dark matter' un-annotated RNA. BMC Biology, 8 (1). DOI: 10.1186/1741-7007-8-149

I believe that Mason's opinion on the number of genes is an extreme position. There is certainly a lot of incomplete annotation, and the gene models are being improved as more alt-splicing data becomes available. The non-coding RNA genes are certainly underannotated, but I believe that Mason may be underestimating the amount of junk in the RNA-seq data. I'm not willing (yet) to believe that everything transcribed by a eukaryote is a gene. Not even tissue-specific expression is sufficient, as that may be driven by chromatin structure without implying function for everything transcribed.

Colour me skeptical on the alleged number of new genes, but Mason is certainly correct about the pervasiveness of problems with existing gene annotations. Throwing lots of transcriptome data at the problem will help, but ultimately the only way to resolve the issues is painstaking manual reannotation by groups like HAVANA (who I've been working with in analysing predicted functional variants from the 1000 Genomes Project).

So long as the annotation errors remain in the databases, clinical sequencing projects are going to suffer from false positives and false negatives from bad gene models. I think there are a lot of people doing these projects right now who aren't aware how much of a problem this is; they'll learn pretty quickly...

Not to belittle all the small regulatory RNAs that were unknown 15 years ago, but it seems to me that the transcription machinery simply makes false positives all the time and transcribes junk DNA for nothing. Who was it who came up with the headline "Cells awash in useless RNA"?

By David Marjanović (not verified) on 08 Jan 2011

It's mostly junk RNA resulting from inappropriate or spurious transcription.

When you have a lot of junk DNA in your genome, as we do, then it's inevitable that various transcription factors will bind to that junk DNA and stimulate low levels of transcription. The transcripts have no biological significance beyond the fact that they are accidents. We've been teaching the rules of non-specific binding for three decades. Has everyone forgotten how proteins bind to DNA?

It's also inevitable that the splicing apparatus will make mistakes and current technology is quite capable of detecting the transcripts that are incorrectly spliced. This does not mean that you have detected biologically significant alternative splicing.

It's not a gene unless the transcript has a biological function. One way to tell this is to show that the same region of DNA is conserved and transcribed in other mammals.

@gasstationwithoutpumps, Daniel, and David:

What criteria would you use to decide that a transcript represents a gene?

@Larry: I agree about looking for conservation. I think Chris said that many (?) of these RNAs were found in pigmy tailed macaques, but I don't remember how many.

Sandra,

I agree that you need criteria to back up any extraordinary claim that our understanding of the human genome is totally wrong. What criteria did Chris Mason use to support his claim that we have up to 88,000 genes? What criteria (evidence) did he present to show that 98% of all genes are alternatively spliced?

I suspect he's simply defining a gene as a region of DNA that's complementary to a transcript that's been detected. I suspect he never mentioned whether those transcripts were present at a concentration of less than one per cell.

Am I right?

He listed several filters that were used, but I didn't write them down.

What I did record was that he said the transcripts for "new genes" had to be present in at least two tissue types and not be from repetitive sequences.

He's aware BTW that asserting these are new genes is controversial and he is open to discussion.

How redonkulous.

What criteria would you use to decide that a transcript represents a gene?

I deliberately avoided that word... but now I'm with comments 4 and 7.

By David Marjanović (not verified) on 09 Jan 2011

Here's a question that I would like to add. For those of you who say the new transcripts are not genes:

Does your definition of a gene restrict genes to the sequences that have a clearly proven biological function?

And how do you know that these sequences do not have a function?

The gene concept should be abandoned in favor of base-level analyses -- which bases are expressed? which bases combine with which other bases to form what? to form which codons to form which proteins? which aggregate of bases interact with what to what effect?

By Anonymous (not verified) on 09 Jan 2011

Sandra asks,

Does your definition of a gene restrict genes to the sequences that have a clearly proven biological function?

In my case the answer is "yes." What's the alternative? Would you like to call pseudogenes "genes"?

And how do you know that these sequences do not have a function?

There are ways of trying to find out if a given sequence has a function but there's no way of "proving" that it doesn't. (You can't prove a negative.) However, when someone makes a claim that there are 88,000 genes the onus is on them to support their claim with evidence. It's not up to me to prove that most of their sequences have no function. That's not how science works.

Greetings everyone,
I welcome the skepticism from everyone and the appreciation of the incompleteness of the current gene models, and I wanted to comment here on the hypotheses listed so far. Also, I wanted to say that I too, at first, was very skeptical of all the data, but I keep seeing the same genes expressed again and again, leading me to believe they are not just noise.
First, each one of these new genes is supported by at least two tissues and at least two sequencing platforms, so I believe they exist.
As to whether or not they are transcriptional noise or gDNA contamination, both seem unlikely since they often appear on the opposite strand from an adjacent gene (indicating independent regulation) or in a tissue-specific fashion. Also, about 80% of the genome has no transcription, and this is something that is different from what ENCODE is saying, so I don't think the RNA Pol II and other machinery are as sloppy as we think (though certainly consensus motifs lead to some errors).
As for function, I agree that we just don't know yet, but creating and curating the collection is the first step. This is a focus of ongoing work in our lab, and many others (like John Rinn's lab), but without question, that is the next step.
Lastly, what is a gene? I'll define it as a heritable, independent (doesn't splice with anything nearby it) transcriptionally active region that has been validated by at least two technologies. Though we require at least two tissues for most new gene annotations, I don't include this in my gene definition, since that would throw out any tissue-specific genes.
If anyone is going to be at AGBT, I'll be giving a similar talk there.

By Christopher Mason (not verified) on 10 Jan 2011

Thanks Chris! I'm looking forward to reading the paper!

@Christopher Mason,

Can you tell us something about the abundance of the transcripts you detect? You are suggesting that there are more than 60,000 new genes in our genome that nobody was aware of. You base your estimation on the fact that you can detect complementary RNA from these regions in at least two tissues. How much RNA? Please give us an estimate of the number of complementary RNA fragments per cell.

Do you think all species have three or four times more genes than we previously thought or is it just mammalian genomes that have seen an expansion of the number of genes?

Here's my definition of a gene ... [What Is a Gene?]

A gene is a DNA sequence that is transcribed to produce a functional product.

According to this definition a transcribed pseudogene is NOT a gene and a region of DNA that's transcribed accidentally because it has a promoter-like sequence is NOT a gene, even if that transcription is reproducible. Using my definition, the onus is on the claimant to show that the RNA product has a biological function.

I'll have to agree with Larry. It appears that many of these RNAs are just transcriptional noise - if you sequence enough RNA you will probably see this same noise in a number of tissues, that doesn't mean that it is functional.

The real question is how does the cell filter out all this noise. Data from protein sequencing doesn't seem to indicate that there is as much alternative splicing as some studies show. The ENCODE data clearly shows that the non-coding transcripts (on average) are poorly conserved.

On the other hand the eukaryotic cell is clearly set up to filter this noise, but much more work needs to be done. Studies indicate that many of the products of alternative splicing are degraded by NMD. How mRNAs are selected for nuclear export is extremely complicated - we still don't understand the rules, but the more an RNA looks like it has an ORF, introns and short exons, and a well processed 3' end, the more likely it is to be exported. That all points to a sophisticated triage system to sort junk from real stuff.

I will also add that in many cases there may be some role in a region being transcribed (such as in lincRNA) to help reinforce chromatin state. What is important for such transcription is not the transcript per se but the promoter. Furthermore the maintenance of that heterochromatic or euchromatic state may not even be of any functional significance - but rather part of a feed-back regulatory switch that plays a role in a small fraction of the genome. It is highly probable that these feedback loops are turned on or off spuriously in other genomic regions without any functional consequence.

By A Palazzo (not verified) on 10 Jan 2011

Also a short quibble about Figure 1. All of the non-coding RNA listed in the figure (outside of rRNA and tRNA) are involved in RNA-DNA processing. There are other functional non-coding RNAs that act outside of nucleic acid metabolism - for example SRP RNA and TERT RNA. The big hope of the RNA community is that many other ncRNA beasts exist. Even adding 10-20 new ones would be a significant advance ... but to our chagrin projects like ENCODE and other deep sequencing projects have yet to uncover these.

By A Palazzo (not verified) on 10 Jan 2011

Thanks Alex!

It's been interesting to see what new information emerges as we get more data.

I don't quite understand how one would get transcriptional noise from the anti-sense strand, but certainly, there's more work to be done to understand what's going on.

It will be interesting when the functional roles of these transcripts (if there are functional roles) get sorted out.

Mason points out the problems with existing gene annotations, and it is true that manual reannotation of the genes is the only way to resolve the issues that can crop up due to variants of the same gene. Errors in the existing annotations in various databases can produce ambiguous analytical results, leading to erroneous conclusions in any experimental design based on such analysis. Manual reannotation of genes, as well as their variants and representative aliases, after thorough reading can help resolve such errors in the databases, which is exactly where many of the "machines" (automatic annotation) fail.

In addition to the above suggestions, maybe it would be good to follow a system like that of the ELN to manage the versions of the annotations and edits and keep them distinct and tracked.