Is it just me or does every analysis that looks for over-represented gene ontology (GO) terms turn up transcription factors? It doesn’t matter if the study is looking for genes under positive selection or something else. It just seems like transcription factors are enriched in every dataset.


  1. #1 apalazzo
    November 13, 2007


    Every protein that I’ve ever worked on (be it a cytoskeletal, RNA-binding or membrane-associated protein) has been described as a transcription factor in some crappy paper. In fact I’ve been contemplating writing a post entitled “Why Does Every Freakin’ Protein Have a Night Job as a Transcription Factor?”

  2. #2 RPM
    November 13, 2007

    Okay, so there are a bunch of proteins that are misannotated as transcription factors. Is it logical to assume that they are distributed randomly amongst all proteins? If so, this shouldn’t lead to the over-representation of transcription factors in various datasets.

  3. #3 p-ter
    November 13, 2007

    Is it just me or does every analysis that looks for over-represented gene ontology (GO) terms turn up transcription factors?

    I think it might just be you 🙂

    which papers are you thinking of?

  4. #4 GeoMor
    November 14, 2007

    A lot of GO enrichment analyses are biased by gene length. For example, if you’re looking for enrichment in genes that have, say, some miRNA binding site or other sequence motif, then longer genes are more likely to have such binding sites by chance. If you just use a hypergeometric distribution (treating every gene as an equivalent “ball in a bag”) to look for a GO enrichment, as is very common, the significance of long genes will be amplified. I am not sure if this applies to transcription factors, but metazoan nervous system genes tend to have long UTRs and come up (questionably) in these analyses all the time. Of course, one could argue that the longer UTRs might reflect the biology of more complex regulation and shouldn’t be argued away.

  5. #5 OneRandomScientist
    November 14, 2007

    I love reading papers from people who are too purely computational and get excited about enrichments in “macromolecular biosynthesis” or “cytosol”. Thanks for narrowing it down for us 🙂

  6. #6 Peter Ellis
    November 15, 2007

    Heh! The most common annotation in GO data is “unknown”. Summary: truth be told, we know sod all about what most genes do.

    The last GO analysis I ran showed enrichment for “unknown function”, “unknown component” and “unknown process”. Conclusion: we know less than sod all about my particular system…

  7. #7 CPatil@lbl.gov
    November 20, 2007

    Another possibility is that genomics folk tend to report enrichment of transcription factors as often as they possibly can, in part because it’s one of the easiest types of overrepresentation to weave into a story about how your set of upregulated genes is mechanistically involved in subject X.

    Also: Peter, I’m totally with you on the “unknown” genes. They’re routinely the most prevalent in my own GO-type analysis. Either I’m really breaking ground or completely barking up the wrong tree… 🙂


  8. #8 RPM
    November 26, 2007

    Here is another dataset.
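
For anyone who wants to see the “ball in a bag” test from comment #4 spelled out, here is a minimal sketch of a one-sided hypergeometric enrichment p-value in plain Python. The function name and the parameter names (`n_genes`, `n_annotated`, `n_selected`, `n_hits`) are made up for illustration and do not come from any particular GO tool. Note that it treats every gene as equally likely to be drawn, which is exactly the assumption the gene-length argument calls into question; it makes no attempt to correct for length.

```python
from math import comb


def hypergeom_enrichment_p(n_genes, n_annotated, n_selected, n_hits):
    """One-sided hypergeometric p-value, P(X >= n_hits).

    n_genes     -- total genes in the background (hypothetical name)
    n_annotated -- background genes carrying the GO term
    n_selected  -- genes in the list of interest
    n_hits      -- genes in the list that carry the GO term

    Assumes every gene is an equally likely draw; gene length is ignored.
    """
    upper = min(n_selected, n_annotated)
    # Count the ways to draw i annotated and (n_selected - i) unannotated genes,
    # for every i at least as extreme as the observed overlap.
    favourable = sum(
        comb(n_annotated, i) * comb(n_genes - n_annotated, n_selected - i)
        for i in range(n_hits, upper + 1)
    )
    return favourable / comb(n_genes, n_selected)
```

For instance, `hypergeom_enrichment_p(10, 5, 5, 5)` asks how likely it is that all 5 selected genes carry a term held by 5 of 10 background genes, which is 1 in 252.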
