Are Transcription Factors Enriched in Every Dataset?

Is it just me or does every analysis that looks for over-represented gene ontology (GO) terms turn up transcription factors? It doesn't matter if the study is looking for genes under positive selection or something else. It just seems like transcription factors are enriched in every dataset.

Rich,

Every protein that I've ever worked on (be it a cytoskeletal, RNA-binding or membrane-associated protein) has been described as a transcription factor in some crappy paper. In fact, I've been contemplating writing a post entitled "Why Does Every Freakin' Protein Have a Night Job as a Transcription Factor?"

Okay, so there are a bunch of proteins that are misannotated as transcription factors. Is it logical to assume that they are distributed randomly amongst all proteins? If so, this shouldn't lead to the over-representation of transcription factors in various datasets.

Is it just me or does every analysis that looks for over-represented gene ontology (GO) terms turn up transcription factors?

I think it might just be you :)

Which papers are you thinking of?

A lot of GO enrichment analyses are biased by gene length. For example, if you're looking for enrichment in genes that have, say, some miRNA binding site or other sequence motif, then longer genes are more likely to have such binding sites by chance. If you just use a hypergeometric distribution (treating every gene as an equivalent "ball in a bag") to look for a GO enrichment, as is very common, the apparent significance of categories full of long genes will be inflated. I am not sure if this applies to transcription factors, but metazoan nervous system genes tend to have long UTRs and come up (questionably) in these analyses all the time. Of course, one could argue that the longer UTRs might reflect the biology of more complex regulation and shouldn't be argued away.
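
For anyone who hasn't seen the "ball in a bag" test written out, here is a minimal sketch in Python. The function name and all the gene counts are made up for illustration; they are not from any real dataset or GO tool. It computes the chance of seeing at least a given number of annotated genes in a gene list if every gene were equally likely to be picked, which is exactly the assumption that gene length can violate.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: total number of genes in the background ("balls in the bag")
    annotated:  background genes carrying the GO term of interest
    sample:     size of the gene list being tested
    hits:       genes in the list that carry the GO term

    Assumes every gene is equally likely to be drawn, regardless of length.
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    # Sum the probability of every outcome at least as extreme as the observed one.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 genes, 1,500 annotated as transcription
# factors, and a list of 200 genes containing 30 of them (15 expected by chance).
p = hypergeom_enrichment_pvalue(20000, 1500, 200, 30)
print(f"P(X >= 30) = {p:.3g}")
```

If long genes are both more likely to land in the list and more common in a given category, the real null distribution is not this one, and the p-value above will look better than it should.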

I love reading papers from people who are too purely computational and get excited about enrichments in "macromolecular biosynthesis" or "cytosol". Thanks for narrowing it down for us :)

Heh! The most common annotation in GO data is "unknown". Summary: truth be told, we know sod all about what most genes do.

The last GO analysis I ran showed enrichment for "unknown function", "unknown component" and "unknown process". Conclusion: we know less than sod all about my particular system...

By Peter Ellis (not verified) on 14 Nov 2007

Another possibility is that genomics folk tend to report enrichment of transcription factors as often as they possibly can, in part because it's one of the easiest types of overrepresentation to weave into a story about how your set of upregulated genes is mechanistically involved in subject X.

Also: Peter, I'm totally with you on the "unknown" genes. They're routinely the most prevalent in my own GO-type analysis. Either I'm really breaking ground or completely barking up the wrong tree... :-)
