Digital Biology Friday: Free to evolve?

This is a fun puzzle. The pink molecule is a protein and the other molecule is a nucleic acid.

[Image: Picture 2.png, a pink protein bound to a nucleic acid]

If I gave you the amino acid sequence of this protein, or the nucleotide sequence of this nucleic acid, what is the probability of finding a similar sequence in a different species (picked at random)?

A. High
B. Medium
C. Low
D. It depends on the database that you're searching.

You can have more than one answer.

Now, here's the hard part. Explain why you think your answer is correct.


If you're searching the protein database, you won't find the NA. And the reverse is also true.
It depends on what you mean by "similar;" it also depends on what the protein is, which nucleic acid that is, and what the NA codes for. Conservation of sequence identity is much higher for certain proteins and nucleic acid sequences than others. It all depends.

Okay - I'll add a few more criteria.

1. You are searching the database that contains the appropriate molecule - that is, you're looking for nucleotides in a nucleotide database and proteins in a protein database.

2. What do I mean by similar? This can be kind of fuzzy and I haven't looked at a large enough sample so my guesses may be off. For now, let's say for the protein that at least 30% of the amino acids are either identical or conserved. For the nucleic acid, how about 60% identity.
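To make the thresholds concrete, here is a minimal sketch of a percent-identity check. It assumes the two sequences are already aligned and the same length, with no gaps; real database searches use alignment tools and, for proteins, a substitution matrix to decide what counts as "conserved", which this sketch does not model. The function name and the example sequences are made up for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions where the two sequences match.

    Assumes the sequences are pre-aligned and of equal length (no gaps).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100 * matches / len(seq_a)


# Hypothetical nucleotide fragments differing at 2 of 10 positions.
query = "ACGTACGTAC"
hit = "ACGTTCGAAC"

NUCLEOTIDE_THRESHOLD = 60.0  # the 60% identity cutoff suggested above
score = percent_identity(query, hit)
print(score, score >= NUCLEOTIDE_THRESHOLD)
```

The same function works on amino acid strings, but it only counts identical residues, so it is stricter than the "identical or conserved" rule proposed for the protein.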

Naively I'd say:
A. for the nucleotide sequence
C. for the protein sequence
and D. for both (since the size of the database will have a huge effect on the above probabilities).

That's based on the assumptions below:

1. It looks like you're searching for ~8 bases in comparison to err... ~100 amino acids?
2. There are 130 billion bases in Genbank/Refseq (from ~20 Million sequences)
3. There are a significantly smaller number of amino acids in UniProtKB/TrEMBL (~ 4 Million proteins - not sure on the aa count).
4. There are only 4 bases in the nucleotide "alphabet".
5. There are 20 amino acids in the protein "alphabet".

So, if all 20 million entries in Genbank were 8 mers then we'd expect to find er... 1/65536 * 20,000,000 = 305 hits? (naive probability of any given 8 mer is 1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4 = 1/65536)

How does this look for the protein sequence?

Naive probability of any given 100 aa sequence = 1/20^100 = 7.9E-131;
7.9E-131 * 4,000,000 = 3.2E-124 hits. Not that many then...
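The back-of-envelope arithmetic above can be reproduced with a short sketch. It keeps the same naive assumptions: every letter is equally likely, every database entry is exactly as long as the query, and only an exact match counts. The function name is hypothetical, and the entry counts are the rough figures quoted in the comment, not current database sizes.

```python
from fractions import Fraction


def expected_exact_hits(alphabet_size: int, length: int, n_entries: int) -> Fraction:
    """Expected number of entries identical to one specific sequence.

    Naive model: uniform random letters, every entry has exactly `length`
    letters, and only perfect matches count.
    """
    return Fraction(n_entries, alphabet_size ** length)


# ~8 bases against ~20 million nucleotide entries.
dna_hits = expected_exact_hits(alphabet_size=4, length=8, n_entries=20_000_000)

# ~100 amino acids against ~4 million protein entries.
protein_hits = expected_exact_hits(alphabet_size=20, length=100, n_entries=4_000_000)

print(f"8-mer, 4-letter alphabet:    {float(dna_hits):.0f} expected hits")
print(f"100-mer, 20-letter alphabet: {float(protein_hits):.2e} expected hits")
```

Note that the per-sequence probability for the protein is (1/20)^100, one factor of 1/20 per residue, in the same way that the 8-mer probability is (1/4)^8.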

Paul: that's an interesting way to approach the problem, but the answer isn't strictly related to probability.

Look for the answer in the biology.

Is that the "bottom" section of a tRNA molecule? I just assumed that it was a transcription factor bound to DNA. If that protein is a ribosome subunit as well, then I'd go for:

A: protein
B: RNA (I don't know much about tRNA conservation across species)

Actually, the identity of the molecules does matter as much as the relationship between them (from my perspective).

If this is something like a histone/DNA interaction that's essential to the viability of the organism, then the likelihood of finding sequence conservation (for either nucleotide or protein) is high for any randomly picked organism. But if this is a transcription factor interacting with a promoter/inhibitor element and isn't essential for the viability of the organism, then the likelihood of finding any hits is lower, especially if this example came from a species which has experienced genome duplication in the past.

You can't assume that transcription factors are unessential. Lots of transcription factors, e.g. the HOX genes, are essential for viability.

Anyway, the molecules that you see above are both essential.

lol! I wasn't implying that all TFs are unessential, I was saying that *some* are unessential and more "Free to Evolve".

You're getting close, I think. These two molecules are essential, but many essential molecules can still evolve - at least in some areas.

Certainly, some of that freedom is related to the copy number. In this case, though, copy number isn't the critical factor that controls whether these can change or not.

Here's another hint: the answer is in the picture

From your given definition of "similar" as 30% identity/conserved for amino acids, 60% identity nucleotides -- then for amino acids, medium to high, depending on the actual protein, and for nucleotides, very, very low.

First off, the amino acid sequence can be 100% identical, but the nucleotide sequence 60% identical or below at the same time, solely through 3rd-base wobble. Synonymous mutations aren't selected against (at least they usually aren't in most circumstances, as far as we can tell?), so they can accrue freely. Also, though, it depends on whether or not you are looking at a eukaryotic gene, and whether you are actually looking at the raw genomic sequence as opposed to cDNA/mRNA, and if you include introns and 5' and 3' UTRs in your definition of the gene. If it is eukaryotic, you are looking at the raw genomic sequence, and you are including untranslated functional elements and introns, then there could be a very high degree of change indeed without ever affecting the aa sequence.
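A small sketch can illustrate how far nucleotide identity can drop while the protein stays 100% identical. It uses the standard genetic code and three made-up 15-base coding sequences that all translate to the same peptide. One caveat the sketch makes visible: changing only the third base of each codon leaves two of every three bases untouched, so identity cannot fall below roughly 67% that way. Going lower needs amino acids with six codons (Leu, Ser, Arg), where the first or second base can differ too.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence (length divisible by 3) into amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a: str, b: str) -> float:
    """Percent identity of two equal-length, pre-aligned sequences."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences, all encoding Leu-Ser-Arg-Gly-Ala.
reference = "CTTTCTCGTGGTGCT"
third_base_only = "CTCTCCCGCGGCGCC"  # every codon differs only at position 3
six_codon_swaps = "TTAAGCAGAGGCGCC"  # Leu/Ser/Arg use their alternative codon families

for name, seq in [("third base only", third_base_only),
                  ("six-codon swaps", six_codon_swaps)]:
    print(name, translate(seq), f"{identity(reference, seq):.1f}% nucleotide identity")
```

These sequences are deliberately extreme; real genes would usually sit somewhere between the two cases.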

On the amino acid side, however, the main aspect of conservation is going to be the functional sites -- and it is going to depend on whether this is a "core function" protein, or something where high variation is actually adaptive, like a recognition protein in the innate immune system, where the evolution of multiple/variant binding sites is a way good thing. Otherwise, the amino acids in the functional sites will constrain variation, but as long as the functional faces are presented to the environment appropriately (e.g. not blocked by a novel fold or turn in the secondary structure) the sequence surrounding the functional sites can vary. Nevertheless, because there are non-synonymous mutations which result in an incompatible amino acid changing the folding (or making necessary folding impossible, like a hydrophilic to hydrophobic change in a residue), this too can be constrained by necessity.

And after that, it just depends on the evolutionary distance you pick.

Is that ok for a 30-second summary?

By Luna_the_cat (not verified) on 09 Dec 2007

Sorry, I just realised -- the discussion I included about nucleotide sequence doesn't really apply here, we're not talking so much about any given gene. Blame beer?

By Luna_the_cat (not verified) on 09 Dec 2007

It appears that the protein is interacting with the backbone of the nucleic acid. It would seem to me that the nucleic acid sequence is unimportant to the interaction because it all has the same phosphate backbone. This would mean that there is a low probability of finding the same nucleic acid sequence. If this is true, then the protein sequence is probably well conserved. If a protein's job is to bind the nucleic acid backbone, it would reach its maximum potential early in evolutionary time and then undergo purifying selection. As long as the NA backbone does not change, which it hasn't, the protein would not gain an evolutionary advantage by changing. I would say you would have a high probability of finding the AA sequence in different organisms.