This is a fun puzzle. The pink molecule is a protein and the other molecule is a nucleic acid.

[Image: Picture 2.png, showing the pink protein and the nucleic acid]

If I gave you the amino acid sequence of this protein, or the nucleotide sequence of this nucleic acid, what is the probability of finding a similar sequence in a different species (picked at random)?

A. High
B. Medium
C. Low
D. It depends on the database that you’re searching.

You can have more than one answer.

Now, here’s the hard part. Explain why you think your answer is correct.


  1. #1 DAG
    December 7, 2007

    If you’re searching the protein database, you won’t find the NA. And the reverse is also true.
    It depends on what you mean by “similar;” it also depends on what the protein is, which nucleic acid that is, and what the NA codes for. Conservation of sequence identity is much higher for certain proteins and nucleic acid sequences than others. It all depends.

  2. #2 Sandra Porter
    December 7, 2007

    Okay – I’ll add a few more criteria.

    1. You are searching the database that contains the appropriate molecule – that is, you’re looking for nucleotides in a nucleotide database and proteins in a protein database.

    2. What do I mean by similar? This can be kind of fuzzy and I haven’t looked at a large enough sample so my guesses may be off. For now, let’s say for the protein that at least 30% of the amino acids are either identical or conserved. For the nucleic acid, how about 60% identity.
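
A minimal sketch of the two cutoffs proposed above, assuming the sequences have already been aligned to equal length. The `CONSERVED_GROUPS` table and all function names are hypothetical, chosen only for illustration; real search tools score conservation with substitution matrices rather than fixed groups.

```python
# Hypothetical grouping of chemically similar amino acids (an assumption for
# this sketch; real tools use substitution matrices such as BLOSUM62).
CONSERVED_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("STNQ")]

PROTEIN_CUTOFF = 30.0      # percent identical or conserved, from the comment
NUCLEOTIDE_CUTOFF = 60.0   # percent identical, from the comment


def is_identical(x, y):
    """True when two aligned residues are the same letter."""
    return x == y


def is_identical_or_conserved(x, y):
    """True when two residues match or fall in the same hypothetical group."""
    return x == y or any(x in g and y in g for g in CONSERVED_GROUPS)


def percent_matching(a, b, match=is_identical):
    """Percentage of aligned positions for which match(x, y) is true."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    hits = sum(match(x, y) for x, y in zip(a, b))
    return 100.0 * hits / len(a)


def protein_is_similar(a, b):
    return percent_matching(a, b, is_identical_or_conserved) >= PROTEIN_CUTOFF


def nucleotide_is_similar(a, b):
    return percent_matching(a, b) >= NUCLEOTIDE_CUTOFF
```

For example, `percent_matching("ACGTACGTAC", "ACGTACGGGG")` counts 7 matching positions out of 10, which clears the 60% nucleotide cutoff.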

  3. #3 Trey
    December 7, 2007


  4. #4 Sandra Porter
    December 7, 2007

    That’s a guess.

    You can’t get it right unless you can explain why you think the probability would be medium.

  5. #5 Paul
    December 7, 2007

    Naively I’d say:
    A. for the nucleotide sequence
    C. for the protein sequence
    and D. for both (since the size of the database will have a huge effect on the above probabilities).

    That’s based on the assumptions below:

    1. It looks like you’re searching for ~8 bases in comparison to err… ~100 amino acids?
    2. There are 130 billion bases in Genbank/Refseq (from ~20 Million sequences)
    3. There are a significantly smaller number of amino acids in UniProtKB/TrEMBL (~ 4 Million proteins – not sure on the aa count).
    4. There are only 4 bases in the nucleotide “alphabet”.
    5. There are 20 amino acids in the protein “alphabet”.

    So, if all 20 million entries in Genbank were 8 mers then we’d expect to find er… 1/65536 * 20,000,000 = 305 hits? (naive probability of any given 8 mer is 1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4 = 1/65536)

    How does this look for the protein sequence?

    Naive probability of any given 100 aa sequence = 1/20^100 = 7.9E-131;
    7.9E-131 * 4,000,000 = 3.2E-124 hits. Not that many then…
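
The naive estimate in this comment can be reproduced with a short sketch. It keeps the comment's own simplifications: every database entry is treated as a single sequence of the query's length, and every letter of the alphabet is equally likely at each position. The function names are made up for illustration.

```python
from fractions import Fraction


def naive_match_probability(alphabet_size, length):
    """Chance that one random sequence of this length equals a given query."""
    return Fraction(1, alphabet_size ** length)


def expected_hits(alphabet_size, length, n_entries):
    """Expected number of exact matches among n_entries random sequences."""
    return naive_match_probability(alphabet_size, length) * n_entries


# 8-mer over 4 bases against ~20 million entries (figures from the comment).
nucleotide_hits = expected_hits(4, 8, 20_000_000)

# 100-mer over 20 amino acids against ~4 million entries.
protein_hits = expected_hits(20, 100, 4_000_000)

print(f"{float(nucleotide_hits):.0f}")   # prints 305
print(f"{float(protein_hits):.1e}")      # prints 3.2e-124
```

Exact fractions are used so that the very small protein probability is not rounded until it is printed.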

  6. #6 Sandra Porter
    December 7, 2007

    Paul: that’s an interesting way to approach the problem, but the answer isn’t strictly related to probability.

    Look for the answer in the biology.

  7. #7 Paul
    December 7, 2007

    Is that the “bottom” section of a t-RNA molecule? I just assumed that it was a transcription factor bound to DNA. If that protein is a ribosome subunit as well, then I’d go for:

    A: protein
    B: RNA (I don’t know much about t-RNA conservation across species)

  8. #8 Sandra Porter
    December 7, 2007

    Here’s a clue: the identity of the molecules doesn’t matter as much as the relationship between them.

  9. #9 Paul
    December 7, 2007

    Actually, the identity of the molecules does matter as much as the relationship between them (from my perspective).

    If this is something like a histone/DNA interaction that’s essential to the viability of the organism then the likelihood of finding sequence conservation (for either nucleotide or protein) is high for any randomly picked organism. But if this is a transcription factor interacting with a promoter/inhibitor element and isn’t essential for the viability of the organism then the likelihood of finding any hits is lower. Especially if this example came from a species which has experienced genome duplication in the past.

  10. #10 Sandra Porter
    December 7, 2007

    You can’t assume that transcription factors are unessential. Lots of transcription factors, e.g. the HOX genes, are essential for viability.

    Anyway, the molecules that you see above are both essential.

  11. #11 Paul
    December 7, 2007

    lol! I wasn’t implying that all TFs are unessential, I was saying that *some* are unessential and more “Free to Evolve”.

  12. #12 Sandra Porter
    December 7, 2007

    You’re getting close, I think. These two molecules are essential, but many essential molecules can still evolve – at least in some areas.

    Certainly, some of that freedom is related to the copy number. In this case, though, copy number isn’t the critical factor that controls whether these can change or not.

    Here’s another hint: the answer is in the picture

  13. #14 Luna_the_cat
    December 9, 2007

    From your given definition of “similar” as 30% identity/conserved for amino acids, 60% identity nucleotides — then for amino acids, medium to high, depending on the actual protein, and for nucleotides, very, very low.

    First off, the amino acid sequence can be 100% identical, but the nucleotide sequence 60% identical or below at the same time, solely through 3rd-base wobble. Synonymous mutations aren’t selected against (at least they usually aren’t in most circumstances, as far as we can tell?), so they can accrue freely. Also, though, it depends on whether or not you are looking at a eukaryotic gene, and whether you are actually looking at the raw genomic sequence as opposed to cDNA/mRNA, and if you include introns and 5′ and 3′ UTRs in your definition of the gene. If it is eukaryotic, you are looking at the raw genomic sequence, and you are including untranslated functional elements and introns, then there could be a very high degree of change indeed without ever affecting the aa sequence.

    On the amino acid side, however, the main aspect of conservation is going to be the functional sites — and it is going to depend on whether this is a “core function” protein, or something where high variation is actually adaptive, like a recognition protein in the innate immune system, where the evolution of multiple/variant binding sites is a way good thing. Otherwise, the amino acids in the functional sites will constrain variation, but as long as the functional faces are presented to the environment appropriately (e.g. not blocked by a novel fold or turn in the secondary structure) the sequence surrounding the functional sites can vary. Nevertheless, because there are non-synonymous mutations which result in an incompatible amino acid changing the folding (or making necessary folding impossible, like a hydrophilic to hydrophobic change in a residue), this too can be constrained by necessity.

    And after that, it just depends on the evolutionary distance you pick.

    Is that ok for a 30-second summary?
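
The point about synonymous changes can be illustrated with a small sketch. The two DNA strings below are invented for the example, and `CODON_TABLE` is only a subset of the standard genetic code, covering five amino acids whose four codons differ only at the third position.

```python
# Subset of the standard genetic code: five amino acids with four codons each.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
}


def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Invented sequences that differ at every third codon position.
gene_a = "GCTGGTCCTGTTACT"
gene_b = "GCCGGACCGGTAACG"

print(translate(gene_a), translate(gene_b))
print(round(percent_identity(gene_a, gene_b), 1))
```

Both strings translate to the same five residues, yet only 10 of their 15 bases match. Changes confined to the third codon position leave two of every three bases untouched, so this example lands at about 67% identity. Going lower would need synonymous changes at other positions, as with the six-codon amino acids serine, leucine and arginine, which are left out of the table here.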

  14. #15 Luna_the_cat
    December 9, 2007

    Sorry, I just realised — the discussion I included about nucleotide sequence doesn’t really apply here, we’re not talking so much about any given gene. Blame beer?

  15. #16 Sandra Porter
    December 10, 2007

    I’ll post my answer Friday.

  16. #17 skerr
    December 13, 2007

    It appears that the protein is interacting with the backbone of the nucleic acid. It would seem to me that the nucleic acid sequence is unimportant to the interaction because it all has the same phosphate backbone. This would mean that there is a low probability of finding the same nucleic acid sequence. If this is true then the protein sequence is probably well conserved. If a protein’s job is to bind nucleic acid backbone, it would reach its maximum potential in early evolutionary time and undergo purifying selection. As long as the NA backbone does not change, which it hasn’t, the protein would not have an evolutionary advantage if it changed. I would say you would have a high probability of finding the AA sequence in different organisms.
