Digital Biology Friday: Free to evolve?

By sporte on December 7, 2007.

This is a fun puzzle. The pink molecule is a protein and the other molecule is a nucleic acid.

[Image: Picture 2.png, showing the pink protein bound to the nucleic acid]

If I gave you the amino acid sequence of this protein, or the nucleotide sequence of this nucleic acid, what is the probability of finding a similar sequence in a different species (picked at random)?

A. High
B. Medium
C. Low
D. It depends on the database that you're searching.

You can have more than one answer.

Now, here's the hard part. Explain why you think your answer is correct.


If you're searching the protein database, you won't find the NA. And the reverse is also true.
It depends on what you mean by "similar"; it also depends on what the protein is, which nucleic acid that is, and what the NA codes for. Conservation of sequence identity is much higher for certain proteins and nucleic acid sequences than for others. It all depends.

Okay - I'll add a few more criteria.

1. You are searching the database that contains the appropriate molecule - that is, you're looking for nucleotides in a nucleotide database and proteins in a protein database.

2. What do I mean by similar? This can be kind of fuzzy and I haven't looked at a large enough sample so my guesses may be off. For now, let's say for the protein that at least 30% of the amino acids are either identical or conserved. For the nucleic acid, how about 60% identity.
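To make the nucleotide criterion concrete, here is a minimal Python sketch of a percent-identity check. It assumes the two sequences have already been aligned to the same length (with "-" for gaps), and the function name, the threshold constant, and the two example sequences are all made up for illustration. It does not attempt the "identical or conserved" scoring for proteins, which would need a substitution scheme that isn't specified here.

```python
def percent_identity(a, b):
    """Percent of aligned, non-gap positions where the two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore any column where either sequence has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Threshold from the criteria above (60% identity for nucleic acids).
NUCLEOTIDE_THRESHOLD = 60.0

# Hypothetical pre-aligned sequences, invented for this example.
seq1 = "ATGGCCAAGT"
seq2 = "ATGGCTAAAT"

identity = percent_identity(seq1, seq2)
is_similar = identity >= NUCLEOTIDE_THRESHOLD
print(f"{identity:.1f}% identical; similar by the 60% rule: {is_similar}")
```

Real database searches (BLAST, for example) compute identity over local alignments they build themselves, so this only illustrates the definition, not the search.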

medium

That's a guess.

You can't get it right unless you can explain why you think the probability would be medium.

Naively I'd say:
A. for the nucleotide sequence
C. for the protein sequence
and D. for both (since the size of the database will have a huge effect on the above probabilities).

That's based on the assumptions below:

1. It looks like you're searching for ~8 bases in comparison to err... ~100 amino acids?
2. There are 130 billion bases in Genbank/Refseq (from ~20 million sequences)
3. There are significantly fewer amino acids in UniProtKB/TrEMBL (~4 million proteins - not sure on the aa count).
4. There are only 4 bases in the nucleotide "alphabet".
5. There are 20 amino acids in the protein "alphabet".

So, if all 20 million entries in Genbank were 8-mers then we'd expect to find er... 1/65536 * 20,000,000 ≈ 305 hits? (naive probability of any given 8-mer is (1/4)^8 = 1/65536)

How does this look for the protein sequence?

Naive probability of any given 100 aa sequence = 1/20^100 = 7.9E-131;
7.9E-131 * 4,000,000 = 3.2E-124 hits. Not that many then...
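For anyone who wants to check the arithmetic, here is a small Python sketch of the same naive model. It keeps the simplifying assumptions used above: every database entry is treated as a single sequence of exactly the query length, and every letter is equally likely. The probability of one specific sequence is then (1/alphabet size)^length, so an 8-mer gives (1/4)^8 and a 100-residue protein gives (1/20)^100. The function name is made up for illustration.

```python
from fractions import Fraction


def naive_expected_hits(alphabet_size, length, n_entries):
    """Expected exact matches if every entry were one random sequence of this length."""
    # Probability that a single random sequence equals the query exactly.
    p_match = Fraction(1, alphabet_size) ** length
    return p_match * n_entries


# Figures taken from the comment above: ~20 million nucleotide entries,
# ~4 million protein entries, an 8-base query and a ~100-residue query.
dna_hits = naive_expected_hits(alphabet_size=4, length=8, n_entries=20_000_000)
protein_hits = naive_expected_hits(alphabet_size=20, length=100, n_entries=4_000_000)

print(f"8-mer DNA query:       {float(dna_hits):.1f} expected hits")
print(f"100-aa protein query:  {float(protein_hits):.2e} expected hits")
```

Exact fractions are used only so the tiny protein probability is not rounded until it is printed. The model ignores real sequence lengths, base composition, and inexact matching, so it is a sanity check on the numbers rather than a prediction.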

Paul: that's an interesting way to approach the problem, but the answer isn't strictly related to probability.

Look for the answer in the biology.

Is that the "bottom" section of a t-RNA molecule? I just assumed that it was a transcription factor bound to DNA. If that protein is a ribosome subunit as well, then I'd go for:

A: protein
B: RNA (I don't know much about t-RNA conservation across species)

Here's a clue: the identity of the molecules doesn't matter as much as the relationship between them.

Actually, the identity of the molecules does matter as much as the relationship between them (from my perspective).

If this is something like a histone/DNA interaction that's essential to the viability of the organism then the likelihood of finding sequence conservation (for either nucleotide or protein) is high for any randomly picked organism. But if this is a transcription factor interacting with a promoter/inhibitor element and isn't essential for the viability of the organism then the likelihood of finding any hits is lower. Especially if this example came from a species which has experienced genome duplication in the past.

You can't assume that transcription factors are unessential. Lots of transcription factors, e.g. the HOX genes, are essential for viability.

Anyway, the molecules that you see above are both essential.

lol! I wasn't implying that all TFs are unessential, I was saying that *some* are unessential and more "Free to Evolve".

You're getting close, I think. These two molecules are essential, but many essential molecules can still evolve - at least in some areas.

Certainly, some of that freedom is related to the copy number. In this case, though, copy number isn't the critical factor that controls whether these can change or not.

Here's another hint: the answer is in the picture

You have been memed.

From your given definition of "similar" as 30% identity/conserved for amino acids, 60% identity nucleotides -- then for amino acids, medium to high, depending on the actual protein, and for nucleotides, very, very low.

First off, the amino acid sequence can be 100% identical while the nucleotide sequence is 60% identical or below, purely through synonymous changes. Most of those are 3rd-base wobble; since 3rd-base changes alone can't take identity below about 67%, getting lower also needs the first- and second-position changes that the six-codon amino acids (leucine, serine, arginine) allow. Synonymous mutations aren't selected against (at least they usually aren't in most circumstances, as far as we can tell?), so they can accrue freely. Also, though, it depends on whether or not you are looking at a eukaryotic gene, and whether you are actually looking at the raw genomic sequence as opposed to cDNA/mRNA, and if you include introns and 5' and 3' UTRs in your definition of the gene. If it is eukaryotic, you are looking at the raw genomic sequence, and you are including untranslated functional elements and introns, then there could be a very high degree of change indeed without ever affecting the aa sequence.
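The synonymous-change point can be illustrated with a short Python sketch. It uses the standard genetic code and three invented coding sequences that all translate to the same six-residue peptide; the sequences and function names are assumptions made up for this example, not real genes. One thing the numbers make visible: changing only third codon positions leaves two of every three bases untouched, so identity stays at or above roughly 67%, and dropping lower relies on amino acids with six codons (leucine, serine, arginine).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon (trailing partial codon ignored)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def nucleotide_identity(a, b):
    """Percent of positions that match between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Three hypothetical sequences encoding Leu-Ser-Arg-Leu-Ser-Arg.
reference = "TTATCTCGTTTGTCCCGA"
third_base_only = "TTGTCACGCTTATCGCGG"   # every codon changed at position 3 only
six_fold_swaps = "CTCAGCAGGCTGAGTAGA"    # uses the alternative Leu/Ser/Arg codon families

for name, seq in [("third base only", third_base_only),
                  ("six-fold swaps", six_fold_swaps)]:
    same_protein = translate(seq) == translate(reference)
    identity = nucleotide_identity(reference, seq)
    print(f"{name}: same peptide = {same_protein}, identity = {identity:.1f}%")
```

The peptide here is deliberately biased toward six-codon amino acids, so the second comparison is close to a worst case; a typical protein would land somewhere between the two figures.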

On the amino acid side, however, the main aspect of conservation is going to be the functional sites -- and it is going to depend on whether this is a "core function" protein, or something where high variation is actually adaptive, like a recognition protein in the innate immune system, where the evolution of multiple/variant binding sites is a way good thing. Otherwise, the amino acids in the functional sites will constrain variation, but as long as the functional faces are presented to the environment appropriately (e.g. not blocked by a novel fold or turn in the secondary structure) the sequence surrounding the functional sites can vary. Nevertheless, because there are non-synonymous mutations which result in an incompatible amino acid changing the folding (or making necessary folding impossible, like a hydrophilic to hydrophobic change in a residue), this too can be constrained by necessity.

And after that, it just depends on the evolutionary distance you pick.

Is that ok for a 30-second summary?

Sorry, I just realised -- the discussion I included about nucleotide sequence doesn't really apply here, we're not talking so much about any given gene. Blame beer?

I'll post my answer Friday.

It appears that the protein is interacting with the backbone of the nucleic acid. It would seem to me that the nucleic acid sequence is unimportant to the interaction because it all has the same phosphate backbone. This would mean that there is a low probability of finding the same nucleic acid sequence. If this is true then the protein sequence is probably well conserved. If a protein's job is to bind nucleic acid backbone, it would reach its maximum potential in early evolutionary time and undergo purifying selection. As long as the NA backbone does not change, which it hasn't, the protein would not have an evolutionary advantage if it changed. I would say you would have a high probability of finding the AA sequence in different organisms.
