Using protein blast and identifying unknown proteins, part II.

By sporte on March 26, 2009.

In which we identify unknown human proteins.

Yesterday, I wrote about using the BLOSUM 62 matrix to calculate a score for matches between two proteins. Those scores give us a good start on understanding how blastp determines whether two sequences are matching by chance or because they're more likely to be related. But that's not all there is to calculating a blast score, and there's at least one other statistic to consider as well, the E value.

It all comes down to biochemistry
The BLOSUM 62 matrix is based on the substitutions that really do or do not happen in real protein sequences. I want to point out that the ability of a protein to tolerate those replacements is related to the chemical properties of the amino acids in the pair. Naturally, we have a word to describe this. If an amino acid is replaced by one with similar chemistry, we say it's a "conservative change," and we use a "+" to show this in some representations of alignments. Replacing valine with leucine, for example, would be a conservative change, since both amino acids have similar hydrophobic side chains. If the chemistry is different, it's a non-conservative change.
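Here's a minimal sketch of how that "+" could be worked out for the middle line of an alignment. It assumes a hand-copied handful of BLOSUM 62 scores rather than the full matrix, and the names (BLOSUM62_SUBSET, score, midline) and the two short sequences are made up for illustration. This is not how blastp itself is implemented.

```python
# A small, hand-copied subset of BLOSUM 62 scores for non-identical pairs.
# Only the pairs needed for this illustration are included.
BLOSUM62_SUBSET = {
    ("V", "L"): 1,   # both hydrophobic: conservative
    ("D", "E"): 2,   # both acidic: conservative
    ("K", "R"): 2,   # both basic: conservative
    ("V", "D"): -3,  # hydrophobic vs. acidic: non-conservative
}


def score(a, b):
    """Look up a pair of amino acids in either order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def midline(query, subject):
    """Build the middle line of an alignment display.

    Identical residues show the letter, conservative changes
    (positive score) show '+', and everything else shows a space.
    """
    chars = []
    for a, b in zip(query, subject):
        if a == b:
            chars.append(a)
        elif score(a, b) > 0:
            chars.append("+")
        else:
            chars.append(" ")
    return "".join(chars)


# Made-up four-residue sequences, just to exercise each case.
query = "VDKV"
subject = "LEKD"
print(query)
print(midline(query, subject))
print(subject)
```

The first two positions come out as "+", the matching K is printed as a letter, and the valine-to-aspartate change is left blank.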

Now, onto the E value!
The E value (short for "expect value") is a way to normalize our results. It estimates the number of sequences that would match as well as ours purely by chance, if we were searching a database of random sequences of the same size. Of course our real databases are definitely not random, but people actually made databases of random sequences when they worked on the blast algorithms.

The key things that the E value corrects for are the length of our query sequence and the number and lengths of sequences in the database. The query sequence is the one that we're using for the search. The sequences we find are called the subject sequences. If the query sequence is short, it's likely to match more sequences in the database. Consequently, short sequences will have higher E values.

If the query sequence is long, then it might have a longer set of matching amino acids. This will give it a lower E value, since a longer match is less probable.

In terms of the database size, as a database gets larger, there is a greater chance that it will contain matching sequences. So, the E value goes up when you search larger sets of sequences and down when you look at smaller sequence sets.
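To make the effect of query length, database size, and score concrete, here is a rough sketch. It assumes the commonly cited form E = m x n x 2^(-S'), where m is the query length, n is the total length of the database in residues, and S' is the bit score. That formula isn't given in this post, and the real blastp program applies further corrections (such as effective lengths), so treat the numbers as illustrative only. The function name and all input values are hypothetical.

```python
def expect_value(query_length, database_length, bit_score):
    """Approximate E value, assuming E = m * n * 2^(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same query and score, but a database twice as large: E doubles.
small_db = expect_value(100, 1_000_000, 40)
large_db = expect_value(100, 2_000_000, 40)
print(small_db, large_db)

# One extra bit of score cuts E in half.
print(expect_value(100, 1_000_000, 41))

# A short query can only reach a modest score, while a long query with a
# long match can reach a much higher one, and so a far lower E value.
short_query = expect_value(50, 1_000_000, 30)
long_query = expect_value(500, 1_000_000, 200)
print(short_query, long_query)
```

The last pair shows why longer matches end up with lower E values: the score sits in the exponent, so a higher score outweighs the larger search space that comes with a longer query.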

One confusing point is that very low E values are represented as exponential numbers. A number presented as 9e-166 is really: 9 x 10^-166

Okay, that part isn't too confusing. The confusing part is when the E values get to be very, very small. Eventually, they reach a point where there are so many digits in the exponent that it's no longer possible to print the E value on the web page. At this point, the E value gets rounded off to zero. Be aware, the E value is never truly zero, it's just very, very close. If you have an E value close to zero, then your proteins are quite similar, maybe even the same protein from different species.
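If you'd like to see the same kind of rounding for yourself, here's a small sketch using Python's ordinary floating-point numbers. It's offered as an analogy only: I'm not claiming this is exactly how the NCBI page formats its numbers, and the second value is made up to be absurdly small.

```python
# "9e-166" is just shorthand for 9 x 10^-166.
e_value = float("9e-166")
print(e_value)
print(e_value > 0)  # tiny, but still greater than zero

# A number too small for a double-precision float to hold
# gets rounded off to zero when it is read in.
too_small = float("1e-400")
print(too_small)
print(too_small == 0.0)
```

The value didn't become zero because the match is perfect. It became zero because the number ran out of room.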

What does this all mean? If an E value is low, say below 0.01, then the match is significant. If the E value is higher, the match might still be significant, especially if you have a short sequence. If that's the case, you have to evaluate the results in the context of the experiment.
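As a tiny sketch of that rule of thumb, the function below sorts E values into "significant" and "needs context." The 0.01 cutoff comes from the paragraph above. The function name, the labels, and the example values are hypothetical and aren't taken from any real search.

```python
SIGNIFICANCE_CUTOFF = 0.01  # the rule-of-thumb threshold mentioned above


def classify(e_value, cutoff=SIGNIFICANCE_CUTOFF):
    """Label an E value using a simple cutoff."""
    if e_value < cutoff:
        return "significant"
    # A higher E value isn't automatically meaningless, especially
    # for a short query, so it gets flagged for a closer look.
    return "needs context"


# Made-up E values, for illustration only.
for e in [0.0, 9e-166, 0.005, 0.5, 3.0]:
    print(e, classify(e))
```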

Using blastp to identify unknown proteins
If you're in my ACC class, your assignment will be to take a closer look at some of these oncogenes, using an industrial-strength form of protein blast, called "blastp." If you're not in my class, you can follow along for fun.

Instructions

1. Go to the NCBI home page.
2. Search with the query: human AND unknown
3. Click the link to the protein database.
4. How many unknown human proteins are listed?

5. Pick one of the sequences in the list, record the accession number (so you can find it later) and click the word "BLink" that appears on the right side of the page.

BLink is a link to the results of blastp searches that have already been done. Whenever a new protein sequence (or predicted protein sequence) enters GenBank, blastp is run automatically.

6. When you select BLink, you will get a list of all the results from blastp searches.

In the example below, I have multiple proteins with the identical blastp score. They might all be the same protein, or different forms of the same protein, or they might be the same protein in different species.

7. Pick one of the highest scoring sequences to work with. Go to the class Blackboard site and review the accession numbers from your classmates. If someone else has chosen your number, go back and find another protein to work with.

8. Click the link to the blastp score to see the alignment.

9. Look at the alignment and answer the following questions:

a. Are the two sequences the same length?
b. Do they align over the entire sequence or just in part of the sequence?
c. Do you think this blastp result is significant? Use the blastp score and the E value to justify this statement.

10. Click the link to the subject sequence that you found in the database. Read the description and the comments in the sequence record. You might look at the other matching sequences to get more details. Write a 1-2 paragraph description of the information you can find about this protein and its function.
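If you'd rather script the search in steps 1 through 4 than click through the web pages, here's a hedged sketch that uses NCBI's E-utilities "esearch" service with only the Python standard library. The function names are my own, and I'm assuming the JSON response keeps its count under "esearchresult". The number you get back may not match what the web page shows, and it will change over time as the database grows.

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="protein"):
    """Build an esearch URL for a query against one NCBI database."""
    params = {"db": db, "term": term, "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def count_hits(term, db="protein"):
    """Ask NCBI how many records match the query (needs a network connection)."""
    with urllib.request.urlopen(build_esearch_url(term, db)) as response:
        result = json.load(response)
    return int(result["esearchresult"]["count"])


# Building the URL works offline.
print(build_esearch_url("human AND unknown"))

# To run the actual search (step 4), uncomment the next line:
# print(count_hits("human AND unknown"))
```

The rest of the assignment, looking at BLink results and reading the alignment, is still best done in the browser.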
