In which we search for Elvis, using blastp, and find out how old we would have to be to see Elvis in a Las Vegas club.
Once you’re acquainted with proteins, amino acids, and the kinds of bonds that hold proteins together, we can talk about using this information to evaluate the similarity between protein sequences. We can easily imagine that if two protein sequences are identical, then those proteins would have the same kind of activity. But what about proteins that are similar in some regions, and not others, or proteins that only share some of the same amino acids in similar positions?
We know that the sequences of individual proteins can change through evolution. But how much can a protein change before it’s no longer the same protein? Where do we draw the line between proteins that might be related and proteins that aren’t related at all?
To answer these kinds of questions, we use statistical tools that help us compare sequences, measure the probability of change at any amino acid position, and evaluate the likelihood of finding a match between two proteins. These same kinds of tools can also be used, with a few changes, to compare sequences of DNA. The most commonly used program for comparing molecular sequences, either proteins or DNA, is BLAST.
What are the BLAST programs?
BLAST stands for “Basic Local Alignment Search Tool.” Different versions of BLAST are used to compare different kinds of sequences to each other. Blastn, for example, compares nucleotide sequences. Blastp compares sequences of amino acids. Some versions of BLAST contain algorithms that translate nucleic acid sequences into amino acids. These can be used to compare a translated nucleic acid sequence to a database of protein sequences (blastx), a protein sequence to a translated database of nucleic acid sequences (tblastn), or even a translated nucleic acid sequence to a translated database of nucleic acid sequences (tblastx).
The translation options are especially helpful when we’re trying to predict features like splice sites, or determine the correct protein sequence, or even to evaluate the accuracy of a DNA sequence. Since we don’t always know which strand is used as a template or which reading frame is used in a cell, blast predicts the amino acid sequences that could be produced from all six possible reading frames.
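To make the "six possible reading frames" idea concrete, here is a minimal sketch in Python. It assumes the standard genetic code and is not BLAST's own implementation. The function names (reverse_complement, translate, six_frames) and the frame labels (+1 to +3, -1 to -3) are names I made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unrecognized codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate both strands in all three offsets (hypothetical helper)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCC").items():
    print(name, protein)
```

Each of the six translations is a candidate protein sequence. A translated search compares every one of them, since we can't tell in advance which frame the cell actually uses.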
How does protein blast (blastp) work?
Imagine a protein that contains cysteines. Sometimes cysteines participate in disulfide bonds. These bonds can hold different parts of a protein close together in a single chain or different peptide chains together in a quaternary structure.
Cysteines can also participate in other activities, such as binding to zinc. You might imagine, then, that if you have a protein with a certain enzymatic activity that uses cysteines to bind to zinc, other proteins with that activity might also use cysteines to bind to zinc.
So, if we line up the amino acids in our protein with the sequence of amino acids in another, we could give the other protein points for having cysteines in similar positions. (Yeah! Points!) If the other protein had a different amino acid in that position, we would penalize it and take away points.
Figure 1. Each cysteine gets lots of points for matching.
I like to think of protein sequence comparisons as a game. Every time two amino acids match, you get points. Wherever they don’t match, you lose points. Since cysteines are unusual, and really important, we get more points for a pair of cysteines than we might for other pairs of matching amino acids.
How many points is a pair of cysteines worth?
We assign points to matching pairs of amino acids by using a scoring system that’s a kind of matrix. The scoring system that we use most often is called BLOSUM 62 and was obtained by the Henikoffs from observations of experimental data (1). They aligned sets of similar proteins and determined the frequency of different residues replacing each other. The BLOSUM 62 matrix is derived from those frequencies.
To calculate a score using the BLOSUM 62 matrix, you need to assign points for every pair of aligned residues. For each aligned pair, find the amino acid from your sequence in the column on the side, then read across that row until you reach the column for the amino acid in the corresponding position of the other sequence. The number where the row and column meet is the score for that pair.
Figure 2. The BLOSUM 62 matrix. A pair of cysteines is worth 9 points. If the other protein had a glutamic acid (E) instead of a cysteine, then the pair would score -4, a penalty of 4 points.
Let’s calculate the score!
After the proteins are aligned, blastp counts up the points for each pair and adds them to obtain a score. If my nickname were a sequence of amino acids (SANDY), and I found a perfect match, the aligned sequences would look like this:
SANDY
SANDY
and the score, where we add the points for each pair of amino acids (in the example below, ss = a pair of serines, aa = a pair of alanines, etc.), would be:
ss(4) + aa(4) + nn(6) + dd(6) + yy(7) = 27
If I had an alignment like this:
SANDY
SANTY
The score would be: ss(4) + aa(4) + nn(6) + dt(-1) + yy(7) = 20
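If you'd like to check this kind of arithmetic with a few lines of code, here is a minimal sketch in Python. It is not blastp itself, and it only handles ungapped alignments of equal length. The table holds just part of BLOSUM 62: the scores for identical pairs, plus the two mismatches mentioned in this post (D/T and C/E). The function names pair_score and alignment_score are my own inventions.

```python
# Partial BLOSUM 62 lookup: identical pairs (the matrix diagonal) only.
IDENTICAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}

# Only the two mismatched pairs used in the text; the real matrix has many more.
MISMATCHES = {
    frozenset("DT"): -1,
    frozenset("CE"): -4,
}


def pair_score(a, b):
    """Score one aligned pair; raises KeyError for pairs outside this subset."""
    a, b = a.upper(), b.upper()
    if a == b:
        return IDENTICAL[a]
    return MISMATCHES[frozenset((a, b))]


def alignment_score(query, subject):
    """Add up the pair scores for two aligned sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("this sketch only scores ungapped, equal-length alignments")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


print(alignment_score("SANDY", "SANDY"))  # 27
print(alignment_score("SANDY", "SANTY"))  # 20
```

Because every identical pair is in the table, you can call alignment_score on any perfect match of your own and compare it with your hand calculation.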
How many points would we get for matching ELVIS?
Use the BLOSUM 62 matrix to calculate the score, then use BLAST to find out if your calculation was correct.
Searching for ELVIS
Elvis is a famous American icon who died years ago and reappears on a regular basis thanks to people who like to dress like him and sing in Las Vegas clubs. Let’s see if we can find Elvis and also see how old we would probably have to be if we wanted to see him in a club today, in Las Vegas, NV. He’s been reported in numerous places ever since his death, so maybe we’ll find him in a protein sequence.
To search for ELVIS:
1. Go to the NCBI blast home page and choose “protein blast.”
2. Type Elvis in the large text box.
3. Scroll down the page and change the parameters as shown in the image and listed below.
As an amino acid sequence, Elvis is a bit short, but we can still find him in the protein database, if we change a few parameters in the search.
Figure 3. Change these blastp settings
4. To summarize:
a. Deselect the short query option
b. Change the Expect threshold to 5 million (type a 5 and six zeros without commas)
c. Change the word size to 2 (this is a more sensitive search)
d. Change the compositional adjustments setting to “no adjustments”
5. Click the blue BLAST button to have blastp search for ELVIS.
6. After a short time, your results should appear. Scroll down the page and look for the column labeled “Score.”
Click any one of the links in that column to see the alignment between ELVIS and a protein from the database.
Each alignment will show the name of the protein and, in parentheses after the score, a number that should match the number you calculated.
For example, you might see something like this:
Did you find the answer? How old do you need to be to see ELVIS in a Las Vegas club?
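If you prefer the command line to the web form, the settings from step 4 can be roughly approximated with the NCBI BLAST+ blastp program. The sketch below only assembles and prints the command; it does not run a search. It assumes BLAST+ is installed. The file name elvis.fasta is hypothetical, and the mapping from web-form settings to options is my own reading, so check it against the BLAST+ documentation before relying on it.

```python
import shlex


def blastp_command(query_fasta, database="nr"):
    """Build a blastp argument list mirroring the web-form settings (a sketch)."""
    return [
        "blastp",
        "-query", query_fasta,      # FASTA file containing the sequence ELVIS
        "-db", database,
        "-remote",                  # search at NCBI instead of a local database
        "-task", "blastp",          # plain blastp, not the blastp-short task
        "-evalue", "5000000",       # Expect threshold of 5 million
        "-word_size", "2",          # smaller words give a more sensitive search
        "-comp_based_stats", "0",   # no compositional adjustments
    ]


command = blastp_command("elvis.fasta")
print(shlex.join(command))
```

Building the arguments as a list keeps each option next to its value, and the list can be handed to subprocess.run unchanged once you are happy with it.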
FYI – there is a bit more to scoring blastp results than this, which is why this is labeled part I.
1. Henikoff, S., & Henikoff, J. (1993). Performance evaluation of amino acid substitution matrices. Proteins: Structure, Function, and Genetics, 17(1), 49-61. DOI: 10.1002/prot.340170108