Using protein blast and searching for Elvis, part I.

In which we search for Elvis, using blastp, and find out how old we would have to be to see Elvis in a Las Vegas club.

Introduction

Once you're acquainted with proteins, amino acids, and the kinds of bonds that hold proteins together, we can talk about using this information to evaluate the similarity between protein sequences. We can easily imagine that if two protein sequences are identical, then those proteins would have the same kind of activity. But what about proteins that are similar in some regions, and not others, or proteins that only share some of the same amino acids in similar positions?

We know that the sequences of individual proteins can change through evolution. But how much can a protein change before it's no longer the same protein? Where do we draw the line between proteins that might be related and proteins that aren't related at all?

To answer these kinds of questions, we use statistical tools that help us compare sequences, measure the probability of change at any amino acid position, and evaluate the likelihood of finding a match between two proteins. These same kinds of tools can also be used, with a few changes, to compare sequences of DNA. The most commonly used program for comparing molecular sequences, either proteins or DNA, is BLAST.


What are the BLAST programs?

BLAST stands for "Basic Local Alignment Search Tool." Different versions of BLAST are used to compare different kinds of sequences to each other. Blastn, for example, compares nucleotide sequences. Blastp compares sequences of amino acids. Some versions of blast contain algorithms that translate nucleic acid sequences into amino acids. These can be used to compare a translated nucleic acid sequence to a database of protein sequences (blastx), a protein sequence to a translated database of nucleic acid sequences (tblastn), or even a translated nucleic acid sequence to a translated database of nucleic acid sequences (tblastx).

The translation options are especially helpful when we're trying to predict features like splice sites, or determine the correct protein sequence, or even to evaluate the accuracy of a DNA sequence. Since we don't always know which strand is used as a template or which reading frame is used in a cell, blast predicts the amino acid sequences that could be produced from all six possible reading frames.
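To make the "six possible reading frames" idea concrete, here is a minimal Python sketch that translates a DNA sequence in all six frames (three offsets on each strand). It is not how BLAST itself is implemented; the function names are made up for illustration, and it assumes the standard genetic code and a sequence containing only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return a dict of frame label -> translation ('*' marks a stop codon)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six translations can then be compared to protein sequences, which is the general idea behind the translated searches described above.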

How does protein blast (blastp) work?


Imagine a protein that contains cysteines. Sometimes cysteines participate in disulfide bonds. These bonds can hold different parts of a protein close together in a single chain or different peptide chains together in a quaternary structure.

Cysteines can also participate in other activities, such as binding to zinc. You might imagine, then, that if you have a protein with a certain enzymatic activity, and it uses cysteines to bind to zinc, other proteins with that activity might also use cysteines to bind to zinc.

So, if we line up the amino acids in our protein with the sequence of amino acids in another, we could give the other protein points for having cysteines in similar positions. (yeah! points!) If the other protein had a different amino acid in that position, we would penalize it and take away points.


Figure 1.  Each cysteine gets lots of points for matching.

I like to think of protein sequence comparisons like a game. Every time two amino acids match, you get points. Wherever they don't match, you lose points. Since cysteines are unusual, and really important, we get more points for a pair of cysteines than we might for other pairs of matching amino acids.

How many points is a pair of cysteines worth?

We assign points to matching pairs of amino acids, by using a scoring system that's a kind of matrix.  The scoring system that we use most often is called BLOSUM 62 and was obtained by the Henikoffs from observations of experimental data (1). They aligned sets of similar proteins and determined the frequency of different residues replacing each other. The BLOSUM 62 matrix is derived from those frequencies.

To calculate a score using the BLOSUM 62 matrix, you need to assign points for every pair of aligned residues. For each amino acid in your sequence, find that amino acid in the column on the side, then read across until you reach the amino acid found at the corresponding position in the other sequence. The number where the two meet is the score for that pair.


Figure 2. The BLOSUM 62 matrix. A pair of cysteines is worth 9 points. If the other protein had a glutamic acid (E) instead of a cysteine, then there'd be a penalty of -4 points.

Let's calculate the score!

After the proteins are aligned, blastp counts up the points for each pair and adds them to obtain a score. If my nickname were a sequence of amino acids (SANDY), and I found a perfect match, the aligned sequences would look like this:

SANDY
SANDY

and the score, where we add the points for each pair of amino acids (in the example below, ss = a pair of serines, aa = a pair of alanines, etc.), would be:

ss(4) + aa(4) + nn(6) + dd(6) + yy(7) = 27

If I had an alignment like this:

SANDY
SANTY

the score would be:

ss(4) + aa(4) + nn(6) + dt(-1) + yy(7) = 20
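The arithmetic above can be sketched in a few lines of Python. This is only an illustration of adding up pair scores for an ungapped alignment, not blastp's actual code. The dictionary holds just the BLOSUM 62 values quoted in this post, so it is a partial excerpt; the names BLOSUM62_EXCERPT, pair_score, and alignment_score are made up for the example. To score other sequences (ELVIS, for instance), you would need to add the missing pairs from the matrix in Figure 2.

```python
# Partial excerpt of BLOSUM 62: only the pairs quoted in this post.
# Add more entries from the full matrix to score other sequences.
BLOSUM62_EXCERPT = {
    ("S", "S"): 4,
    ("A", "A"): 4,
    ("N", "N"): 6,
    ("D", "D"): 6,
    ("Y", "Y"): 7,
    ("D", "T"): -1,
    ("C", "C"): 9,
    ("C", "E"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up the score for one aligned pair; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    if (b, a) in matrix:
        return matrix[(b, a)]
    raise KeyError(f"No score for pair {a}/{b} in this excerpt of the matrix")


def alignment_score(seq1, seq2, matrix=BLOSUM62_EXCERPT):
    """Add up the pair scores for two aligned sequences of equal length (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("This sketch only handles ungapped, equal-length alignments")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1.upper(), seq2.upper()))


if __name__ == "__main__":
    print(alignment_score("SANDY", "SANDY"))  # 27
    print(alignment_score("SANDY", "SANTY"))  # 20
```

Storing each pair once and checking both orders keeps the table short, since a pair like D/T scores the same as T/D.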

How many points would we get for matching ELVIS? 
Use the BLOSUM 62 matrix to calculate the score, then use BLAST to find out if your calculation was correct.

Searching for ELVIS
Elvis is a famous American icon who died years ago and reappears on a regular basis thanks to people who like to dress like him and sing in Las Vegas clubs. Let's see if we can find Elvis and also see how old we would probably have to be if we wanted to see him in a club today, in Las Vegas, NV. He's been reported in numerous places ever since his death, so maybe we'll find him in a protein sequence.

To search for ELVIS:

1.  Go to the NCBI blast home page and choose "protein blast."
2.  Type Elvis in the large text box.
3.  Scroll down the page and change the parameters as shown in the image and listed below.

As an amino acid sequence, Elvis is a bit short, but we can still find him in the protein database, if we change a few parameters in the search.


Figure 3. Change these blastp settings

4. To summarize: 

a.  Deselect the short query option
b.  Change the Expect threshold to 5 million (type a 5 and six zeros without commas)
c.  Change the word size to 2 (this is a more sensitive search)
d.  Change the compositional adjustments setting to "no adjustments"

5.  Click the blue BLAST button to have blastp search for ELVIS.

6.  After a short time, your results should appear.  Scroll down the page and look for the column labeled "Score." 

Click any one of the links in that column to see the alignment between ELVIS and a protein from the database.
 
Each alignment will show the name of the protein and, in parentheses after the score, a number that should match the number you calculated.

For example, you might see something like this:

[Image: an example alignment, with the score shown in parentheses]
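If you save the text of an alignment, a short Python sketch like this one can pull out the number in parentheses for you. It assumes the score line has the form "Score = ... bits (...)"; the sample line and its numbers are invented placeholders, not real results, and parse_score_line is a made-up name.

```python
import re

# Matches lines such as "Score = 30.0 bits (65), Expect = 0.5".
SCORE_LINE = re.compile(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)")


def parse_score_line(line):
    """Return (bit_score, raw_score), or None if the line has no score."""
    match = SCORE_LINE.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "Score = 30.0 bits (65), Expect = 0.5"
    bits, raw = parse_score_line(sample)
    print("bit score:", bits)
    print("number in parentheses:", raw)
```

The number in parentheses is the one to compare with the score you added up by hand from the matrix.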

 

Did you find the answer?  How old do you need to be to see ELVIS in a Las Vegas club?
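If you'd rather work from a command line, the same settings might translate to NCBI's standalone BLAST+ blastp program roughly as sketched below. Treat this as an assumption rather than part of the instructions above: the mapping from the web form to the flags is a best guess, the file name elvis.fasta and the database name nr are hypothetical, and results from a local search may not match the web page exactly. The code only builds and prints the command; it doesn't run it.

```python
import shlex


def build_blastp_command(query_fasta, database="nr"):
    """Build a BLAST+ blastp command that mirrors the web settings above.

    Assumed mapping (not taken from the post):
      Expect threshold of 5 million        -> -evalue 5000000
      Word size of 2                       -> -word_size 2
      No compositional adjustments         -> -comp_based_stats 0
    """
    return [
        "blastp",
        "-query", query_fasta,        # hypothetical FASTA file containing ELVIS
        "-db", database,              # hypothetical protein database name
        "-evalue", "5000000",
        "-word_size", "2",
        "-comp_based_stats", "0",
    ]


if __name__ == "__main__":
    command = build_blastp_command("elvis.fasta")
    print(shlex.join(command))
```

To actually run the search, you would need BLAST+ installed and a protein database available locally, and could then pass the list to subprocess.run.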

FYI - there is a bit more to scoring blastp results than this. That's why this is labeled part I.

Reference:
 1.  Henikoff, S., & Henikoff, J. (1993). Performance evaluation of amino acid substitution matrices Proteins: Structure, Function, and Genetics, 17 (1), 49-61 DOI: 10.1002/prot.340170108

Copyright@ Sandra Porter, 2009
Updated 3/26/2009 to clarify some of the instructions.


I calculated 21; the BLAST page's 100 top matches say 22. I also know that the legal age to enter casino showrooms in Las Vegas while Elvis was alive was 21. How may I account for this discrepancy?

By MaryOGrady (not verified) on 29 Mar 2009

Hi, Dr. Porter,

I got the results, but I don't know how to answer the question "How old do you need to be to see ELVIS in a Las Vegas club?". Is it 125?

The results are below:
>gb|ACO24093.1| maltodextrose utilization protein MalA [Streptococcus pneumoniae Taiwan19F-14]
Length=266

Score = 12.7 bits (21), Expect = 2109595
Identities = 5/5 (100%), Positives = 5/5 (100%), Gaps = 0/5 (0%)

Query  1    ELVIS  5
            ELVIS
Sbjct  121  ELVIS  125

Chunhong

By Chunhong Li (not verified) on 29 Mar 2009

Hi, Dr. Porter,

Is it 21?

The 22 comes up when I fail to change the compositional adjustments to "No adjustment."
Whew! What a relief for lounge lizards everywhere!

By MaryOGrady (not verified) on 29 Mar 2009

Hi, Dr. Porter,

I know what that means.
ee(5)+ll(4)+vv(4)+ii(4)+ss(4)=21

Hi Chunhong,

It didn't occur to me until reading some questions from you and another student, how much this activity assumes some knowledge of American culture. Sorry about that.

Elvis impersonators in Las Vegas are likely to perform in bars, and in the U.S. (in most states) you can't legally go into a bar unless you're at least 21.

Hi, Dr Porter
I can't find any examples or more useful information about the details of "compositional adjustment". Could you explain the differences between the results for each of the available options, or give any helpful information?

Sure, tom. For my example, we ignored it.

But, if you have a protein with an unusually high number of certain amino acids, compositional adjustments can give you more accurate E values.

Here's the summary from the NCBI.

Hi,
I Googled this page because I was looking for an old Science letter that joked about using BLASTP to look for Elvis, and I hoped you'd linked to it. At least in a superficial skim, I didn't see a link to that old paper, so I was wondering whether you've ever seen it.
In any case, it's:
James B. Kaper and Harry L. T. Mobley
Science, New Series, Vol. 253, No. 5023 (Aug. 30, 1991), pp. 951-952
If you have access, you can read it through JSTOR:
http://www.jstor.org/stable/2878766?seq=1&Search=yes&term=elvis&list=hi…

By Warren Terra (not verified) on 19 Aug 2010