Teaching with the new BLAST

BLAST is a collection of programs that are used to compare sequences (DNA, RNA, or protein) to larger collections of sequences that are stored in databases. I've used BLAST as a teaching tool for many years, partly because it's become a standard tool for biological work and partly because it's very good at illustrating evolutionary relationships on a molecular level.

A few months ago, the NCBI changed the web interface for doing BLAST searches at their site. I wrote earlier about changes that I made to our animated tutorial in response to the new BLAST. Now, I want to mention some of the good and bad things about the change in design.


What I like about the new interface:
If you make yourself an account at the NCBI, your searches and results are now saved for 48 hours and available from links in a table when you log in. I really like this feature! We do a similar thing in our Finch software and I've always liked being able to store sets of results and work with them later.

Having your results stored in a database is really helpful when you want to do lots of searches with different parameters and compare the results. It's also convenient when you want to do the Julia Child thing. You can set up lots of searches ahead of time and your results will be stored at the NCBI and ready for your discussion.

[Image: table of saved BLAST searches and results shown after logging in]

This is a nice improvement. In the past, the NCBI only saved the search results for 24 hours and you had to keep track of the request ID. The new options are much nicer. I also like knowing the expiration time and having the option to save strategies that worked well.

What I don't like
1. If you're teaching and demonstrating live BLAST searches in front of a class, be sure to log out first! When you have your own account, all the parameters you use get stored and are presented on the input form as the default choices. If you forget this, you will automatically be searching with different parameters and databases than your students.

This can lead to some very weird experiences that make your search results different from those of your students and make it challenging to figure out what's happening - especially in the middle of a lecture. To quote Charlie Brown: "AAARGH!"

So, if you're going to do a live search, log out first and save yourself the grief.

2. The other thing is that you must look at your results to see what you really did. I expected that checking an option in the web form would override other parameters, but it's not clear that this really happens.

Here's an example:

[Image: check box for search parameter options in the BLAST web form]

I thought that clicking this check box would change the search parameters to match the ones that we used to use for looking at primer binding sites.

No. In fact, I'm not sure how this option changed the search.

Fortunately, you can see which parameters BLAST used by scrolling down to the very bottom of your search results.

[Image: parameter summary at the bottom of the BLAST search results]

You can't get all the information here. Nothing seems to record whether you used any kind of filter (low complexity, repeats, etc.). You'll have to go back to the original web form for that information. Still, you can find out quite a bit.

That's all for now. It's time for me to blast off.


I've used BLAST as a teaching tool for many years, partly because it's become a standard tool for biological work and partly because it's very good at illustrating evolutionary relationships on a molecular level.

It's good for lots of things, but one of them isn't illustrating evolutionary relationships. BLAST will give you a crude similarity score between any pair of sequences, but its alignment program is not good enough for meaningful evolutionary relationships, and it doesn't handle multiple sequence alignments or produce accurate phylogenetic trees.

I disagree. You're making this concept out to be harder than it needs to be. Unless you're trying to publish an evolutionary study, you can use simpler tools. You don't need a multiple alignment program and you can draw simple cladograms with a piece of paper and a pencil.

Here's one example of how you can do this: How similar are apes and humans?

You determined that human and chimp mitochondrial genomes are 91% identical using BLAST. You also determined that the human and gorilla mitochondrial genomes are 87% identical.

You might consider this to be a valid way of teaching evolutionary relationships but I don't. I don't trust those figures because I don't trust the BLAST alignment algorithm.

However, I agree with you that a simple similarity matrix, if correct, can be used to make a crude phylogenetic tree.
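That pencil-and-paper clustering logic can be sketched in a few lines. This is a minimal illustration, not a real phylogenetics method: the human-chimp and human-gorilla identities come from the post, while the chimp-gorilla figure is an assumption added here for the example.

```python
# Crude grouping from pairwise percent identities.
# human-chimp (91%) and human-gorilla (87%) are from the post;
# the chimp-gorilla value is assumed here for illustration.
identity = {
    ("human", "chimp"): 91.0,
    ("human", "gorilla"): 87.0,
    ("chimp", "gorilla"): 87.0,  # assumed value
}

def distance(a, b):
    """Turn a percent identity into a simple distance (100 - identity)."""
    key = (a, b) if (a, b) in identity else (b, a)
    return 100.0 - identity[key]

# The pair with the smallest distance groups together first --
# the same step students carry out with paper and pencil.
pairs = [("human", "chimp"), ("human", "gorilla"), ("chimp", "gorilla")]
closest = min(pairs, key=lambda p: distance(*p))
print(closest)  # ('human', 'chimp') -> humans group with chimps first
```

The point is only that the smallest-distance pair joins first; repeating the step on merged groups gives the crude cladogram.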

Can you explain why you think the BLAST alignment algorithm would give incorrect results?

I'm curious.

I agree there are more sensitive ways of identifying matching sequences, e.g., cross_match, FASTA, HMMER, VAST, etc. And it is possible to make mistakes when interpreting your BLAST results or when picking the parameters, but I don't think that makes the algorithm invalid.

BLAST was first published in 1990 as a faster search tool than FASTA or FASTP, and it's been a standard tool for biological research for a long time. Is there something that the rest of the world should know?

No automated alignment program does as good a job as an informed researcher. They usually put in too many gaps. In the case of BLAST, the alignments are separated into small regions of high similarity while ignoring the regions between them, which have to be counted when comparing species.

Here's a brief summary from the Wikipedia article on sequence alignment.

Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family.
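The word-method idea in that summary can be sketched as a toy example: index every k-letter word in the subject sequence, then look up the query's words to find exact seed matches. Real BLAST then extends and scores these seeds; this sketch (with made-up sequences) only finds them.

```python
# Toy illustration of the k-tuple ("word") heuristic behind BLAST/FASTA:
# find exact k-mer matches (seeds) between a query and a subject sequence.
from collections import defaultdict

def word_index(seq, k):
    """Map each k-mer to the positions where it occurs in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) pairs where k-mers match exactly."""
    index = word_index(subject, k)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds

# Example with invented sequences:
seeds = find_seeds("ACGTACGGT", "TTACGTACG", k=4)
print(seeds)  # [(0, 2), (1, 3), (2, 4), (3, 1), (3, 5)]
```

Because only words present in the index are ever examined, most of the database is skipped, which is exactly why word methods are fast but not guaranteed to find the optimal alignment.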

No automated alignment program does as good a job as an informed researcher.

I think if that's the level of precision that you're looking for in your alignments, you shouldn't be using sequence alignment anyway. It sounds like you want structural alignments.

The Conserved Domain Database at the NCBI is a great tool for looking at and working with these, and many of the alignments have been curated and tweaked by the eyes of informed researchers. They take the alignments from Pfam and SMART and correct the sequence alignments using structure data.

It sounds like you want structural alignments.

Nope. You align amino acid residues, not structural motifs. Sometimes knowing the structures will help in aligning the residues, but that doesn't happen very often. If the amino acid sequence identity falls below 20%, then aligning on the basis of alpha helices and beta strands isn't going to reveal homology, and people are fooling themselves if they think otherwise. (There are one or two exceptions to that generality.)

If anyone reduces the overall amino acid sequence similarity by shifting alignments to match structure then it's safe to say that they don't know what they're doing. Sequence always trumps structure.

VAST works by aligning secondary structure elements. It also provides quality metrics that you can use to evaluate an alignment, like the RMSD value. I find those kinds of things helpful.

Why not do an experiment and compare what you find in the Conserved Domain database with something else that you've aligned and adjusted manually?

We may not ever agree on this, but I've had lots of experience with hearing strong opinions from people who were completely wrong. That has taught me to prefer using data as a guide when comparing different methods.

You determined that human and chimp mitochondrial genomes are 91% identical using BLAST. You also determined that the human and gorilla mitochondrial genomes are 87% identical.

I also found that the bonobo and chimpanzee (Pan troglodytes) mitochondrial genomes were 95% identical to each other, and that they were 87.2% and 87% identical to the gorilla mitochondrial genome, respectively. I didn't show all the data in the post.

Sandra Porter says,

We may not ever agree on this, but I've had lots of experience with hearing strong opinions from people who were completely wrong. That has taught me to prefer using data as a guide when comparing different methods.

Me too. How much experience do you have in analyzing evolutionary relationships using the standard techniques that most molecular evolutionary biologists use?

Do you think there's a reason why BLAST analyses aren't usually published in the leading molecular evolution journals or do you think that those who study evolution professionally are just a bunch of people with strong opinions who are wrong?

Larry,

It's dawned on me that you probably never actually read the activities that I wrote about in this post, and that might be why you've completely missed the point.

If I'm using phylogenetics in a professional capacity, I use the tools that have the best experimental support.

When I use BLAST as a teaching tool or when I write about teaching with BLAST, I'm not writing for graduate students or researchers who might be publishing papers in a journal on molecular evolution. I'm writing for college instructors and high school teachers who want to bring bioinformatics activities into their classrooms.

My goal is to help teachers help students understand more fundamental ideas. These kinds of concepts don't always require the newest, fanciest, or highest-resolution tools. If I'm teaching about pH, I can use pH paper. If I'm teaching about the Beer-Lambert law, I can use test tubes with serial dilutions of dye. You can teach about column chromatography with filter paper. Likewise, if I'm trying to help students discover evolutionary principles, I can use physical characteristics and I can use sequence comparison programs like BLAST. Ideas can be communicated even in the absence of the highest-resolution tools.

BLAST helps students see that:
1. Different organisms can have similar, almost identical gene sequences.
2. A single organism can have genes that are similar to each other, i.e., genes can exist in families.
3. We can use the similarity between genes to investigate the relationships between different organisms and the relationships between different genes in the same organism.

In fact, what I'm doing is not that unusual or even heretical, for that matter. There's a really good activity from the National Academy of Sciences, where they count base differences by hand, and they don't even use software. It's published by the National Academy Press in a book called "Teaching about evolution". I've just developed versions of the activity that are more up-to-date.
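The hand-counting step at the heart of that activity is easy to make concrete. As a minimal sketch (the sequences below are invented examples, not real mitochondrial data):

```python
# Count matching bases between two aligned sequences, the way the NAS
# activity does by hand. The sequences are made up for illustration.
def percent_identity(seq1, seq2):
    """Percent of positions that match in two equal-length aligned sequences."""
    assert len(seq1) == len(seq2), "sequences must be aligned to equal length"
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)

print(percent_identity("ACGTTAGCAT", "ACGTCAGCTT"))  # 80.0 (8 of 10 match)
```

Whether students do this by eye or with a few lines of code, the number of differences is what feeds the tree-building step.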

At the end of the day, we all want students to understand that trees are models of evolution that we build based on the number of differences that we find between species. It's not a big deal, in this arena, if BLAST underestimates the exact number of differences or fails to create the best possible alignments. NCBI BLAST is easy for teachers and students to use, free, supported by a large team of engineers, and illustrates the points that we want students to understand.

Nope. You align amino acid residues, not structural motifs. Sometimes knowing the structures will help in aligning the residues, but that doesn't happen very often. If the amino acid sequence identity falls below 20%, then aligning on the basis of alpha helices and beta strands isn't going to reveal homology, and people are fooling themselves if they think otherwise. (There are one or two exceptions to that generality.)
A recent example of using structural alignment together with sequence alignment came out in PNAS, where they used the structure of the pyrrolysyl-tRNA synthetase to place it among the other aminoacyl-tRNA synthetases. The structural alignment was only a starting point for a more detailed sequence-based phylogeny. The human eye is still the best way to do detailed refinement of multiple alignments. The programs all have their quirks and biases, which a researcher has to go back and look at and refine by eye.

thanks ponderingfool,

That looks like an interesting paper.

I'm curious - how do you judge when to adjust the alignment? I understand how to do this with sequence assemblies and when using programs like phrap and consed.

But when you're looking at multiple alignments by eye and tweaking them, how do you know what to change? Is it all intuition? What rules do you apply?

I'm really interested in knowing, since this is something that's always puzzled me.

The rules are based on what you know about the enzyme you are working on and what is known about evolution and biochemistry. The human eye is really good at seeing patterns. Many times the mistakes made by a program such as Clustal are fairly obvious, especially around gaps. You are looking for patterns that exist in the sequences but that the program did not find. You then use a sequence alignment editor to move the sequences to better fit the pattern, evaluate the new alignment, and keep doing this over and over again.

Here is a general how-to PowerPoint presentation on the matter from the EMBL "Making the Most of your Multiple Sequence Analysis" course by Aidan Budd and David Judge.

Larry I am sure can go into greater detail as I am not the expert he is on the subject.

For sure, JalView often does a lovely job.

For a couple of reasons, however, I still alternate between ClustalX and SeaView much of the time when editing alignments.

Firstly, ClustalX has the great "Show Low-Scoring Segments" option in the "Quality" menu. This is the best way I know to quickly spot problem regions in an alignment.

Secondly, I like to be able to realign subsets of the columns of the alignment, particularly using a different set of alignment parameters. This is great when I am very happy with part of my alignment but want to see what Clustal can do with a problematic region.

Having said that, I would typically align the sequences using something like ProbCons, MAFFT, or MUSCLE before checking them out in ClustalX.

BTW, check out http://osx.iusethis.com/app/clustalx for the recently released version of ClustalX, the first major release for many years.

Oh, a new OSX version! Thank you so much for letting me know about it!

And those are all very good points. I get tired of editing alignments in JalView, saving the edited version, and reimporting it again to realign after I've trimmed the sequences.

Ah - a quick check shows this doesn't seem to be the right link. If I can dig it out I'll post it again here...

I should add that what I describe also involves looking and doing a bit of sequence changing in ClustalX, but doing the actual edits in SeaView (remembering to start from the right side of the alignment makes it much easier to find the regions you identify in ClustalX as needing an edit when you're using SeaView).

By aidan budd (not verified) on 28 Nov 2007 #permalink