Updates to the BLAST for beginners tutorial

By now, many of you have probably seen the new BLAST web interface at the NCBI. There are many good things that I can say about it, but there are a few others that caught me by surprise during my last couple of classes.


Because of these changes, and because I'm giving a workshop for teachers on BLAST at the Fralin Biotechnology Conference in Blacksburg, VA, next week, it seemed like a good time to update our animated BLAST tutorial at Geospiza Education and save myself some trouble.

I originally created the BLAST for beginners tutorial to accompany an activity called "BLASTing through the kingdom of life." In this activity, students use blastn to compare an unidentified DNA sequence with sequences in the nr database. Two of the goals of this activity are for students to identify the unknown DNA sequence and to find related sequences in other organisms.

Personally, I think it's pretty cool to find that a DNA sequence from a frog, for example, has a counterpart in a chicken.

The tutorial is also good for an activity that I call "Head, Shoulders, Knees, and Toes," where students identify unknown sequences and look for tissue-specific or developmentally specific gene expression.

Anyway, the new interface makes this slightly more complicated than it used to be.

No more doing things by default

To summarize the first set of changes, we can't use the default settings in BLAST anymore.

In the earlier incarnation, NCBI's BLAST server automatically searched a database with a large and varied collection of sequences.

Now, we have to pay attention. If we wish to look at sequences that come from organisms other than humans (and we do!), we must choose the proper database.

We also need to adjust the stringency of the search. The current default setting for a nucleotide BLAST search asks that the sequences match pretty closely. This works great if we are looking for a sequence that's almost identical to the sequence we have, but if we want to find a related sequence from a different organism, we won't find it by using the default setting.

In the tutorial, I show how to change these parameters and present a summary afterwards listing what's been changed. It's not hard, and I suppose at some level it's just as well to have to do this, because you have to think about what you're doing a bit more carefully than before. But you do have to remember to do it. It's like looking at a gel box to make sure that the right electrodes are plugged in at the correct ends. You get used to it.

And it does add one more place for things to go wrong when your students do a search.
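For anyone who wants to see why the stringency setting matters, here is a toy sketch in Python. It rests on a few assumptions of mine: that the strict default is megablast, which (as far as I know) needs an exact match of 28 bases to start an alignment, while blastn needs only 11; that the two sequences are already lined up with no gaps; and that the "related" sequence differs at every 12th base. The sequence and function names are made up for illustration. This is not how BLAST itself is implemented, only the seeding idea.

```python
MEGABLAST_WORD_SIZE = 28  # assumed default word size for megablast
BLASTN_WORD_SIZE = 11     # assumed default word size for blastn


def diverge(seq, step):
    """Toy divergence: replace every step-th base with a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(
        swap[base] if (i + 1) % step == 0 else base
        for i, base in enumerate(seq)
    )


def longest_exact_run(a, b):
    """Length of the longest stretch of identical bases at the same positions."""
    best = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


def has_seed(a, b, word_size):
    """True if the two lined-up sequences share an exact word of word_size bases."""
    return longest_exact_run(a, b) >= word_size


# A made-up 122-base query and a relative that differs at every 12th base.
query = "ATGGCCAAGTTCGATCGTACCGGATTACAGGCTTAACGTCCGATAGCTTGACCATGCAAGT" * 2
relative = diverge(query, 12)

print("longest exact run:", longest_exact_run(query, relative))
print("seed at word size 11:", has_seed(query, relative, BLASTN_WORD_SIZE))
print("seed at word size 28:", has_seed(query, relative, MEGABLAST_WORD_SIZE))
```

The relative is still about 92% identical to the query, but no stretch of 28 bases matches exactly, so a search that demands a 28-base seed never gets started. Real sequences don't diverge this evenly, so take the numbers as an illustration only.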

More and more information

The new BLAST results pages no longer present some bits of information (like the gi number) but some of the information that you do get is more useful, like the query coverage and Max % identity. I added a page to define these terms.
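As a rough illustration of those two terms, here is a small Python sketch. It assumes that query coverage means the percent of the query that falls inside at least one aligned segment, and that max identity means the highest percent identity among the aligned segments for one hit. That is my reading of the definitions, and the function names and numbers are made up.

```python
def query_coverage(query_length, segments):
    """Percent of the query covered by at least one aligned segment.

    segments: list of (start, end) query positions, 1-based and inclusive.
    Overlapping segments are merged so that no base is counted twice.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(segments):
        start = max(start, last_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_length


def max_identity(segments):
    """Highest percent identity among aligned segments.

    segments: list of (identical_bases, alignment_length) pairs.
    """
    return max(100.0 * same / length for same, length in segments)


# A 200-base query with two overlapping aligned segments (made-up numbers).
print(query_coverage(200, [(1, 100), (81, 150)]))  # bases 1-150 are covered
print(max_identity([(95, 100), (56, 70)]))         # 95% and 80% identity
```

A hit can have high identity and low coverage (a short, nearly perfect match) or the reverse, which is why it helps to look at both columns.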

You can find worksheets, sets of taxonomically diverse or tissue specific unknown sequences, and the BLAST for beginners tutorial all right here.



I'm just getting used to the new interface. I think I like it, but I'm not sure yet. The ability to save parameters seems like it will be a nice feature once I figure out how to use it (it appears to do it by default).

"The current default setting for doing a nucleotide blast search asks that the sequences match pretty closely."

I think those are the parameters for megablast. If you use blastn, I think they're more relaxed.

"I think those are the parameters for megablast. If you use blastn, I think they're more relaxed."

That's it exactly. The old interface allowed you to pick blastn directly from the BLAST home page. Now, if you choose to search a nucleotide database with a nucleotide sequence, the default choice is megablast.

I am a bit confused about why the "nr" non-redundant dataset no longer attempts to include all the data available. For example, the chimpanzee genome and other complete genome data sets are not a part of "nr".

If you want to find out what species a particular endogenous virus (see AY692036, for example) has been found in, you have to BLAST against "nr" and then BLAST whole genomes.

You can BLAST against whole genome shotgun sequences, but the resulting "hits" are from entries that are not well annotated. Some of the entries do not tell what chromosome they came from, for example.

It is better to go to each genome separately, such as
for Rhesus macaque genome, and BLAST against each one.

By Brian Foley (not verified) on 12 Jul 2007

From the NCBI description, the nr data set contains:
All GenBank+RefSeq Nucleotides+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".

Why did the nr database get reorganized? I don't know, but I would guess that bandwidth has something to do with it. BLAST is heavily used and I'm sure the NCBI gets complaints when people feel their searches are taking too long.

Why wouldn't the nr data set include genome sequences? And why are the whole genome shotgun sequences annotated so poorly?

I think the answers to these questions are connected to the way that genome sequences are produced.

At least some of the sequences in the nr database will come from smaller-scale research projects where people are looking at one gene, or a few genes. Many of the sequences will probably even be derived from single reads, where the sequences might not require assembly, or they'll be obtained from low-redundancy sequencing projects where a region was sequenced only once from each strand.

Genome sequences, on the other hand, are obtained from more assembly-line, production-style facilities (there's even a movie about the one at Washington University). One group chops up the sequences and makes the libraries, another group does the reactions, another group might load the gels, and a different group assembles the sequences. All of this can occur without any knowledge of where the sequences map in the genome or what they do. The whole genome shotgun (WGS) sequences are deposited from an intermediate step in this process, before the sequences are finished. In a sense, they're almost anonymous sequences, and I wouldn't really expect there to be much annotation, especially since the people involved in producing them rarely know what they are.

After the smaller shotgun sequences are produced, they get assembled into larger, longer sequences, such as contigs, chromosomes, and eventually genomes. This is where annotations are added.

It's important in a research project to figure out all the different data sets that you should blast against and make a plan so that you can schedule searches and updates automatically and try out different parameters. I think this is why people have always liked the BLAST server/data management system that we sell, since it allows you to do that, store all your results, and use your own data sets.
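If you'd rather sketch such a plan yourself, here is a minimal Python example. It is not the system mentioned above. It assumes you have the NCBI BLAST+ command-line tools and some local databases; the file name `unknown.fasta` and the database names are hypothetical. The function only builds the command lines, one per combination of data set and search type, so you can look them over before running or scheduling anything.

```python
from itertools import product


def blast_commands(query_file, databases, tasks, evalue=1e-5):
    """Build one blastn command (as an argument list) per database/task pair."""
    commands = []
    for db, task in product(databases, tasks):
        commands.append([
            "blastn",
            "-task", task,             # e.g. "megablast" (strict) or "blastn" (relaxed)
            "-query", query_file,
            "-db", db,
            "-evalue", str(evalue),
            "-outfmt", "6",            # tabular output, easy to store and compare
            "-out", f"{db}.{task}.tsv",
        ])
    return commands


# Hypothetical query file and database names.
plan = blast_commands(
    "unknown.fasta",
    databases=["nt", "rhesus_genome"],
    tasks=["megablast", "blastn"],
)
for command in plan:
    print(" ".join(command))
```

Each list could then be handed to something like `subprocess.run(command, check=True)` from a scheduled job, so the same searches get repeated whenever the data sets are updated.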