Updates to the BLAST for beginners tutorial

By sporte on July 12, 2007.

By now, many of you have probably seen the new BLAST web interface at the NCBI. There are many good things that I can say about it, but a few of the changes caught me by surprise during my last couple of classes.

Because of these changes, and because I'm giving a workshop for teachers on BLAST at the Fralin Biotechnology Conference in Blacksburg, VA, next week, it seemed like a good time to update our animated BLAST tutorial at Geospiza Education and save myself some trouble.

I originally created the BLAST for beginners tutorial to accompany an activity called "BLASTing through the kingdom of life." In this activity, students use blastn to compare an unidentified DNA sequence with sequences in the nr database. Two of the goals in this activity are for students to identify the unknown DNA sequence and to find related sequences in other organisms.

Personally, I think it's pretty cool to find that a DNA sequence from a frog, for example, has a counterpart in a chicken.

The tutorial is also good for an activity that I call "Head, Shoulders, Knees, and Toes," where students identify unknown sequences and look for tissue-specific or developmentally specific gene expression.

Anyway, the new interface makes this slightly more complicated than it used to be.

No more doing things by default
To summarize the first set of changes, we can't use the default settings in BLAST anymore.

In the earlier incarnation, NCBI's BLAST server automatically searched a database with a large and varied collection of sequences.

Now, we have to pay attention. If we wish to look at sequences that come from organisms other than humans (and we do!), we must choose the proper database.

We also need to adjust the stringency of the search. The current default setting for a nucleotide BLAST search asks that the sequences match pretty closely. This works great if we are looking for a sequence that's almost identical to the sequence we have, but if we want to find a closely related sequence from a different organism, we won't find it by using the default setting.
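One parameter behind this stringency is the word size: BLAST only starts an alignment where the two sequences share a short exact match of that length. Here's a minimal sketch of the idea in Python. The sequences are made up, the helper names are hypothetical, and the word sizes of 28 and 11 are the commonly cited defaults for megablast and blastn, which I'm treating as assumptions here. This is not how BLAST itself is implemented.

```python
def words(seq, word_size):
    """Return the set of all exact substrings of length word_size in seq."""
    return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}


def shared_words(query, subject, word_size):
    """Return the words of length word_size that occur in both sequences."""
    return words(query, word_size) & words(subject, word_size)


def mutate_every(seq, step):
    """Return a copy of seq with every step-th base swapped for another base."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    return "".join(
        swap[base] if (i + 1) % step == 0 else base
        for i, base in enumerate(seq)
    )


# A made-up 60-base query and a "relative" that differs at every 15th base
# (4 mismatches in 60 positions).
query = "ATGGCGTACCTGAAGTCCGATTGCAAGGTTCACGGATCTTGGCAACTGCATGTCAGGTAC"
subject = mutate_every(query, 15)

for word_size in (28, 11):  # assumed defaults for megablast and blastn
    found = bool(shared_words(query, subject, word_size))
    print(f"word size {word_size}: shared seed word found = {found}")
```

With a mismatch every 15 bases, the longest exact match is 14 bases, so the larger word size never finds a place to start while the smaller one does. That's the situation a related sequence from another organism can put us in.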

In the tutorial, I show how to change these parameters and present a summary afterwards listing what's been changed. It's not hard, and I suppose at some level it's just as well to have to do this because you have to think about what you're doing a bit more carefully than before. But you do have to remember to do it. It's like looking at a gel box to make sure that the right electrodes are plugged in at the correct ends. You get used to it.

And it does add one more place for things to go wrong when your students do a search.

More and more information
The new BLAST results pages no longer present some bits of information (like the gi number), but some of the information that you do get is more useful, like the query coverage and Max % identity. I added a page to define these terms.
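For readers who like to see the arithmetic, here's a rough sketch of how I read those two columns: query coverage as the percent of query positions that fall inside at least one aligned segment, and max identity as the best percent identity among the aligned segments for a hit. The function names, the dictionary keys, and the numbers are all made up for illustration, and NCBI's exact calculation may differ in its details.

```python
def query_coverage(query_length, hsps):
    """Percent of query positions covered by at least one aligned segment."""
    covered = set()
    for hsp in hsps:
        # Positions are 1-based and inclusive, as on a BLAST results page.
        covered.update(range(hsp["q_start"], hsp["q_end"] + 1))
    return 100.0 * len(covered) / query_length


def max_identity(hsps):
    """Highest percent identity among the aligned segments of one hit."""
    return max(100.0 * hsp["identities"] / hsp["align_len"] for hsp in hsps)


# Two hypothetical aligned segments from one hit against a 200-base query.
hsps = [
    {"q_start": 1, "q_end": 100, "identities": 95, "align_len": 100},
    {"q_start": 81, "q_end": 150, "identities": 63, "align_len": 70},
]

print(query_coverage(200, hsps))  # 75.0 (positions 1-150 are covered)
print(max_identity(hsps))         # 95.0 (the better of 95% and 90%)
```

Notice that the overlapping positions (81 to 100) are only counted once toward coverage.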

You can find worksheets, sets of taxonomically diverse or tissue specific unknown sequences, and the BLAST for beginners tutorial all right here.

Enjoy!

I'm just getting used to the new interface. I think I like it, but I'm not sure yet. The ability to save parameters seems like it will be a nice feature once I figure out how to use it (it appears to do it by default).

The current default setting for doing a nucleotide blast search asks that the sequences match pretty closely.

I think those are the parameters for megablast. If you use blastn, I think they're more relaxed.

I think those are the parameters for megablast. If you use blastn, I think they're more relaxed.

That's it exactly. The old interface allowed you to pick blastn directly from the BLAST home page. Now, if you choose to search a nucleotide database with a nucleotide sequence, the default choice is megablast.
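The same choice shows up if you run searches from the command line. In the stand-alone BLAST+ tools, which came out after the web interface discussed here, the blastn program takes a -task option that selects megablast or the more relaxed blastn. Here's a small sketch that only builds the command lines without running them. The file name and database name are placeholders I made up.

```python
def blastn_command(query_fasta, database, task="megablast"):
    """Build (but do not run) a BLAST+ blastn command line as a list of arguments."""
    allowed = {"megablast", "dc-megablast", "blastn"}
    if task not in allowed:
        raise ValueError(f"task must be one of {sorted(allowed)}")
    return [
        "blastn",
        "-task", task,          # megablast for near-identical, blastn for more divergent
        "-query", query_fasta,  # FASTA file holding the unknown sequence
        "-db", database,        # name of a nucleotide BLAST database
        "-outfmt", "6",         # tabular output
    ]


# Placeholder file and database names, for illustration only.
for task in ("megablast", "blastn"):
    print(" ".join(blastn_command("unknown.fasta", "my_nucleotide_db", task)))
```

Keeping the task as an explicit argument makes it harder to forget which kind of search you asked for.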

I am a bit confused about why the "nr" non-redundant dataset no longer attempts to include all the data available. For example, the chimpanzee genome and other complete genome data sets are not a part of "nr".

If you want to find out what species a particular endogenous virus (See AY692036 for example) has been found in, you have to BLAST against "nr" and then BLAST whole genomes.

You can BLAST against whole genome shotgun sequences, but the resulting "hits" are from entries that are not well annotated. Some of the entries do not tell what chromosome they came from, for example.

It is better to go to each genome separately, such as
http://www.ncbi.nlm.nih.gov/projects/genome/seq/BlastGen/BlastGen.cgi?t…
for Rhesus macaque genome, and BLAST against each one.

From the NCBI description, the nr data set contains:
All GenBank+RefSeq Nucleotides+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".

Why did the nr database get reorganized? I don't know, but I would guess that bandwidth has something to do with it. BLAST is heavily used and I'm sure the NCBI gets complaints when people feel their searches are taking too long.

Why wouldn't the nr data set include genome sequences? And why are the whole genome shotgun sequences annotated so poorly?

I think the answers to these questions are connected to the way that genome sequences are produced.

At least some of the sequences in the nr database will come from smaller-scale research projects where people are looking at one gene, or a few genes. Many of the sequences will probably even be derived from single reads, where the sequences might not require assembly, or they'll be obtained from low-redundancy sequencing projects where a region was sequenced only once from each strand.

Genome sequences, on the other hand, are obtained from more assembly-line, production-style facilities (there's even a movie about the one at Washington University). One group chops up the sequences and makes the libraries, another group does the reactions, another group might load the gels, and a different group assembles the sequences. All of this can occur without any knowledge of where the sequences map in the genome or what they do. The whole genome shotgun (WGS) sequences are deposited from an intermediate step in this process, before the sequences get assembled or finished. In a sense, they're almost anonymous sequences and I wouldn't really expect there to be much annotation, especially since the people involved in producing them rarely know what they are.

After the smaller shotgun sequences are produced, they get assembled into larger, longer sequences, such as contigs, chromosomes, and eventually genomes. This is where annotations are added.

It's important in a research project to figure out all the different data sets that you should blast against and make a plan so that you can schedule searches and updates automatically and try out different parameters. I think this is why people have always liked the BLAST server/data management system that we sell, since it allows you to do that, store all your results, and use your own data sets.
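As a rough illustration of what such a plan might look like, here's a sketch that lists every combination of data set and parameter set so that nothing gets skipped. The data set labels and parameter values are placeholders I chose for this example, and this isn't how any particular BLAST management system works.

```python
from itertools import product

# Hypothetical labels for the data sets discussed in this thread.
databases = ["nr", "whole_genome_shotgun", "rhesus_macaque_genome"]

# Hypothetical parameter sets: one strict, one more relaxed.
parameter_sets = [
    {"task": "megablast", "word_size": 28},
    {"task": "blastn", "word_size": 11},
]


def build_search_plan(databases, parameter_sets):
    """Return one job description per (database, parameter set) pair."""
    return [
        {"database": db, **params}
        for db, params in product(databases, parameter_sets)
    ]


plan = build_search_plan(databases, parameter_sets)
for job in plan:
    print(job)
```

Three data sets and two parameter sets give six searches, and a list like this is easy to hand to whatever you use to schedule the runs.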
