Checking out the new Ebola virus and playing some tricks with BLAST

Ebola virus has impressed me as creepy ever since I read "The Hot Zone: A Terrifying True Story" by Richard Preston some years back. (I guess he has a new book, too, "Panic in Level 4: Cannibals, Killer Viruses, and Other Journeys to the Edge of Science," but I haven't been in an airport for the past couple of weeks, so I haven't read it yet.)

Infectious agents that cause diseases with gruesome symptoms really excite those of us with an interest in microbiology. Tara has written about this paper, too, and summarized the details.

I thought I'd show you how to have fun re-analyzing the data, demonstrate a new and unexpected feature that I happened to find in NCBI BLAST, and see if we can reproduce the phylogenetic tree from the paper by using the tree algorithms at the NCBI. Making phylogenetic trees is often kind of painful in the classroom, for various reasons, and I wanted to see if we could find a more user-friendly method.

Bone picking

First, I want to point out that the authors of this paper (1) were a bit negligent concerning their materials and methods. This is irksome, although not uncommon where the bioinformatics methods are concerned. (You know, we did computer stuff, it's all magic anyway.)

If you look at this figure from the paper, you'll see 14 different genome sequences listed in the tree. I would expect to find the accession numbers for all 14 sequences in the paper.

i-386a6ea36f986bf963efb024c6fa8297-ebola_tree.gif

Did I find the accession numbers in the paper?

No. I found six out of the 14, less than half. I only found two of the Marburg sequences in GenBank, and I'm not positive that those were the same ones used in the paper. I think the reviewers were sloppy in this regard. How can an experiment be repeated if the materials aren't described, or even available? I would have thought the reviewers would at least look to see if the accession numbers were in the paper (they're not).

The paper gives the impression that complete genomes were used to create the tree in their figure. If that's true, it's hard to see where the data lives.

Still, I found the six most important genomes and a couple from Marburg virus (2), so I had some material to work with.

Learning from our mistakes

The reason I wanted these genomes was that I wanted to see if I could reproduce the published tree by using the tree analysis algorithms at the NCBI.

My first attempt failed miserably. I would enter my query in the usual place and then enter the accession numbers for the other viruses as an Entrez query.

The problem was that none of the BLAST databases I queried contained the full set of viral sequences that I wanted to see.

Actually, as it turned out, one database did contain some of the sequences, but the others were in a different database. Since NCBI BLAST only allows me to query one database at a time, I couldn't search the data set that I wanted to see.

(At least that's what I thought.)

This difficulty with comparing sequences from different databases in one BLAST experiment has long been a source of frustration for me. If I'm doing this for work, I just make my own database and use our Finch Software for running BLAST or I run BLAST on my Mac.
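For readers who do want the local route, here is a minimal sketch of what "make my own database and run BLAST" can look like. It assumes the NCBI BLAST+ command-line tools (`makeblastdb` and `blastn`) are installed and on the PATH, which may not match the setup I use. The file names and the database name are hypothetical placeholders.

```python
import subprocess


def local_blast_commands(subjects_fasta, query_fasta,
                         db_name="my_viruses", out_file="hits.tsv"):
    """Build the two BLAST+ command lines; nothing is executed here.

    All file and database names are hypothetical placeholders.
    """
    # Step 1: turn a FASTA file of subject genomes into a nucleotide database.
    make_db = ["makeblastdb", "-in", subjects_fasta,
               "-dbtype", "nucl", "-out", db_name]
    # Step 2: search that database with the query; -outfmt 6 is tabular output.
    search = ["blastn", "-query", query_fasta, "-db", db_name,
              "-outfmt", "6", "-out", out_file]
    return make_db, search


def run_local_blast(subjects_fasta, query_fasta):
    """Run both steps in order; requires BLAST+ to be installed."""
    for cmd in local_blast_commands(subjects_fasta, query_fasta):
        subprocess.run(cmd, check=True)


# Print the commands rather than running them, so this works without BLAST+.
for cmd in local_blast_commands("subjects.fasta", "query.fasta"):
    print(" ".join(cmd))
```

Calling `run_local_blast` is the part that needs a working BLAST+ installation, which is exactly the setup burden I'd like to spare students.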

When I'm teaching a class, though, I don't want to make students learn UNIX and install BLAST on their laptops, and we haven't put BLAST on the student version of our software.

New tricks with BLAST
Luckily, I noticed that BLAST has something new.

i-11a771668c264042ae2408362200692d-new-checkbox.gif

I had ignored this new checkbox because I didn't want to compare two sequences.

However, that was a mistake. That checkbox is useful!

When I clicked it, another window opened up.

i-954a8a7bf72e9792aad451120638893a-new_window.gif

Now, I could enter the accession numbers for my new sequences!

Notice, too, the format. I tried some different ways for entering the numbers.

This method: Accession1, Accession2, Accession3..... Did not work.

This method: Accession1 Accession2 Accession3..... Did not work.

I could only get BLAST to work if I entered the sequences like this:

Accession1
Accession2
Accession3

I don't know why that is, but it really did work. I used these sequences:

NC_002549
AY354458
NC_006432
FJ217162
NC_004161
NC_001608
DQ447653

with the newly discovered virus, FJ217161, as the query.
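If your accession numbers start out as a comma- or space-separated list (say, copied from a paper), a few lines of Python will put them in the one-per-line layout that worked for me. This is just a convenience sketch; the function name is my own invention and has nothing to do with BLAST itself.

```python
import re


def one_accession_per_line(raw):
    """Split on commas and/or whitespace, then join with newlines."""
    tokens = [t for t in re.split(r"[,\s]+", raw.strip()) if t]
    return "\n".join(tokens)


# The subject accessions from this post, deliberately typed in a messy mix
# of commas, spaces, and line breaks.
messy = "NC_002549, AY354458 NC_006432,FJ217162\nNC_004161 NC_001608, DQ447653"
print(one_accession_per_line(messy))
```

The printed text can be pasted straight into the subject box.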

Here are my results:

i-80a0de7e6e0ffbd27b8d05058f55d6fb-ebola_blast.gif

and, when I click the Distance tree of results link in BLAST, and use the default tree settings, I get a tree with the same shape and arrangement as the one in the paper (1).

i-2e9c90289f5c287dcd0b19f3eecd1a1e-tree_results.gif

Formatting the tree

Nothing is perfect of course. It's impossible to make the font size from the NCBI tree large enough to read.

Luckily, you can download the tree in the Newick format and make a pretty picture with the combination of NJplot and a graphics program like Adobe Illustrator.

Here it is after some formatting. I highlighted the new virus.

i-6ddd72b37fde3f91bf7af895fb91ee61-blast_tree_newick.gif
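If you'd rather check a Newick file before opening a graphics program, the format is plain text: nested parentheses, with each leaf written as `name:branch_length`. Here is a small sketch that pulls out the leaf names and flags the one you care about. The tree in the example is a toy with made-up branch lengths and placeholder subject names, not the tree from BLAST, and the simple pattern assumes unquoted labels without spaces.

```python
import re


def leaf_labels(newick):
    """Return leaf names from a Newick string with simple, unquoted labels.

    A leaf label comes right after "(" or ","; labels that follow ")"
    belong to internal nodes and are skipped.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def mark_leaf(newick, target, marker="*"):
    """List the leaves, appending a marker to any label containing target."""
    return [label + marker if target in label else label
            for label in leaf_labels(newick)]


# Toy tree: the topology and branch lengths are placeholders, not real results.
toy_tree = "((FJ217161:0.1,subject_1:0.1):0.05,subject_2:0.2);"
print(mark_leaf(toy_tree, "FJ217161"))
```

This only finds the label to highlight; the drawing itself still happens in NJplot or a graphics program.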

Conclusions:

1. This new BLAST feature, where you can BLAST against your own set of sequences, is really helpful.

2. You can make the correct trees with the Minimum Evolution algorithm at the NCBI, but you will need to format your trees (i.e. make them pretty) somewhere else if you need pretty pictures.

References:

  1. Jonathan S. Towner, Tara K. Sealy, Marina L. Khristova, César G. Albariño, Sean Conlan, Serena A. Reeder, Phenix-Lan Quan, W. Ian Lipkin, Robert Downing, Jordan W. Tappero, Samuel Okware, Julius Lutwama, Barnabas Bakamutumaho, John Kayiwa, James A. Comer, Pierre E. Rollin, Thomas G. Ksiazek, Stuart T. Nichol (2008). Newly Discovered Ebola Virus Associated with Hemorrhagic Fever Outbreak in Uganda. PLoS Pathogens, 4 (11). DOI: 10.1371/journal.ppat.1000212
  2. J. S. Towner (2006). Marburgvirus Genomics and Association with a Large Hemorrhagic Fever Outbreak in Angola. Journal of Virology, 80 (13), 6497-6516. DOI: 10.1128/JVI.00069-06

I feel your pain.
Lack of accession numbers in the literature is a constant problem for me as well. What are journal editors thinking, not checking for these before the paper's published? I work with the spotty zebrafish genome, where at NCBI exons and so forth are rarely mapped, so it's especially frustrating.

How is this method different from other phylogenetic tree programs, e.g. Phylip?

Anon -

Phylip is a collection of at least 30 different programs. Some of these programs use different methods to make trees.

The NCBI offers many of the same methods for making these trees that are offered in Phylip along with some additional algorithms. They have more information at their site and descriptions of the programs if you're interested.

It's also nice that the work happens on the NCBI server and not on your computer. This takes care of some of the problems that I've run into when I've run Phylip on my computer.