Discovering Biology in a Digital World

My thoughts on biology, teaching, life, and exploring the living world via the digital one. Only my opinions are represented by these postings; they do not represent the viewpoints of any funding agency or Geospiza, Inc.


Digital Biology Friday: hot plants and viruses, part IV

Category: Bioinformatics, Science education, sequence analysis
Posted on: May 18, 2007 8:43 AM, by Sandra Porter


Quick synopsis: A type of grass grows in Yellowstone National Park in hot (65° C), unfriendly soil. How the plant manages this feat is a mystery. What we do know is that the grass can only tolerate high temperatures if it's been infected by a fungus, and the fungus has to be infected by an RNA virus. In the paper describing this discovery, the researchers provided the GenBank accession numbers for the viral sequences. I decided to see if I could find out more about the proteins and what they do. Read part I, part II, and part III.

And now, on with our story.

Down the rabbit hole we go, but:

We begin with a BLAST


I started the quest by using the accession numbers from the paper to get the GenBank records and the sequences. The authors of the paper had already found that one piece of viral RNA (RNA 1) codes for a protein that's likely to be a replicase (1). I confirmed this finding using blastp and found that the predicted proteins do contain a conserved replicase domain.

(Q: Why do I call this a "predicted" protein?

A: Because this is a conventional way of referring to a protein whose existence has yet to be supported by physical data.)

A replicase, by the way, is a perfectly reasonable thing for a viral RNA to encode. All viruses have to have a way to get their genomes copied if they're going to be able to go off and infect new cells. They would use a replicase to make new copies of the RNA genome. 

That leaves the question of the other RNA (this virus has two pieces of RNA that were sequenced). I looked up the predicted protein sequences for the second RNA and used blastp to compare the predicted proteins to the sequences in the non-redundant database. I couldn't find anything for the smaller protein (168 aa), but the larger protein matched lots of hypothetical proteins from fungi. Some of those results are described here.
 

Accession number | RNA sequence | Protein GI number | Result
EF120984 | RNA 1 | ABM92658 | possible replicase, contains a conserved domain for an RNA-dependent RNA polymerase
EF120984 | RNA 1 | ABM92659.1 | possible replicase? some matches to the catalytic region of an integrase
EF120985 | RNA 2 | ABM92660.1 | 331 amino acids; matches lots of hypothetical fungal proteins with unknown functions
EF120985 | RNA 2 | ABM92661.1 | 168 amino acids; no match to anything

Lost in translation?

The next thing that I tried was blastx. BLASTX takes a DNA (or RNA) sequence, figures out all of the amino acid sequences that could be produced from all six ways of reading the sequence (we call this "translation") and then compares all the possible sequences to a database of protein sequences.
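The translation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how BLASTX is implemented: it assumes the standard genetic code, reads RNA as DNA by swapping U for T, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code in the usual NCBI layout:
# codons are ordered TTT, TTC, TTA, TTG, TCT, ... with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(sequence):
    """Return the six possible translations, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = sequence.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

BLASTX then compares each of those six candidate protein sequences to the protein database, which is the part this sketch leaves out.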

I tried blastx for two reasons.

First, many annotation errors are made because of DNA sequencing mistakes. If one or two bases are missing, the translation can be messed up. It would be like this sentence: "The fat cat sat on pat." If this sentence used the same reading frame and had a letter missing, it would read: "Thf atc ats ato npa t." Imagine now that I went to the library and tried to find a book with the phrase "Thf atc ats." If I had the right sentence, I would probably find Dr. Seuss.  If I used the messed up sentence, I'd be out of luck.
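The sentence analogy can be reproduced with a short Python sketch. The helper names are hypothetical, and the analogy is a little loose: read strictly in threes, even the intact sentence comes out as "The fat cat sat onp at", because "on" has only two letters.

```python
def read_in_threes(text):
    """Drop the spaces and regroup the letters three at a time, like codons."""
    letters = text.replace(" ", "")
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


def delete_letter(text, index):
    """Remove one letter (counted without spaces), like a missing base."""
    letters = text.replace(" ", "")
    return letters[:index] + letters[index + 1:]


sentence = "The fat cat sat on pat"
print(read_in_threes(sentence))
# Losing the 'e' shifts every group that follows it.
print(read_in_threes(delete_letter(sentence, 2)))  # Thf atc ats ato npa t
```

One lost letter near the start changes every "word" after it, which is why a single missing base can make the rest of a predicted protein unrecognizable.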

The public databases have lots of these kinds of mistakes in translation. In a perfect world, I would be able to get the trace data from the DNA sequences, look at it myself with FinchTV, evaluate the quality, and possibly reassemble the sequences. In the real world, much of this data is not publicly available. NCBI, for example, only stores trace data for a small number of viruses - most of them influenza. But enough whining, let's move on.

A challenge with blastx is that different organisms use different versions of the genetic code and it's not always possible to know which version is used by the organism that you're studying. NCBI offers a choice of 13 genetic codes but I didn't have any luck trying to find which code would be used by my RNA virus or even the fungal host. After chewing on this for a while, I picked "yeast nuclear," reasoning that the virus infects a fungus and yeast is a fungus.
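To see why the choice of genetic code matters, here is a minimal Python sketch comparing the standard code with NCBI translation table 12 (Alternative Yeast Nuclear), in which the codon CUG is read as serine instead of leucine. Treating table 12 as the "yeast nuclear" option is my assumption, and the example sequence and names are invented for illustration.

```python
from itertools import product

# Standard genetic code in the usual NCBI layout (bases in T, C, A, G order).
BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(codon) for codon in product(BASES, repeat=3)]
STANDARD_CODE = dict(zip(CODONS, STANDARD_AAS))

# Assumption: "yeast nuclear" corresponds to NCBI table 12, whose amino acid
# assignments differ from the standard code only at CTG (CUG in RNA): Ser, not Leu.
ALT_YEAST_NUCLEAR_CODE = {**STANDARD_CODE, "CTG": "S"}


def translate(sequence, code):
    """Translate complete codons in frame 1 using the supplied codon table."""
    dna = sequence.upper().replace("U", "T")
    return "".join(code.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


example = "ATGCTGAAA"  # made-up sequence containing one CTG codon
print(translate(example, STANDARD_CODE))           # MLK
print(translate(example, ALT_YEAST_NUCLEAR_CODE))  # MSK
```

A single reassigned codon only changes the residues at those positions, so picking a plausible but wrong code tends to degrade a match rather than destroy it the way a frame shift does.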

Here are the results:

 

The top two matches (red bars) are to the predicted sequences that are deposited in GenBank.  They serve as a positive control, since they should match themselves.  

Scanning down the page, I see that the next best matching sequences are (naturally) from hypothetical or putative proteins.  They have good E values, though, and it is reassuring that they come from fungi (or possibly fungal viruses; I don't have enough data to know which).

Looking farther down, I see a couple of long sequences that match both proteins.  Both are from rice and one is a transposon sequence.  They look like a good match and seem to fit my idea about a possible frame shift.  But nothing is known about these proteins, so I decide on another path.


Taking a random walk?
I stumbled on the next path by accident. I was planning to look at some of the "hypothetical" and "putative" fungal sequences and see if they matched anything interesting, when I found something new.

I had called up the GenBank record for the 331 amino acid protein from RNA 2 and clicked "BLink."  BLink is short for "Blast link." BLink takes me to a database of pre-computed blastp results for my Curvularia protein.

I like to use BLink since it has lots of filters for viewing which sequences belong to which kingdom, which part of the protein aligns, which sequences have structures, and so on.  I decided that I would get a list of these sequences and use those as queries for more searching.  So, I clicked the GI list button to get a set of sequences and instead got a surprise!

[Image: 3_related.gif, the BLink page showing a Related Structures tab]

I never saw that Related Structures tab before!

What could it mean?

Join us next Friday, when we go through the looking glass and see what we can find there.

Reference:
1. Márquez, L., et al. 2007. A Virus in a Fungus in a Plant: Three-Way Symbiosis Required for Thermal Tolerance. Science 315: 513-515.

Copyright Geospiza, Inc.
