If you've just joined us, we're in the middle of a quest to find the identity of an unknown nucleotide sequence. To summarize our results so far, we used this sequence to do a blastn search of GenBank, using all the default settings at the NCBI. You can see the beginning of the project here.
And we had some rather curious results.
It appeared that our sequence matched sequences from very diverse organisms, like Dengue virus, E. coli, and Simian Immunodeficiency virus. Very strange!
There was another curious word, too, that appeared in the descriptions for each of the results.
That word was VECTOR. "Vector" is a word that I imagine Sherlock Holmes would have used if he wanted to interrogate a scientist or mathematician and find out what they did without having them realize that he was trying to do so.
To a mathematician or a physicist, a vector is a quantity with both a magnitude and a direction, often drawn as an arrow. To a public health official, a vector is a rat, mouse, louse, or insect: anything capable of carrying a disease.
And, to a molecular biologist, a vector can be a plasmid, phage, or eukaryotic virus that is used to move genes around from place to place. This information can help us make some good guesses about the function of our unknown bit of DNA, because vectors have been engineered to have some common features. Some of these are special DNA sequences that allow plasmids to be copied. Others are genes that encode enzymes that make bacteria resistant to different antibiotics. If a bacterial cell contains a plasmid with one of these antibiotic resistance genes, it produces a protein that allows it to live in the presence of that antibiotic. These features are helpful for biologists because we can select the bacteria that are resistant to a drug and kill off all the rest.
Okay, where were we?
Back to our results:
Here is our list of matching sequences from the blastn search. We had some good guesses last week about the answer, and one was right, but it involved far too much work.
I think it's far easier to look at the data.
We click the link to the alignment score.
This shows us where our sequences match each other. Pay attention to the positions of the subject sequence that match our query! We need to remember these. Our sequence starts matching at position 44,246 and stops matching at position 44,665.
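If you'd rather not do the arithmetic in your head, here is a minimal Python sketch that turns the two positions above into plain numbers and works out how long the matching region is. The helper name `parse_position` is my own invention for this example, not something BLAST gives you.

```python
def parse_position(text: str) -> int:
    """Turn a position written like '44,246' or '44, 246' into an integer."""
    return int(text.replace(",", "").replace(" ", ""))


# Subject coordinates read off the alignment in this post.
start = parse_position("44,246")
end = parse_position("44,665")

# Sequence positions are inclusive on both ends, hence the +1.
match_length = end - start + 1

print(f"Subject match: {start}..{end} ({match_length} bases)")
```

So the piece of the subject sequence we care about is 420 bases long, and those two numbers are what we'll look for in the record.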
Then we click the link to the matching sequence, and scroll down the page.
Eventually, we reach numbers. These numbers represent positions in the DNA sequence.
Here's the region where our sequence matches:
And our answer is the beta-lactamase gene. This gene codes for an enzyme that breaks the beta-lactam ring, thus disabling antibiotics like penicillin.
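What we just did by scrolling can be sketched in a few lines of Python: take the positions of the match and ask which annotated features overlap them. Please treat this as an illustration only. The `Feature` type, the function name, and every feature coordinate in the table are hypothetical, made up for the example; only the match positions (44,246 to 44,665) come from the alignment above.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # first base of the feature (1-based, inclusive)
    end: int    # last base of the feature (1-based, inclusive)


def features_overlapping(features: List[Feature], hit_start: int, hit_end: int) -> List[Feature]:
    """Return every feature that shares at least one base with the hit."""
    return [f for f in features if f.start <= hit_end and f.end >= hit_start]


# HYPOTHETICAL annotation table. These coordinates are invented for
# illustration and are not copied from the real GenBank record.
features = [
    Feature("some upstream gene (hypothetical)", 40_100, 43_900),
    Feature("beta-lactamase (hypothetical coordinates)", 44_000, 44_860),
    Feature("some downstream gene (hypothetical)", 45_200, 46_000),
]

# Match positions from the blastn alignment in this post.
hits = features_overlapping(features, 44_246, 44_665)
for feature in hits:
    print(feature.name, feature.start, feature.end)
```

With a real record you would fill the table from the FEATURES section of the GenBank entry instead of typing in made-up numbers.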
Well, it is easier, but you still do not know exactly which part of your DNA sequence matches the annotated protein.
To find that out, it is much better to do a BLAST search against a protein database. Then you will also have information about the conservation of your sequence, which can be useful.
After that, you can use PFAM to make sure that the protein has a "functional" conserved domain.
As it stands, this is only the very first step; you then have to carry on and verify your initial findings using more specific tools.
Actually, you can look at the GenBank record and see how the DNA sequence corresponds to the encoded protein. I show it here.
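If it helps to see that correspondence spelled out, here is a minimal sketch of how a stretch of DNA maps onto protein, using the standard genetic code. The function name and the short test sequence are my own choices for illustration; the sequence is not copied from the record in this post, and the sketch assumes you already know the reading frame and are reading the coding strand.

```python
from itertools import product

# Standard genetic code, with codons listed in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 1, codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Illustrative fragment only, not taken from the GenBank record.
fragment = "ATGAGTATTCAACATTTCCGTGTCGCC"
protein = translate(fragment)
print(protein)  # MSIQHFRVA
```

Each group of three bases gives one amino acid, which is the same pairing the GenBank record shows between the sequence and the translation listed under the CDS feature.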
I agree, PFAM is helpful if you're trying to understand the function of a truly unknown protein, or if your match isn't as good as it was in this case (100%). I also really like the Conserved Domain Database.
Hey, thanks for the really informative posts. I've been trying to get a handle on this stuff for a while, and seeing these tasks done in context just made it all click.