Discovering Biology in a Digital World

My thoughts on biology, teaching, life, and exploring the living world via the digital one. Only my opinions are represented by these postings; they do not represent the viewpoints of any funding agency or Geospiza, Inc.

Profile

Sandra Porter

I am a microbiologist and molecular biologist turned tenured biotech faculty turned bioinformatics scientist turned entrepreneur. My passion is developing instructional materials for 21st century biology (Digital World Biology).


Am I really related to Cleopatra? Qualitatively measuring DNA sequence quality

Category: Bioinformatics, Biotechnology, Genomics, Mitochondria, sequence analysis
Posted on: July 5, 2006 7:00 AM, by Sandra Porter

What do genetic testing and genealogy have in common?

The easy answer is that they're both used by people who are trying to find out who they are, in more ways than one.

Another answer is that both tests can involve DNA sequence data.

And that leads us to another question. If the sequence of my mitochondrial DNA is only two bases different from Cleopatra's, am I really a distant relative? And how do I really even know that my mitochondrial DNA is only two bases different in the first place? What does having a DNA sequence really mean?
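To make "two bases different" concrete, here is a minimal sketch of how such a count could be made. It assumes the two sequences have already been aligned to the same length, and the sequences shown are made up for illustration; they are not real mitochondrial data. The function name is hypothetical.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length, pre-aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)


# Made-up sequences for illustration only.
mine = "ACCTGATTCAGG"
reference = "ACCTGGTTCAGA"
print(count_differences(mine, reference))  # 2
```

A count like this treats every base as equally trustworthy, which is exactly the assumption the rest of this article questions.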

Students sequencing mitochondrial DNA
I wrote earlier about a project where students amplify mitochondrial DNA, send it to the Dolan DNA Learning Center for sequencing, search for related sequences, and then try to understand the results.

Understanding and interpreting the data, it seems, is the toughest part of the experiment, especially since some aspects of DNA sequencing never seem to get discussed. Fortunately, the Dolan DNA Learning Center added a new link on their results page called "Trace file." Now, the students who participate in this project can download their own chromatogram trace files and begin to evaluate just how good (or bad) the data really are, and which regions of the data are good or bad.

After all, how confident can you be that Cleopatra was one of your maternal ancestors, if the critical data points are really crummy?


But what is DNA sequence quality, anyway? Aren't DNA sequences just A's, C's, G's, and T's?
Is that DNA sequence that you might get from a textbook, a test result, or a GenBank record at the NCBI really as definitive as it looks? How do we know for certain that base number 11 is really a C and not a T?

DNA sequence

If you look at the DNA sequence on the right, all the bases (except for that N at position 12) look equally good.

But there can be differences. The NCBI stores different kinds of DNA sequences in GenBank and the quality of different kinds of sequences is measured in different ways. Some of the sequences in GenBank are genome sequences. Since technical limitations make it impossible to determine a genome from a single sequencing reaction, the sequence of a genome is a composite that's derived from the results of many different experiments. Genome sequences at the NCBI are usually pretty high quality, in part because the final sequence is composed of the highest quality bases, chosen from several different samples.
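The idea of building a composite from the highest quality bases can be sketched in a few lines. This is only an illustration, assuming each read is already aligned and that every base carries a numeric quality score where higher means better; the scores, the data layout, and the function name are all hypothetical, and real genome assembly software is considerably more involved.

```python
from typing import List, Tuple

Read = List[Tuple[str, int]]  # one (base, quality) pair per aligned position


def best_base_consensus(reads: List[Read]) -> str:
    """At each position, keep the base from the read with the highest quality."""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for position in zip(*reads):
        base, _quality = max(position, key=lambda pair: pair[1])
        consensus.append(base)
    return "".join(consensus)


# Three made-up reads covering the same four positions.
reads = [
    [("A", 40), ("C", 12), ("G", 35), ("T", 8)],
    [("A", 38), ("T", 45), ("G", 30), ("T", 50)],
    [("A", 20), ("T", 41), ("C", 10), ("A", 15)],
]
print(best_base_consensus(reads))  # ATGT
```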

Other sequences in GenBank come from a single sample of DNA. Sometimes the sample contains only a single type of DNA, like a sample from cloned DNA. Other times, the sample may be a mixture of different kinds of DNA. For example, if I were to pull out a piece of your hair and sequence the DNA from cells attached to the root, I would be sequencing a mixed sample, because those cells would contain DNA from two different sources, both your mother and your father. Since the sample isn't pure, the quality of the DNA sequence would be lower in places where your mother's DNA and your father's DNA are different.


What can I see in a trace file?

A trace file is either the chromatogram file that was produced by a sequencing instrument, or a compressed version of a chromatogram file that's been processed and contains only a fraction of the original data. Either way, a trace file always contains data that can be drawn as a graph, with colorful peaks that represent the intensity of fluorescence from the labeled bits of DNA that passed in front of a laser in a sequencing instrument.

If we have a trace file, we can see that the DNA sequence contained in the file looks like this:

DNA sequence


and it comes from a trace that looks like this:

DNA trace

These images were produced by a freely available trace-viewer program called FinchTV.

FinchTV helps us decide how confident we are about the accuracy of a DNA sequence, at least in a qualitative way.

[Potential bias alert!! My employer, Geospiza, developed FinchTV as part of an NIH-funded project. We give it away for free, to anyone.]


What does that colorful graph mean?

Each base (A, G, C, or T) in a sequence is identified when the sequencing instrument software analyzes the shape, position, and spacing of the peaks in the graph. "Base-calling" is the process of identifying the base with the strongest signal at a peak position.
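The "strongest signal" part of base-calling can be sketched as follows. This is a deliberately simplified illustration and not how any particular instrument's software works: it assumes the peak positions have already been found, it ignores peak shape and spacing, and the intensity numbers and function name are made up.

```python
from typing import Dict, List


def call_bases(peaks: List[Dict[str, float]]) -> str:
    """For each peak position, call the base whose channel has the strongest signal."""
    return "".join(max(signals, key=signals.get) for signals in peaks)


# Made-up fluorescence intensities for the four channels at three peak positions.
peaks = [
    {"A": 950.0, "C": 40.0, "G": 25.0, "T": 30.0},
    {"A": 20.0, "C": 35.0, "G": 870.0, "T": 15.0},
    {"A": 10.0, "C": 910.0, "G": 45.0, "T": 60.0},
]
print(call_bases(peaks))  # AGC
```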

(It's my suspicion that the person who coined the term "base-caller" was probably a baseball fan. I think he or she must have considered an imaginary umpire calling fly balls or strikes, and thought, of course the "base caller" calls bases.)


But when I look at the peaks in the image, some of them are kind of jumbled up. I'm not very confident about the A and G bases at positions 95 and 97.

Neither am I.

Most DNA sequences have portions that are clear and easy to interpret and other regions that are more difficult to interpret, if not impossible. Sometimes the base-calling software that comes with a DNA sequencing instrument will identify a really poor quality base as an "N," but usually it tries to make some kind of guess. After all, if you pick a random base, it always has a one in four chance of being right.
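One way to picture the choice between guessing and calling an "N" is a simple rule: call the strongest base only when it clearly stands out from the runner-up. The sketch below is an assumption made for illustration, not the rule used by any real base-caller; the function name, the `min_ratio` threshold, and the intensity values are all hypothetical.

```python
from typing import Dict


def call_base_or_n(signals: Dict[str, float], min_ratio: float = 2.0) -> str:
    """Call the strongest base, or 'N' if it does not clearly beat the runner-up."""
    ranked = sorted(signals.items(), key=lambda item: item[1], reverse=True)
    (best_base, best), (_, runner_up) = ranked[0], ranked[1]
    # No signal at all, or two peaks of similar height: refuse to guess.
    if best <= 0 or best < min_ratio * runner_up:
        return "N"
    return best_base


clean_peak = {"A": 900.0, "C": 30.0, "G": 50.0, "T": 20.0}
jumbled_peak = {"A": 400.0, "C": 30.0, "G": 380.0, "T": 20.0}
print(call_base_or_n(clean_peak))    # A
print(call_base_or_n(jumbled_peak))  # N
```

A stricter threshold produces more N's and fewer wrong guesses; a looser one does the opposite.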

When DNA sequencers first began appearing in university sequencing labs, researchers didn't put much faith in base-calling software. They often printed the chromatogram traces and reviewed the results of every sequencing reaction by eye (at least the reactions that produced data). If they found a base-calling "mistake," they would edit the base at the position of the mistake and change the data.


They change the data? Surely, they would keep a record of the original information from the DNA sequencing instrument?

Uh, no. Not always. In fact, many genome center labs didn't keep the original data files at all. They were too worried about the amount of disk space on their computers. And one of the most common formats for storing trace data, SCF (the Standard Chromatogram Format, which came out of the Staden group), only stores a portion of the information from a chromatogram file. If you want to write better algorithms for re-analyzing raw data, well, too bad, better find another project.


Isn't it important to keep a record of your experiment and all the steps you take?

Sometimes. Some labs and/or companies use data management systems to track the edits and maintain an audit trail. But this kind of activity isn't important everywhere. It's probably far less important in a research lab than it might be in a company.
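To show what tracking edits and keeping an audit trail could look like in practice, here is a minimal sketch. It is not a description of any real data management system; the class names, fields, and example sequence are hypothetical. The point is that the original base calls are never overwritten, and every change is recorded with who made it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Edit:
    position: int   # 0-based index into the sequence
    old_base: str
    new_base: str
    editor: str


@dataclass
class EditableSequence:
    original: str                                   # base calls as reported; never modified
    edits: List[Edit] = field(default_factory=list)

    def change_base(self, position: int, new_base: str, editor: str) -> None:
        """Record an edit instead of overwriting the original data."""
        old_base = self.current()[position]
        self.edits.append(Edit(position, old_base, new_base, editor))

    def current(self) -> str:
        """Replay the edits, in order, on top of the original base calls."""
        bases = list(self.original)
        for edit in self.edits:
            bases[edit.position] = edit.new_base
        return "".join(bases)


seq = EditableSequence("ACGNTA")
seq.change_base(3, "T", editor="reviewer")
print(seq.current())   # ACGTTA
print(seq.original)    # ACGNTA
```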


Why don't biologists trust the DNA sequencing machines?

They do, now, more and more. But there can still be questions with some kinds of DNA samples, as we'll discuss in an upcoming article. And it wasn't that long ago that I, and many of my colleagues, were reading DNA sequences by eye from X-ray films and typing them into Word files on 128k Macs or, even worse, PCs with DOS. It doesn't surprise me at all that the first group of biologists, working with first-generation DNA sequencers, distrusted computer technology and the ability of a computer program to correctly interpret data.

The researchers who looked at the trace files simply believed that they were better at interpreting the data than the software in the sequencing instrument.


And sometimes they were right.



Copyright Geospiza, Inc.


Comments

1

Are you gonna start blogging on Phred? I have a sneaking suspicion this would be an excellent way to lose readers. Of course, what do I know? I blog on the coalescent.

Posted by: RPM | July 5, 2006 8:15 PM

2

Sorry, RPM, your comment got tagged as junk and I only now realized that it was there. Of course, I will blog on Phred - at least a little bit : )

Posted by: Sandra Porter | July 22, 2006 11:35 AM

© 2006-2009 Seed Media Group LLC. ScienceBlogs is a registered trademark of Seed Media Group. All rights reserved.