Am I really related to Cleopatra? Qualitatively measuring DNA sequence quality

What do genetic testing and genealogy have in common?

The easy answer is that they're both used by people who are trying to find out who they are, in more ways than one.

Another answer is that both can involve DNA sequence data.

And that leads us to another question. If the sequence of my mitochondrial DNA is only two bases different from Cleopatra's, am I really a distant relative? And how do I really even know that my mitochondrial DNA is only two bases different in the first place? What does having a DNA sequence really mean?
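To make "two bases different" concrete, here is a minimal sketch in Python that counts the mismatches between two sequences. The sequences are made up for illustration, and the sketch assumes they are already aligned and the same length; a real comparison needs an alignment step first.

```python
def count_differences(seq_a, seq_b):
    """Count positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)


# Hypothetical sequences, not real mitochondrial data.
mine = "ACGTTGCAAC"
hers = "ACGTCGCAAT"
print(count_differences(mine, hers))
```

Notice that the count treats every base as equally trustworthy. If either of the mismatched positions was a poor quality base call, the "difference" might not be real.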

Students sequencing mitochondrial DNA
I wrote earlier about a project where students amplify mitochondrial DNA, send it to the Dolan DNA Learning Center for sequencing, search for related sequences, and then try to understand the results.

Understanding and interpreting the data, it seems, is the toughest part of the experiment, especially since some aspects of DNA sequencing never seem to get discussed. Fortunately, the Dolan DNA Learning Center added a new link on their results page called "Trace file." Now, the students who participate in this project can download their own chromatogram trace files and begin to evaluate just how good (or bad) the data really are, and which regions of the data are good or bad.

After all, how confident can you be that Cleopatra was one of your maternal ancestors, if the critical data points are really crummy?

But what is DNA sequence quality, anyway? Aren't DNA sequences just A's, C's, G's, and T's?
Is that DNA sequence, the one you might get from a textbook, a test result, or a GenBank record at the NCBI, really as definitive as it looks? How do we know for certain that base number 11 is really a C and not a T?

[Figure: qual_seq1.gif]

If you look at the DNA sequence on the right, all the bases (except for that N at position 12) look equally good.

But there can be differences. The NCBI stores different kinds of DNA sequences in GenBank and the quality of different kinds of sequences is measured in different ways. Some of the sequences in GenBank are genome sequences. Since technical limitations make it impossible to determine a genome from a single sequencing reaction, the sequence of a genome is a composite that's derived from the results of many different experiments. Genome sequences at the NCBI are usually pretty high quality, in part because the final sequence is composed of the highest quality bases, chosen from several different samples.
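One way to picture "composed of the highest quality bases" is the sketch below. It assumes each read comes with a per-base quality number (higher means more confident) and that the reads are already aligned to the same length. The reads and scores are invented for this example, and real genome assembly software does considerably more than this.

```python
def best_base_consensus(reads):
    """Pick, at each position, the base with the highest quality score.

    reads: list of (sequence, qualities) pairs, all aligned to the same
    length. The quality scores are hypothetical numbers where higher
    means more confident.
    """
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        # Collect (quality, base) from every read at this position.
        candidates = [(quals[i], seq[i]) for seq, quals in reads]
        _best_quality, best_base = max(candidates)
        consensus.append(best_base)
    return "".join(consensus)


# Three made-up reads covering the same five bases.
reads = [
    ("ACGTA", [30, 30, 10, 30, 30]),
    ("ACCTA", [30, 30, 40, 30, 5]),
    ("ACCTT", [20, 20, 35, 20, 12]),
]
print(best_base_consensus(reads))
```

At the third position, the first read says G with a low score while the other two say C with high scores, so the composite takes the C.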

Other sequences in GenBank come from a single sample of DNA. Sometimes the sample contains only a single type of DNA, like a sample from cloned DNA. Other times, the sample may be a mixture of different kinds of DNA. For example, if I were to pull out a piece of your hair and sequence the DNA from cells attached to the root, I would be sequencing a mixed sample, because those cells would contain DNA from two different sources, both your mother and your father. Since the sample isn't pure, the quality of the DNA sequence would be lower in places where your mother's DNA and your father's DNA are different.
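Here is a rough sketch of how a mixed position might show up in the data: two bases both give a strong signal at the same spot. The intensity numbers and the 0.5 cutoff are assumptions chosen for illustration, not values from any real instrument or program.

```python
def find_mixed_positions(peaks, min_ratio=0.5):
    """Return the indexes where a second base also has a strong signal.

    peaks: list of dicts mapping each base to its signal intensity at one
    peak position. min_ratio is an arbitrary cutoff chosen for this example.
    """
    mixed = []
    for i, signals in enumerate(peaks):
        ranked = sorted(signals.values(), reverse=True)
        strongest, runner_up = ranked[0], ranked[1]
        if strongest > 0 and runner_up / strongest >= min_ratio:
            mixed.append(i)
    return mixed


# Made-up intensities for three peak positions.
peaks = [
    {"A": 900, "C": 20, "G": 15, "T": 10},   # clean A
    {"A": 30, "C": 480, "G": 25, "T": 450},  # C and T overlap
    {"A": 10, "C": 15, "G": 870, "T": 40},   # clean G
]
print(find_mixed_positions(peaks))
```

Indexes here count from zero, so the overlapping C and T peak is reported as position 1.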


What can I see in a trace file?

A trace file is either the chromatogram file that was produced by a sequencing instrument, or a compressed version of a chromatogram file that's been processed and contains only a fraction of the original data. Either way, a trace file always contains data that can be drawn as a graph with colorful peaks to represent the intensity of fluorescence from the labeled bits of DNA that passed in front of a laser in a sequencing instrument.

If we have a trace file, we can see that the DNA sequence contained in the file looks like this:

[Figure: qual_seq2.gif]

and it comes from a trace that looks like this:

[Figure: qual_seq3.gif]

These images were produced by a freely-available trace-viewer program called FinchTV.

FinchTV helps us decide how confident we are about the accuracy of a DNA sequence, at least in a qualitative way.

[Potential bias alert!! My former employer, Geospiza, developed FinchTV as part of an NIH-funded project. We give it away for free, to anyone.]

What does that colorful graph mean?

Each base (A, G, C, or T) in a sequence is identified when the sequencing instrument software analyzes the shape, position, and spacing of the peaks in the graph. "Base-calling" is the process of identifying the base with the strongest signal at a peak position.
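Stripped down to the "strongest signal" part, base-calling can be sketched like this. The intensities are invented, and the sketch ignores the peak shape and spacing that real base-calling software also takes into account, so treat it as a cartoon of the idea and not as how any particular instrument works.

```python
def call_bases(peaks):
    """Call the base with the strongest signal at each peak position.

    peaks: list of dicts mapping each base to its signal intensity.
    """
    return "".join(max(signals, key=signals.get) for signals in peaks)


# Made-up intensities for four peak positions.
peaks = [
    {"A": 850, "C": 30, "G": 20, "T": 10},
    {"A": 15, "C": 40, "G": 910, "T": 25},
    {"A": 20, "C": 780, "G": 35, "T": 60},
    {"A": 10, "C": 25, "G": 30, "T": 640},
]
print(call_bases(peaks))
```

With clean peaks like these, the call is easy. The trouble starts when two signals are nearly the same height.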

(It's my suspicion that the person who coined the term "base-caller" was probably a baseball fan. I think he or she must have considered an imaginary umpire calling fly balls or strikes, and thought, of course the "base caller" calls bases.)

But when I look at the peaks in the image, some of them are kind of jumbled up. I'm not very confident about the A and G bases at positions 95 and 97.

Neither am I.

Most DNA sequences have portions that are clear and easy to interpret and other regions that are more difficult to interpret, if not impossible. Sometimes the base-calling software that comes with a DNA sequencing instrument will identify a really poor quality base as an "N," but usually it tries to make some kind of guess. After all, if you pick a base at random, you always have a one in four chance of being right.

When DNA sequencers first began appearing in university sequencing labs, researchers didn't put much faith in base-calling software. They often printed the chromatogram traces and reviewed the results of every sequencing reaction by eye (at least the reactions that produced data). If they found a base-calling "mistake" they would edit the base at the position of the mistake and change the data.

They change the data? Surely, they would keep a record of the original information from the DNA sequencing instrument?

Uh, no. Not always. In fact, many genome center labs didn't keep the original data files at all. They were too worried about the amount of disk space on their computers. One of the most common formats for storing trace data, SCF (the Standard Chromatogram Format, from the Staden group), only stores a portion of the information from a chromatogram file. If you want to write better algorithms for re-analyzing raw data, well, too bad, better find another project.

Isn't it important to keep a record of your experiment and all the steps you take? Sometimes. Some labs and/or companies use data management systems to track the edits and maintain an audit trail. But this kind of activity isn't important everywhere. It's probably far less important in a research lab than it might be in a company.
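For the curious, an audit trail doesn't have to be complicated. The sketch below keeps the original base calls untouched and records each edit alongside them. The class name, fields, and editor initials are all hypothetical; this is not how any particular data management system works.

```python
from dataclasses import dataclass, field


@dataclass
class EditedSequence:
    """Keeps the original base calls and a log of every manual edit."""
    original: str
    edits: list = field(default_factory=list)

    def edit(self, position, new_base, editor):
        """Record an edit (zero-based position) without touching the original."""
        old_base = self.current()[position]
        self.edits.append((position, old_base, new_base, editor))

    def current(self):
        """Rebuild the edited sequence by replaying the log in order."""
        bases = list(self.original)
        for position, _old, new_base, _editor in self.edits:
            bases[position] = new_base
        return "".join(bases)


seq = EditedSequence("ACGNT")
seq.edit(3, "A", editor="sp")  # someone decides the N is really an A
print(seq.current())
print(seq.original)
print(seq.edits)
```

Because the edits are stored separately, anyone can see what the instrument originally called, what was changed, and who changed it.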

Why don't biologists trust the DNA sequencing machines?

They do, now, more and more. But there can still be questions with some kinds of DNA samples, as we'll discuss in an upcoming article. And it wasn't that long ago that I and many of my colleagues were reading DNA sequences by eye from X-ray films and typing them into Word files on 128k Macs or, even worse, PCs with DOS. It doesn't surprise me at all that the first group of biologists, working with first-generation DNA sequencers, distrusted computer technology and the ability of a computer program to correctly interpret data.

The researchers who looked at the trace files simply believed that they were better at interpreting the data than the software in the sequencing instrument.

And sometimes they were right.


Copyright Geospiza, Inc.


Are you gonna start blogging on Phred? I have a sneaking suspicion this would be an excellent way to lose readers. Of course, what do I know? I blog on the coalescent.

Sorry, RPM, your comment got tagged as junk and I only now realized that it was there. Of course, I will blog on Phred - at least a little bit : )

By Sandra Porter (not verified) on 22 Jul 2006