Quantitative measures of DNA sequence quality

By sporte on July 10, 2006.

How did the human genome ever get finished if every one of the three billion bases had to be reviewed by human eyes?
In the early days of the human genome project, laboratory personnel routinely scanned printed copies of chromatograms, editing and reviewing all DNA sequences by eye. For more background, see the post on qualitative measures of DNA quality.

Later on, when genome sequencing turned into a race and the pace of DNA sequencing began to increase, some genome centers realized that it was too expensive and time-consuming to have Ph.D. scientists, or even technicians, review all the printed chromatograms by eye and manually edit files.

Editing sequence files is still a common practice in some labs, but this usually depends on the volume of sequencing that the lab carries out.

It became clear that better methods were needed.

Who is this "phred" and what is his formula?

One of the first and most popular programs for assessing sequence quality was, and still is, "phred." Phred (short for "Phil's Read Editor") was written by Brent Ewing and Phil Green at the University of Washington (1-3). After a chromatogram file has been processed by the software in a sequencing instrument, it can be evaluated by Phred. Phred uses information about the shape of a peak, the spacing between peaks, and the height of a peak to calculate a quality score for every base in a DNA sequence. The quality score is obtained by taking the base-10 logarithm of the probability that the base call is an error and multiplying it by negative ten.

The formula for a Phred score is this: Q = -10 log₁₀ P(error)

So, for example, if there is a 1 in 10 chance of an error (P = 0.10), the Phred quality score (usually just called a "Phred score") would be 10. A 1 in 100 chance of error would give a quality score of 20, a 1 in 1000 chance would give 30, and so on.
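The arithmetic above is easy to check with a few lines of Python. This is only a minimal sketch of the formula itself, not of how Phred estimates the error probability from the peaks; the function name `phred_score` is made up for illustration.

```python
import math

def phred_score(p_error: float) -> float:
    """Return Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)

# The three examples from the text: 1 in 10, 1 in 100, 1 in 1000
for p in (0.1, 0.01, 0.001):
    print(f"P(error) = {p}: Q = {phred_score(p):.0f}")
```

Each tenfold drop in the chance of an error adds 10 to the score, which is why the scale is convenient to read at a glance.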

There are other programs for determining quality values, too. Newer DNA sequencing instruments from ABI even come equipped with basecalling software, like the KB basecaller, that assigns quality values as it calls the bases.

Let's see some Phred scores
I was lazy today and obtained Phred values for a sequence file by uploading my chromatogram file to a Finch Server (www.geospiza.com). (The Finch Server can run Phred automatically when sequences are uploaded.) (Licenses for Phred can be obtained from the UW, and I can run it on my computer, since it has UNIX, but like I said, I'm lazy.)

If I download the Phred-scored file from the Finch Server and look at the file in FinchTV, I can see that suspicious bases at positions 99 and 100 also have low quality values. The blue line indicates where the quality score equals 20. Simply put, above the line is good, below the line is not good.

FinchTV tells us that the quality values are 13 (shown on the right) and 10. A quality value of 10 means a 1 in 10 chance that the base-calling software made a mistake, and a value of 13 means about a 1 in 20 chance. The data for these two bases still aren't very good, but now I know just how bad they are.
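To put numbers on those two quality values, the Phred formula can be turned around to give P(error) = 10^(-Q/10). Here is a small sketch in Python; the function name `error_probability` is just an illustrative choice.

```python
def error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to recover P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# The two quality values reported by FinchTV for the suspicious bases
for q in (13, 10):
    p = error_probability(q)
    print(f"Q = {q}: P(error) = {p:.3f}, about 1 in {1 / p:.0f}")
```

A score of 13 works out to an error probability of about 0.05, and a score of 10 to 0.1.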

Can Phred improve my data or at least tell me more about bad data?
No. It's a computer program, not a miracle worker. Even Phred can't turn bad data into good data, but it can tell us which parts of the sequence are good and which parts are not. We can identify regions and bases that are questionable.

We can see more about this below, and see how the quality varies throughout the sequence, in a Finch Server quality graph for this chromatogram file. The middle of the sequence looks pretty good, but there are regions with lower quality sequences on each end.
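Picking out the questionable bases can be sketched in a few lines, assuming the quality scores are already available as a list of numbers. The cutoff of 20 comes from the blue line in FinchTV; the function name and the example scores are made up for illustration and are not taken from my chromatogram file.

```python
Q_THRESHOLD = 20  # same cutoff as the blue line in FinchTV

def low_quality_positions(scores, threshold=Q_THRESHOLD):
    """Return the 1-based positions whose quality score is below the threshold."""
    return [pos for pos, q in enumerate(scores, start=1) if q < threshold]

# Hypothetical scores for a very short read: poor at both ends, good in the middle
example_scores = [8, 12, 25, 38, 45, 51, 47, 33, 13, 10]
print(low_quality_positions(example_scores))
```

For these made-up scores, the flagged positions fall at the two ends of the read, the same pattern the quality graph shows for the real sequence.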

Overall, as more and more people have started using DNA sequencing to learn about their ancestry and ask about their genetic likelihood to develop disease, it's becoming more and more important to know just how good the data are.

I will write more later about how we use quality information, but for now, the next time you look at a DNA sequence, like AAGATAGATAGAT, ask yourself: Which parts of that sequence can we be confident about? And how confident can we be?

References
1. Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green. 1998. "Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment." Genome Res. 8: 175-185.

2. Brent Ewing and Phil Green. 1998. "Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities." Genome Res. 8: 186-194.

3. Peter Richterich. 1998. "Estimation of Errors in 'Raw' DNA Sequences: A Validation Study." Genome Res. 8: 251-259.



may you tell me what's the meaning of base spacing and lane options that appear in top of the chromatogram print page?
bests

Shohreh: I'm not sure what program you're referring to. Did you mean FinchTV or something else?
