Discovering Biology in a Digital World

My thoughts on biology, teaching, life, and exploring the living world via the digital one. Only my opinions are represented by these postings, they do not represent the viewpoints of any funding agency or Geospiza, Inc.

Profile

Sandra Porter. I am a microbiologist and molecular biologist turned tenured biotech faculty turned bioinformatics scientist turned entrepreneur. My passion is developing instructional materials for 21st century biology (Digital World Biology).

Is phred dead? Let's see the data

Category: Bioinformatics, Biotechnology, sequence analysis
Posted on: August 21, 2007 8:25 AM, by Sandra Porter

If you've read the previous posts on this topic, here and here, you're probably aware by now that I have this weird (okay, maybe fanatical) obsession with data. Or at least, with knowing if my data are right so I can get on with life, do the analysis and figure out the results.

My results from last week suggested that re-processing chromatogram data (from the ABI 3730) with phred was probably a bad idea, but still, I only had one data point and I really wanted to know if anyone had done a more thorough study and compared larger numbers of chromatograms.

Naturally, someone had.

And of course it was ABI. And, the results aren't even new (except to me, I guess).

ABI and their collaborators at the Washington University and Baylor College of Medicine genome centers presented this work in a poster at the Advances in Genome Biology and Technology (AGBT) meeting in 2004 at Marco Island (1).

They looked at basecalling performance with data from 20,000 chromatograms and concluded that:

  1. KB produced fewer errors.

  2. KB was able to call more bases, which resulted in longer reads.

It certainly puts my quick conclusion from one chromatogram to shame. Oh, why oh why don't I ever read those user bulletins?

Never mind that. ABI kindly gave me permission to post some of their data (2):

[Figure 1: compare_callers.gif]

These box-and-whisker plots show results for three sets of chromatograms: those basecalled with the KB basecaller (top, in blue); those from ABI instruments, basecalled without KB, and then re-processed by phred (middle, in red); and those processed first with KB and then with phred (bottom, in green). The last method is the one I used the other day with my one chromatogram.

In each case, they compared the read sequences with a reference sequence to determine the error rate.

(What is a read? A read is a DNA sequence that's been obtained from a chromatogram file. The chromatogram file has lots of extra information like the kind of matrix, the run time, the name of the base calling program, the peak heights, etc. A read sequence only contains the sequence of bases: ATAGAGCTCATCGATCATCTACGTA... etc.)

We can evaluate reads in a few ways.

  • We can look at the number of high quality bases (Q20, Q30, Q40).
  • We can look at the length of the read after trimming off the bad stuff.
  • And, we can compare the read to a known sequence and count the number of differences.
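The first of these measures can be sketched in a few lines of Python. This is only an illustration, not ABI's or phred's code: the function names and the quality values are made up for the example. It relies on the standard phred scale, where a quality value Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base call is wrong.

```python
def error_probability(q):
    """Convert a phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def count_high_quality(quals, thresholds=(20, 30, 40)):
    """Return {threshold: number of bases with quality >= threshold}."""
    return {t: sum(1 for q in quals if q >= t) for t in thresholds}


# Made-up quality values for a ten-base read (illustration only).
quals = [8, 12, 25, 33, 41, 45, 38, 22, 15, 9]
counts = count_high_quality(quals)

for threshold, n in counts.items():
    print(f"Q{threshold}: {n} bases (error probability <= {error_probability(threshold):g})")
```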

Part A in the figure shows the length of the read sequence after trimming the poor-quality bases (less than Q20) from both the 5' and 3' ends. In each case, it appears that the KB basecaller gave longer reads. In this figure, the mean values look to be around 650, 775, and 950 bases for reads from short, medium, and long runs.
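To make the idea of trimming concrete, here is a minimal sketch. I don't know the exact trimming rule used for the poster, so this is an assumption: it simply drops bases from each end until it hits the first base with a quality of at least Q20. Real trimming tools often use a sliding window instead, and the function name and sample data here are hypothetical.

```python
def trim_ends(bases, quals, min_q=20):
    """Drop bases from the 5' and 3' ends until a base with quality >= min_q.

    Assumption: simple end trimming. Low-quality bases in the middle
    of the read are left in place.
    """
    start = 0
    while start < len(quals) and quals[start] < min_q:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


# Made-up read and quality values (illustration only).
bases = "ATAGAGCTCA"
quals = [8, 12, 25, 33, 41, 45, 38, 22, 15, 9]
trimmed_bases, trimmed_quals = trim_ends(bases, quals)
print(trimmed_bases, len(trimmed_bases))
```

The trimmed read length is the number plotted in Part A, one value per chromatogram.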

Part B shows the error rates. For the rapid runs (top), it looks like phred has a slightly lower mean error rate when it's used to re-process KB-called data. KB and re-processed KB data appear to be tied for the medium length runs and KB wins with the long runs.
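An error rate like the one in Part B comes from comparing each read to the reference and counting differences. The sketch below is one simple way to do that, not necessarily the method ABI used: it counts substitutions, insertions, and deletions with a plain edit distance and divides by the reference length. It assumes the reference has already been cut down to the region the read covers; a real pipeline would align the read to a longer reference first. The names and sequences are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(read, reference):
    """Differences per reference base (assumes reference covers only the read region)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(read, reference) / len(reference)


# Made-up example: one substitution and one deleted base in a 20-base reference.
reference = "ATAGAGCTCATCGATCATCT"
read = "ATAGTGCTCATCATCATCT"
rate = error_rate(read, reference)
print(f"{rate:.2%}")
```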

To quote ABI:

...since phred replaces (and ignores) the initial called sequence, re-processing KB-analyzed samples with phred will, on average, degrade the accuracy of the analysis in terms of actual sequence error. Analysis improvements provided by KB algorithm outlined above will be essentially lost.

There you have it, the end of this read and this sequence of posts at the same time. Time to move on to the next generation.

References:
1. Gehman, C. et al. 2004. "Longer Reads with the KB Basecaller." AGBT 2004.
2. Applied Biosystems User Bulletin, FAQ KB Basecaller v1.2.

Comments

1

Figure 1 looks like it would be a nightmare for a colorblind reader.

Posted by: RPM | August 21, 2007 10:24 AM

2

I agree.

Posted by: Sandra Porter | August 21, 2007 10:55 AM

3

Thanks for the update Sandra. But shouldn't there also be a comparison of phred analysis of raw data compared to KB?
regards,
TJK

Posted by: Thomas Keller | August 22, 2007 12:23 PM

4

Hi Sandra

It is good to see someone looking critically at DNA basecallers. KB is certainly a better basecaller than phred; however, there are better basecallers out there than KB. At risk of tooting my own horn, our company sells a couple of basecallers (LongTrace and PeakTrace) that are better basecallers than either phred or KB. We have free versions of the software on our website which you can try with your own traces - the links are below.

http://www.nucleics.com/peaktrace-sequencing/
http://www.nucleics.com/longtrace-sequencing/

Cheers

Daniel

Posted by: Daniel Tillett | August 22, 2007 7:24 PM

5

Hi Thomas,

You wrote: "shouldn't there also be a comparison of phred analysis of raw data compared to KB?"

Unfortunately, this can't be done. Phred can only work with data that have been previously processed by a sequencing instrument. The closest you can get to doing the experiment that you described is having phred work with data that have been processed on ABI instruments with base callers other than KB.

Posted by: Sandra Porter | August 22, 2007 11:22 PM

6

Thanks Daniel,

I'll take a look.

Posted by: Sandra Porter | August 22, 2007 11:24 PM
