Discovering Biology in a Digital World

My thoughts on biology, teaching, life, and exploring the living world via the digital one. Only my opinions are represented by these postings; they do not represent the viewpoints of any funding agency or Geospiza, Inc.

Profile

Sandra Porter: I am a digital biologist, teacher, and entrepreneur. My passion is developing instructional materials for 21st century biology (Digital World Biology).

Is phred dead? Let's see the data

Category: Bioinformatics, Biotechnology, sequence analysis
Posted on: August 21, 2007 8:25 AM, by Sandra Porter

If you've read the previous posts on this topic, here and here, you're probably aware by now that I have this weird (okay, maybe fanatical) obsession with data. Or at least, with knowing if my data are right so I can get on with life, do the analysis and figure out the results.

My results from last week suggested that re-processing chromatogram data (from the ABI 3730) with phred was probably a bad idea, but still, I only had one data point and I really wanted to know if anyone had done a more thorough study and compared larger numbers of chromatograms.

Naturally, someone had.

And of course it was ABI. And, the results aren't even new (except to me, I guess).

ABI and their collaborators at the Washington University and Baylor College of Medicine genome centers presented this work in a poster at the Advances in Genome Biology and Technology (AGBT) meeting in 2004 at Marco Island (1).

They looked at basecalling performance with data from 20,000 chromatograms and concluded that:

  1. KB produced fewer errors.

  2. KB was able to call more bases, which resulted in longer reads.

It certainly puts my quick conclusion from one chromatogram to shame. Oh, why oh why don't I ever read those user bulletins?

Never mind that. ABI kindly gave me permission to post some of their data (2):

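[Figure: compare_callers.gif. Box and whisker plots comparing basecallers: read length (Part A) and error rate (Part B).]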

These box and whisker plots show results for three sets of chromatograms: those basecalled with the KB basecaller (on top, in blue); those from ABI instruments (without KB) that were re-processed by phred (in the middle, in red); and those that were first processed with KB, and then with phred (green, on the bottom). That last method is the one I used the other day with my one chromatogram.

In each case, they compared the read sequences that were obtained with a reference sequence in order to determine the error rate.

(What is a read? A read is a DNA sequence that's been obtained from a chromatogram file. The chromatogram file has lots of extra information like the kind of matrix, the run time, the name of the base calling program, the peak heights, etc. A read sequence only contains the sequence of bases: ATAGAGCTCATCGATCATCTACGTA.... etc. )

We can evaluate reads in a few ways.

  • We can look at the number of high quality bases (Q20, Q30, Q40).
  • We can look at the length of the read after trimming off the bad stuff.
  • And, we can compare the read to a known sequence and count the number of differences.
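Here's a minimal sketch of those three checks in Python. The function names and the toy read, reference, and quality values are all made up for illustration; they don't come from the ABI study. The mismatch count also assumes the read and reference are already aligned and the same length, which real reads usually aren't (you'd normally run an alignment first).

```python
def error_probability(q):
    """Convert a phred-scaled quality value into an estimated error probability.

    Q = -10 * log10(P), so Q20 means roughly a 1 in 100 chance the base is wrong.
    """
    return 10 ** (-q / 10)


def count_high_quality(quals, threshold):
    """Count the bases whose quality value is at or above the threshold."""
    return sum(1 for q in quals if q >= threshold)


def count_mismatches(read, reference):
    """Count position-by-position differences between a read and a reference.

    Assumption: both sequences are pre-aligned and equal in length.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must be the same length")
    return sum(1 for a, b in zip(read, reference) if a != b)


# Toy data, invented for this example.
read = "ATAGAGCTCA"
reference = "ATAGAGCTGA"
quals = [8, 15, 22, 35, 41, 44, 38, 30, 19, 9]

for threshold in (20, 30, 40):
    print(f"Q{threshold} bases:", count_high_quality(quals, threshold))
print("Mismatches vs. reference:", count_mismatches(read, reference))
```
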

Part A in the figure shows the length of the read sequence after trimming the poor-quality bases (less than Q20) from both the 5' and 3' ends. In each case, it appears that the KB basecaller gave longer reads. In this figure, it looks like the mean values were around 650, 775, and 950 bases for reads from short, medium, and long runs.
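To make the trimming idea concrete, here's a rough sketch that strips bases below Q20 from each end and reports what's left. I don't know the exact trimming rule that was used for the poster, so treat this as the simplest possible version: the `trim_ends` name and the toy quality values are invented for illustration, and real trimming tools often use sliding windows instead of single bases.

```python
def trim_ends(quals, threshold=20):
    """Return (start, end) so that quals[start:end] has no low-quality ends.

    Bases are dropped from the 5' end, then the 3' end, until the first base
    at or above the threshold. Low-quality bases in the middle are kept.
    """
    start = 0
    end = len(quals)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return start, end


# Toy data, invented for this example.
read = "ATAGAGCTCA"
quals = [8, 15, 22, 35, 12, 44, 38, 30, 19, 9]

start, end = trim_ends(quals)
trimmed = read[start:end]
print("Trimmed read:", trimmed)
print("Trimmed length:", len(trimmed))
```

Notice that the low-quality base in the middle of the read stays put. This version only cleans up the ends, which is all that the read-length comparison in Part A requires.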

Part B shows the error rates. For the rapid runs (top), it looks like phred has a slightly lower mean error rate when it's used to re-process KB-called data. KB and re-processed KB data appear to be tied for the medium length runs and KB wins with the long runs.

To quote ABI:

...since phred replaces (and ignores) the initial called sequence, re-processing KB-analyzed samples with phred will, on average, degrade the accuracy of the analysis in terms of actual sequence error. Analysis improvements provided by KB algorithm outlined above will be essentially lost.

There you have it, the end of this read and this sequence of posts at the same time. Time to move on to the next generation.

Reference:
1. Gehman, C. et al. 2004. "Longer Reads with the KB Basecaller." AGBT 2004.
2. Applied Biosystems User Bulletin, FAQ KB Basecaller v1.2.


Comments

1

Figure 1 looks like it would be a nightmare for a colorblind reader.

Posted by: RPM | August 21, 2007 10:24 AM

2

I agree.

Posted by: Sandra Porter | August 21, 2007 10:55 AM

3

Thanks for the update Sandra. But shouldn't there also be a comparison of phred analysis of raw data compared to KB?
regards,
TJK

Posted by: Thomas Keller | August 22, 2007 12:23 PM

4

Hi Sandra

It is good to see someone looking critically at DNA basecallers. KB is certainly a better basecaller than phred; however, there are better basecallers out there than KB. At risk of tooting my own horn, our company sells a couple of basecallers (LongTrace and PeakTrace) that are better basecallers than either phred or KB. We have free versions of the software on our website which you can try with your own traces - the links are below.

http://www.nucleics.com/peaktrace-sequencing/
http://www.nucleics.com/longtrace-sequencing/

Cheers

Daniel

Posted by: Daniel Tillett | August 22, 2007 7:24 PM

5

Hi Thomas,

You wrote: "shouldn't there also be a comparison of phred analysis of raw data compared to KB?"

Unfortunately, this can't be done. Phred can only work with data that have been previously processed by a sequencing instrument. The closest you can get to doing the experiment that you described is having phred work with data that have been processed on ABI instruments with base callers other than KB.

Posted by: Sandra Porter | August 22, 2007 11:22 PM

6

Thanks Daniel,

I'll take a look.

Posted by: Sandra Porter | August 22, 2007 11:24 PM

7

Hi Daniel,

I am wondering what the difference is between LongTrace and PeakTrace. I have been using LongTrace for quite some time, and tried out the trial version of PeakTrace as well. For sure, both give improved results, yet I have learned that LongTrace gives better results than PeakTrace. What do you think about this?

Cheers

Nic

Posted by: Nic | March 18, 2010 2:17 AM

8

Hi Nic

The basic difference is PeakTrace is a basecaller and trace processor combined, while LongTrace is just a trace pre-processor for the KB basecaller. PeakTrace is better than LongTrace in my opinion, but with some trace types KB does better. This is the reason why we still offer both versions.

Cheers

Daniel

Posted by: Daniel Tillett | September 2, 2010 12:45 AM

© 2006-2011 ScienceBlogs LLC. ScienceBlogs is a registered trademark of ScienceBlogs LLC. All rights reserved.