If you've read the previous posts on this topic, here and here, you're probably aware by now that I have this weird (okay, maybe fanatical) obsession with data. Or at least, with knowing if my data are right so I can get on with life, do the analysis and figure out the results.
My results from last week suggested that re-processing chromatogram data (from the ABI 3730) with phred was probably a bad idea, but still, I only had one data point and I really wanted to know if anyone had done a more thorough study and compared larger numbers of chromatograms.
Naturally, someone had.
And of course it was ABI. And the results aren't even new (except to me, I guess).
ABI and their collaborators at the Washington University and Baylor College of Medicine genome centers presented this work in a poster at the Advances in Genome Biology and Technology (AGBT) meeting in 2004 at Marco Island (1).
They looked at basecalling performance with data from 20,000 chromatograms and concluded that:
1. KB produced fewer errors.
2. KB was able to call more bases, which resulted in longer reads.
It certainly puts my quick conclusion from one chromatogram to shame. Oh, why oh why don't I ever read those user bulletins?
Never mind that. ABI kindly gave me permission to post some of their data (2):
These box-and-whisker plots show results for three sets of chromatograms: those basecalled with the KB basecaller (top, in blue); those from ABI instruments without KB that were re-processed by phred (middle, in red); and those that were first processed with KB and then with phred (bottom, in green). That last method is the one I used the other day with my one chromatogram.
In each case, they compared the read sequences that were obtained with a reference sequence in order to determine the error rate.
(What is a read? A read is a DNA sequence that's been obtained from a chromatogram file. The chromatogram file has lots of extra information like the kind of matrix, the run time, the name of the base calling program, the peak heights, etc. A read sequence only contains the sequence of bases: ATAGAGCTCATCGATCATCTACGTA.... etc. )
We can evaluate reads in a few ways.
- We can look at the number of high quality bases (Q20, Q30, Q40).
- We can look at the length of the read after trimming off the bad stuff.
- And, we can compare the read to a known sequence and count the number of differences.
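As a rough illustration of the first of these measures, here is a minimal Python sketch that counts the bases at or above Q20, Q30, and Q40. It assumes the quality values are already available as a plain list of integers, one per base; the function names and the example values are made up for illustration and are not from ABI's study. The conversion uses the standard phred definition, where a quality of Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base is wrong.

```python
def count_high_quality(quals, thresholds=(20, 30, 40)):
    """Return {threshold: number of bases with quality >= threshold}."""
    return {t: sum(1 for q in quals if q >= t) for t in thresholds}


def error_probability(q):
    """Convert a phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


# Hypothetical quality values for a ten-base read (illustration only).
quals = [8, 12, 25, 33, 41, 45, 38, 22, 15, 9]

print(count_high_quality(quals))   # {20: 6, 30: 4, 40: 2}
print(error_probability(20))       # 1 error in 100 base calls
```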
Part A in the figure shows the length of the read sequence after trimming the poor-quality bases (less than Q20) from both the 5' and 3' ends. In each case, it appears that the KB base caller gave longer reads. In this figure, it looks like the mean values were around 650, 775, and 950 bases for reads from short, medium, and long runs, respectively.
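To make the trimming idea concrete, here is a small Python sketch. It is a deliberate simplification and not necessarily what ABI did: it just strips bases from each end until it reaches the first base at or above Q20, whereas real trimming tools usually look at a window of bases. The function name and the example read are hypothetical.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases below min_q from the 5' and 3' ends of a read.

    Simplified: stops at the first base on each end that meets min_q,
    so low-quality bases in the interior of the read are kept.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must be the same length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


# Hypothetical read and quality values (illustration only).
seq = "ACGTACGTAC"
quals = [5, 10, 30, 40, 15, 35, 22, 9, 4, 3]

trimmed_seq, trimmed_quals = trim_low_quality_ends(seq, quals)
print(trimmed_seq, len(trimmed_seq))   # GTACG 5
```

The trimmed read length is the number plotted in a figure like Part A; note that the Q15 base in the middle of this example survives because only the ends are trimmed.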
Part B shows the error rates. For the rapid runs (top), it looks like phred has a slightly lower mean error rate when it's used to re-process KB-called data. KB and re-processed KB data appear to be tied for the medium-length runs, and KB wins with the long runs.
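One way to turn "compare the read to a reference" into a number is to count the substitutions, insertions, and deletions needed to turn the read into the reference, then divide by the read length. The Python sketch below does that with a plain edit-distance calculation. It assumes the reference has already been cut down to the region the read covers, and I don't know whether the poster's authors computed their error rates this way, so treat it as an illustration only; the function names and sequences are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a base from the read
                            curr[j - 1] + 1,      # insert a base into the read
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def error_rate(read, reference):
    """Differences from the reference per base of the read."""
    if not read:
        raise ValueError("read is empty")
    return edit_distance(read, reference) / len(read)


# Hypothetical sequences (illustration only): one miscalled base out of ten.
reference = "ATAGAGCTCA"
read = "ATAGTGCTCA"

print(edit_distance(read, reference))  # 1
print(error_rate(read, reference))     # 0.1
```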
To quote ABI:
...since phred replaces (and ignores) the initial called sequence, re-processing KB-analyzed samples with phred will, on average, degrade the accuracy of the analysis in terms of actual sequence error. Analysis improvements provided by KB algorithm outlined above will be essentially lost.
There you have it, the end of this read and this sequence of posts at the same time. Time to move on to the next generation.
Figure 1 looks like it would be a nightmare for a colorblind reader.
Thanks for the update Sandra. But shouldn't there also be a comparison of phred analysis of raw data compared to KB?
It is good to see someone looking critically at DNA basecallers. KB is certainly a better basecaller than phred; however, there are better basecallers out there than KB. At the risk of tooting my own horn, our company sells a couple of basecallers (LongTrace and PeakTrace) that are better than either phred or KB. We have free versions of the software on our website, which you can try with your own traces - the links are below.
You wrote: "shouldn't there also be a comparison of phred analysis of raw data compared to KB?"
Unfortunately, this can't be done. Phred can only work with data that have been previously processed by a sequencing instrument. The closest you can get to doing the experiment that you described is having phred work with data that have been processed on ABI instruments with base callers other than KB.
I'll take a look.
I am wondering what the difference is between LongTrace and PeakTrace. I have been using LongTrace for quite some time, and have tried out the trial version of PeakTrace as well. For sure, both give improved results, yet I find that LongTrace gives better results than PeakTrace. What do you think about this?
The basic difference is that PeakTrace is a basecaller and trace processor combined, while LongTrace is just a trace pre-processor for the KB basecaller. PeakTrace is better than LongTrace in my opinion, but with some trace types KB does better. This is the reason why we still offer both versions.