Sometimes asking a question can be a mistake.
Especially when your question leads to more questions and having to question things that you didn’t want to question, and pretty soon you begin to regret ever opening the file and looking at the data and asking the question in the first place.
Sigh. Take a deep breath.
Yesterday, through a twist of fate, I ended up taking a look at the DNA sequences that two different base calling programs produced from the same chromatogram file, which came from an ABI 3730 DNA sequencing instrument. I thought they would be the same, or at least similar.
One of the base-calling programs, phred, is well-established in the DNA sequencing community. Phred was used for base-calling much of the original human genome sequence. However, it’s been about 6 years since phred was last updated and it hasn’t been evaluated thoroughly with newer sequencing instruments and chemistries.
In recent years, ABI developed a new base calling program, called “KB,” that they sell with their DNA sequencing instruments.
I really don’t know which program is better or which program people prefer. I only know that the ABI program probably gets updated for new instruments and that it’s been a long time since phred has been changed.
So, it came as a complete surprise to me to find that the quality data (and the DNA sequences) produced by the two programs, from the same chromatogram, could be so different.
A call for help
Luckily, there are many intelligent and thoughtful people who read this blog, and also many intelligent people who belong to the ABRF (Association of Biomolecular Resource Facilities, i.e. core labs). So I called on both my readers and the ABRF listserv members for help. I got it. The commenters gave me some great advice and I took it.
A little background
I’m not going to describe everything about base-calling programs here. If you want to read more about quality values and what they mean, you can find a general description here. The point that I want to stress right now is where the base calling programs work.
The KB base calling program works inside the DNA sequencing instrument to identify peaks and assign quality values. This is kind of a pain (I think) because you can’t take chromatograms from other instruments and process them with KB to measure quality. Unfortunately, this kind of original chromatogram data can be hard to obtain. NCBI doesn’t provide this kind of data in the trace archive and journals don’t require that this kind of data be made available anywhere.
Phred, however, works with data after it’s been stored in a chromatogram file by the sequencing instrument. This is nice, because you can base-call all your data with the same program and get comparable quality values. At least that’s what I thought.
[The quality data and "traces" that you get from 454 flowgrams are a completely different thing and a subject for another day.]
What I learned by following advice and looking at my data:
1. First, as my commenter suggested, there are dye blobs at the beginning of the sequence. (Those are the very tall peaks at the beginning of the graph.)
2. Second, the overall signal strength might be a bit low – I had to increase the vertical height in FinchTV to see the peaks appear more clearly. The signal strengths were: A = 231, C = 194, G = 201, and T = 121.
Still, both KB and phred were interpreting the same trace, although differently. If I compare the trace from KB and from phred, I can line up the peaks and see that the two plots are basically the same.
3. Third, phred appears to be making a lot of mistakes. For some reason, phred missed many of the A’s and G’s. I went through and counted all the base calls in this stretch that disagreed with the trace (light blue, below), and I found that 44 out of 75 (or 59%) of the phred base calls disagreed with the peak shown in the trace. The color of the peak should correspond to the base call (Green = A, Blue = C, Red = T, Black = G).
In contrast, the bases that are called by the KB basecaller are consistent with the peaks that I see in the trace. Wherever there’s a black peak, for instance, I see a “g,” wherever there’s a green peak, I see an “a.” This is how it should work.
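The check described above, comparing each called base against the base implied by the color of the peak, can be sketched in a few lines of Python. This is only an illustration: the function name and the short example strings are made up, and they are not my actual data or anything that FinchTV, phred, or KB provides.

```python
def count_disagreements(called_bases, peak_bases):
    """Count positions where the called base differs from the peak's base.

    Both arguments are strings of equal length, one character per position.
    Returns (number of mismatches, fraction of positions that mismatch).
    """
    if len(called_bases) != len(peak_bases):
        raise ValueError("sequences must be the same length")
    if not called_bases:
        raise ValueError("sequences must not be empty")
    mismatches = sum(
        1
        for called, peak in zip(called_bases.upper(), peak_bases.upper())
        if called != peak
    )
    return mismatches, mismatches / len(called_bases)


# Hypothetical 10-base stretch, invented for illustration only.
peaks = "GATTACAGGA"  # base suggested by the tallest peak at each position
calls = "GCTTNCAGTA"  # base reported by the base caller
mismatches, rate = count_disagreements(calls, peaks)
print(f"{mismatches} of {len(peaks)} calls disagree ({rate:.0%})")
```

Applied to the counts I got by hand, 44 disagreements in 75 positions works out to about 59%.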
One thing is good, though: at least phred knew that it wasn’t doing a good job of identifying the bases. Phred gave all these bases low quality scores. (A low quality score indicates a higher probability of a mistake.) You can see in the graph that all the quality scores are below the blue line. (The blue line at the top marks the Q20 point, where only 1 base in 100 would be expected to be miscalled.)
So, even if phred failed to identify the bases, phred was correct in assessing its ability to identify them. It said there was a higher probability of miscalling bases and it was right.
(Sorry about anthropomorphizing the algorithm, of course the program doesn’t know that it’s making a mistake. It’s just easier for me to describe it that way.)
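For anyone who wants the arithmetic behind the Q20 line: phred-style quality values are defined as Q = -10 × log10(P), where P is the estimated probability that the base call is wrong. Here is a minimal sketch of that conversion. The function names and the list of example scores are hypothetical and are not taken from my trace.

```python
import math


def error_probability(q):
    """Estimated probability of a miscall for a phred-style quality value."""
    return 10 ** (-q / 10)


def quality_score(p):
    """Phred-style quality value for an estimated error probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the range (0, 1]")
    return -10 * math.log10(p)


def fraction_below(qualities, threshold=20):
    """Fraction of bases whose quality value falls below the threshold."""
    if not qualities:
        raise ValueError("need at least one quality value")
    return sum(1 for q in qualities if q < threshold) / len(qualities)


# Hypothetical quality values, invented for illustration only.
example_qualities = [8, 12, 9, 15, 22, 7]

print(error_probability(20))              # chance of a miscall at Q20
print(error_probability(10))              # chance of a miscall at Q10
print(fraction_below(example_qualities))  # share of bases under Q20
```

So a base at Q20 has an estimated 1 in 100 chance of being wrong, a base at Q10 has a 1 in 10 chance, and every base below the blue line is one that the base caller itself considers more likely than 1 in 100 to be a miscall.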
What’s the take home lesson?
1. I’m certainly grateful that there are smart people who read this blog!
2. I really, really, really wish I could get original chromatogram data from the NCBI trace archive, so that I could either look at the original base calls or mess with the base callers and work out a good way to process it myself.
3. In cases where I have data from a 3730 DNA sequencing instrument, I might be more inclined to use the KB base calls. At least until I look at some more data and have a better feeling for it.
4. And – yes, when in doubt, I will always take a closer look at the traces.