Sometimes asking a question can be a mistake.
Especially when your question leads to more questions and having to question things that you didn't want to question, and pretty soon you begin to regret ever opening the file and looking at the data and asking the question in the first place.
Sigh. Take a deep breath.
Yesterday through a twist of fate, I ended up taking a look at the DNA sequences produced by two different base calling programs from the same chromatogram file, from an ABI 3730 DNA sequencing instrument. I thought they would be the same, or at least similar.
One of the base-calling programs, phred, is well-established in the DNA sequencing community. Phred was used for base-calling much of the original human genome sequence. However, it's been about 6 years since phred was last updated and it hasn't been evaluated thoroughly with newer sequencing instruments and chemistries.
In recent years, ABI developed a new base calling program, called "KB" that they sell with their DNA sequencing instruments.
I really don't know which program is better or which program people prefer, I only know that the ABI program probably gets updated for new instruments and that it's been a long time since phred has been changed.
So, it came as a complete surprise to me to find that the quality data (and the DNA sequences) produced by the two programs, from the same chromatogram, could be so different.
A call for help
Luckily, there are many intelligent and thoughtful people who read this blog and also, many intellignent people who belong to the ABRF (Association of Biomedical Resource Facilities - i.e. core labs). So I called on my both readers and the ABRF listserve members for help. I got it. The commenters gave me some great advice and I took it.
A little background
I'm not going to describe everything about base-calling programs here. If you want to read more about quality values and what they mean, you can find a general description here. The point that I want to stress right now, is where the base calling programs work.
The KB base calling program works inside the DNA sequencing instrument to identify peaks and assign quality values. This is kind of a pain (I think) because you can't take chromatograms from other instruments and process them with KB to measure quality. Unfortunately, this kind of original chromatogram data can be hard to obtain. NCBI doesn't provide this kind of data in the trace archive and journals don't require that this kind of data be made available anywhere.
Phred, however, works with data after it's been stored in a chromatogram file by the sequencing instrument. This is nice, because you can base-call all your data with the same program and get comparable quality values. At least that's what I thought.
[The quality data and "traces" that you get from 454 flowgrams are a completely different thing and a subject for another day.]
What I learned by following advice and looking at my data:
First, as my commenter suggested, there are dye blobs at the beginning of the sequence. (Those are the very tall peaks at the beginning of the graph.)
2. Second, the overall signal strength might be a bit low - I had to increase the vertical height in FinchTV to see the peaks appear more clearly. The signal strengths were: A = 231, C = 194, G = 201, and T = 121.
Still both KB and phred were interpreting the same trace, although differently. If I compare the trace from KB and from phred, I can line up the peaks and see that the two plots are basically the same.
3. Third, phred appears to be making a lot of mistakes. For some reason, phred missed many of the A's and G's. I went through and counted all the base calls (in this stretch) that disagreed with the trace (light blue, below) and I found that 44 out of 75 (or 59%) of the base calls disagreed between the phred base call and the peak that was shown in the trace. The color of the peak corresponds should correspond to the base call (Green = A, Blue = C, Red = T, Black = G).
In contrast, the bases that are called by the KB basecaller are consistent with the peaks that I see in the trace. Wherever there's a black peak, for instance, I see a "g," wherever there's a green peak, I see an "a." This is how it should work.
One thing is good though, at least phred knew that it wasn't doing a good job with identifying the bases. Phred gave all these bases low quality scores. (A low quality score indicates a higher probability of a mistake). You can see in the graph that all the quality scores are below the blue line. (The blue line at the top marks the Q20 point, where only 1 base in 100 would be expected to be miscalled).
So, even if phred failed to identify the bases, phred was correct in assessing it's ability to identify them. It said there was a higher probability of miscalling bases and it was right.
(Sorry about anthropomorphizing the algorithm, of course the program doesn't know that it's making a mistake. It's just easier for me to describe it that way.)
What's the take home lesson?
1. I'm certainly grateful that are smart people who read this blog!
2. I really, really, really wish I could get original chromatogram data from the NCBI trace archive, so that I could either look at the original base calls or mess with the base callers and work out a good way to process it myself.
3. In cases where I have data from a 3730 DNA sequencing instrument, I might be more inclined to use the KB base calls. At least until I look at some more data and have a better feeling for it.
4. And - yes, when in doubt, I will always take a closer look at the traces.
Very interesting, and very relevant to some of my day work.
One thing to check: are you using the right chemistry files with phred? We just have one set that we leave-and-forget, but I know there are such things and they make a difference?
Also, have you contacted Phil Green (phrap's author) about this? It would be interesting to see his take on it.
One thing to check: are you using the right chemistry files with phred?
It's not that simple. The last version of phred came out in 2002. The ABI 3730 DNA sequencing instrument and the first version of the KB basecaller came out in 2003, so phred has never been "trained" with 3730 data.
I ended up investigating this in 2003 because we were hearing from our customers that phred was miscalling G's every now and then and saying that there were two G's when the trace only showed one.
At that time, I did test using different chemistries with the phredpar.dat file and I found that we could fix the problem by adding a new setting.
We also modified our Finch software so that our customers could have the option to use chromatogram data that had been base-called by the KB base caller, rather than reprocessing everything phred.
What makes this all difficult, is that phred doesn't work on the original chromatogram data. Phred processes data after it's already been processed by the KB base caller. My solution worked for the 2003 version of the KB base caller, but since then there's been at least three different versions of KB. To really make phred work reliably, you would have to re-train Phred every time a different version of KB comes on-line. And then, you'd have to add some kind of logic to phred so that it used the correct mobility information for each version of KB.
We've decided training phred every year is not the effort.
For the experiment that I referenced here, I would have just used the original chromatogram and not even bothered comparing the two base callers, if I hadn't ended up comparing the two by accident and encountered such a dramatic difference.
As far as whether or not we've asked Phil about this, I don't think that there's much point. The last study that Brent (the author of the phred paper) did was in 2000. It seems likely to us that, with new sequencing technologies coming on-line, and the human genome complete, phred development and updates have come to an end.
"4. And - yes, when in doubt, I will always take a closer look at the traces."
In my opinion, the *first* thing you should do, before analyzing any sequences, is look at the chromatograms. This will tell you whether it is worth your time to analyze sequence, or whether something is wrong with the raw data. It just takes a few seconds to gut check a chromatogram with base calls.
I partly agree. However, when you're looking at thousands of chromatograms, there are better statistical tools for determining if the data are likely to be suitable or not.
I wrote a paper (in pdf form) about this that shows one example, although we have many, many more.
I should add, too, that phred is such a commonly-used tool that it simply never occurred to me that it no longer worked well.
If I had only looked at the trace from phred, I would have just decided that the phred-called chromatograms weren't any good (which would be consistent with the trace data) and I would have thrown them away.
I completely understand when you say: "I really, really, really wish I could get original chromatogram data from the NCBI trace archive, so that I could either look at the original base calls or mess with the base callers and work out a good way to process it myself." I'd like to suggest that a somewhat analogous exists with genotyping data. Consider SNP genotyping: most people seem to accept that a genotype is a genotype, period, end of story. We don't really fret much about the fact that many (most?) commercial SNP genotyping platforms are essentially a black box where DNA goes in and a genotype comes out and who knows what goes on in between. One pretty much just has to trust that the (often proprietary) vendor software did the normalization and summarization (if there are redundant probes) and genotype clustering/calling correctly, and that the lab set appropriate QC thresholds (it's not uncommon for water to have a genotype!). That's not even considering the lower-level tasks of finding the (extremely small and getting smaller) edges of the spot and calculating the intensity (and sometimes color) and so on.
But anybody who's spent time calling genotype clusters (or, more recently, trying to identify CNVs from the same dataset) knows how problematic that can be. And now as typical GWA experiments have grown to several thousand samples and 500K+ SNPs -- potentially plus thousands more non-polymorphic CNV probes -- it's no longer possible for a human to review even a fraction of the results and we're forced to rely even more on that vendor-supplied "black box." So it's even more important that (1) we put in multiple layers of QC controls and checks and filter out truly unreliable data and (2) that we retain the primary data so that we can go back and check/compare/recall etc. whenever questions arise or vendor software changes or new & better methods become available as they inevitably will.
Of course this begs the question: what is the primary data? If it's the raw image data then we're in big trouble because it's so large (often many gigabytes per sample) that nobody seems to have a good plan for how to properly archive it, or to archive it at all. I suspect that most labs are simply deleting it as soon as the derived data files are generated (e.g., CEL/Affy or IDAT/Illumina). Evidently even NCBI has no current plans to accept raw data as, for instance, part of dbGaP deposits. Given the experience of other science fields wherein major discoveries relied on going back to old primary data (e.g., the "ozone hole" story - it was thought to be nothing more than noise until new methods applied to data from years past proved otherwise), we'd be fools NOT to preserve this data. Especially since the experiments have gotten so expensive, usually hundreds of thousands or even millions of dollars each, and often are using up the very last dregs of precious, irreplaceable DNA samples.
I won't even start on what to do about the terabytes/run of short-read sequence data from the next-gen systems that's about to deluge us, which mostly has people terrified - at least the informatics types like myself!
I agree 100% with what Lee says. Not keeping the raw data is almost criminal - millions of dollars of taxpayers dollars was spent collecting it and for a few thousand extra in storage costs we just throw it all away.
This is more than just a theoretically problem. Our company (Nucleics) sells a couple of DNA basecallers (LongTrace and PeakTrace) that can extract up to 30% more high quality bases from each read, but only by using the original raw trace data. If NCBI stored the original data then people would be able to go back into the database, extract the traces, and use our software to obtain more data from the reads and get better assemblies.
I got so angry about this waste I wrote a little program to convert the .ab1 trace files into the standard .ztr format, but which allows this new .ztr file to be converted back to the original .ab1 (basically all the raw data is stored in a custom field of the ztf file). This gives a compressed trace file (a 300kb .ab1 file is compressed down to a 80kb .ztr file) that would allow NCBI to store the original .ab1 file for very little extra in storage costs. I did contact NCBI about giving them this software (for free), but I never heard anything back from them. Lee has inspired me to contact them again :-)