What do you do when base-callers disagree?
Okay DNA sequencing community, I want your help with this one. One of these sequences was called by phred and the other by the ABI KB base calling program.
Which one should I believe?
Sometimes I open up files and do short experiments just because - well, I'm curious. And sometimes I immediately wish I hadn't done that because what I opened looks like a larger can of worms than I really want to see.
These graphs show the quality of each base in a DNA sequence on the y axis and the position of that base on the x axis. For phred, a quality value corresponds to the probability that a base has been identified correctly. A quality value of 20 means a 1% chance that the base is wrong, 30 corresponds to a 0.1% chance that it has been called incorrectly, and 40 means a one in 10,000 chance that the base has been misidentified. People accept values around 20, but want values around 40. (read more about phred)
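The numbers above follow the standard phred relationship Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. Here is a minimal sketch of the conversion in both directions; the function names are just illustrative.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base with quality value q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality value corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# The three values quoted in the post.
for q in (20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```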
These graphs were generated from the same chromatogram file, but processed by different base calling programs.
I won't tell you which graph was produced by which base caller, but the chromatogram was obtained in 2006 from an ABI 3730 DNA sequencing instrument.
In theory, these graphs should be identical, or at least very similar, but unfortunately, I'm not sure which of these graphs is the one that I should believe. One of the base callers is considerably more optimistic, quality-wise, than the other.
So I'm asking you. What are you using for base callers these days? How are you checking the accuracy of your data?
Do you know the answer? I want to know the answer.
I don't know the answer although I suspect most groups are not comparing qualities at all and either using what they've always used or using the base calling program that comes with the DNA sequencing instrument.
I'm hoping someone will chime in and tell us that they've done some experiments, so I don't have to do them.
I do not have a solution either. One way I have handled such quality problems was to sequence the clone in duplicate in most cases, merge the two sequences using Phrap, and use the Phrap quality scores to determine the sequence quality.
I'd open up the chromatogram file and take a look myself. It's the most important sanity-check you can do.
It's clear to me that at least one of the programs has simply failed to call the sequence properly. I bet if you align the two sequences, you'll find little or no similarity. If so, that should be obvious when you eyeball the chromatogram with the base calls overlaid.
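A quick way to test that prediction before opening the trace is to score how similar the two sets of base calls are. The sketch below uses Python's difflib as a rough stand-in for a real sequence aligner, and the two short reads are made-up placeholders, not data from this chromatogram.

```python
from difflib import SequenceMatcher

def call_similarity(seq_a, seq_b):
    """Rough similarity between two base-called sequences, from 0.0 to 1.0.

    Uses difflib's matching-block ratio, not a biological alignment,
    so treat the number as a sanity check only.
    """
    matcher = SequenceMatcher(None, seq_a.upper(), seq_b.upper(), autojunk=False)
    return matcher.ratio()

# Hypothetical reads standing in for the two base callers' output.
caller_one = "ACGTTGCAAGGCTTAACG"
caller_two = "ACGTTGCAAGGCTTTACG"
print(f"similarity: {call_similarity(caller_one, caller_two):.2f}")
```

A score near 1.0 would suggest the callers mostly agree on the bases and differ mainly in the qualities they assign. A low score would point to one of them having lost track of the peaks.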
Interestingly, the top trace is about double the length of the bottom one, and the only bases with a reasonable quality score are right at the start. A general rule of thumb is that the first few bases of any sequence are never any good - it's where you get all the dye blobs and other crud. So I would be less inclined to trust the top sequence, but wouldn't go further than that without looking at the trace.
I've seen this happen before though. The general pattern seems to be that the peak-finding algorithm gets misled by the noisy dye-blobby area at the start, and sets the expected spacing between bases wrong. That means it's unable to call the majority of the sequence, as it's looking for the peaks in the wrong place. This is again consistent with the first trace being much longer than the second.
Sequence something you know the answer to. Compare your base-called results to that and count the 'errors'. The correct base caller should have the fewest errors, and the Q scores should match the observed error frequencies.
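The check described above could be sketched like this. It assumes the read has already been aligned to the known sequence without gaps, so that position i in one corresponds to position i in the other; the function name, the bin width, and the toy data are all illustrative.

```python
from collections import defaultdict

def calibration_table(called, reference, quals, bin_width=10):
    """Count mismatches against a known reference, grouped by quality bin.

    called, reference: equal-length strings, assumed pre-aligned without gaps.
    quals: one integer quality value per called base.
    Returns {bin_start: (errors, total, observed_error_rate)}.
    """
    if not (len(called) == len(reference) == len(quals)):
        raise ValueError("called, reference and quals must have the same length")
    bins = defaultdict(lambda: [0, 0])  # bin start -> [errors, total]
    for c, r, q in zip(called.upper(), reference.upper(), quals):
        start = (q // bin_width) * bin_width
        bins[start][1] += 1
        if c != r:
            bins[start][0] += 1
    return {start: (errors, total, errors / total)
            for start, (errors, total) in sorted(bins.items())}

# Toy example: one mismatch at the last, low-quality base.
table = calibration_table("ACGTACGT", "ACGTACGA", [40, 40, 40, 40, 35, 35, 12, 8])
for start, (errors, total, rate) in table.items():
    print(f"Q{start}-{start + 9}: {errors}/{total} wrong, observed rate {rate:.2f}")
```

For a well-calibrated base caller, the observed rate in each bin should be close to 10^(-Q/10) for the Q values in that bin. A real test needs many reads, since a bin of Q40 bases needs on the order of 10,000 calls before a single error is even expected.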
The best advice I can give you is that both algorithms probably suffice for most applications. Peter was correct to state that sometimes the algorithm gets "started off crudely" because of the initial 10-20 bp. The best thing to do is to align the sequence with a "template" sequence from GenBank if you have one (make sure to delete the beginning and end of the read...makes for better alignment). Then you can actually see the differences and systematically check the errors. If you don't have a template to compare against, use either algorithm and check the ambiguous reads yourself.
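Deleting the beginning and end of the read can also be done from the quality values instead of by eye. This is a minimal sketch, assuming a simple per-base cutoff of 20 (the "acceptable" value mentioned in the post); the function name and the toy read are hypothetical.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Drop leading and trailing bases whose quality is below min_q.

    Only the ends are trimmed; low-quality bases in the middle are kept.
    Returns the trimmed sequence and its matching quality values.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

# Toy read with noisy ends.
trimmed_seq, trimmed_quals = trim_low_quality_ends(
    "NNACGTNN", [5, 8, 30, 40, 35, 25, 9, 4]
)
print(trimmed_seq, trimmed_quals)
```

A single good base inside a noisy start will stop this trimming early, so a windowed average would be more robust for real traces.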
The algorithms differ in the way they choose the confidence to assign to each peak. The code takes into account the bases and confidence both before and after each base it's assigning a value to. But like I said...both should be fine.
Well, I can't sequence this sample again because this sequence comes from an environmental sequencing project and the bacterial colonies have long since been autoclaved and consigned to the rubbish heap.
But, I did take a closer look at the traces and did find some differences.
I'll post an update later today and put a link from here.
phred is easily thrown off by a large peak early in the chromatogram - really large signal, usually a contaminant. Also, it has difficulty with short reads of < 120 or so. There aren't many parameters you can tweak to help phred. Here is one option you can include when you run phred:
"Disable phred trace normalization. This
option is not recommended unless the base
caller fails due to huge noise peaks
extending over a large region at the start
of the trace, as is characteristic of some
dye terminator reactions."
This does help quite a bit, but isn't a cure. Use with caution or use KB. KB is more robust in these cases.
Sorry - I got snipped. -nonorm is an option you can use when you call phred. Should not be used in all phred calls. Just the problematic ones where it may help.
Thanks Brad, Peter, Loc, C, Vidhya! I posted screen shots from the chromatograms and my conclusion here.
I am not clear. Can anyone explain it clearly?