Will the real DNA sequence please stand up?

By sporte on August 15, 2007.

Sometimes asking a question can be a mistake.

Especially when your question leads to more questions and having to question things that you didn't want to question, and pretty soon you begin to regret ever opening the file and looking at the data and asking the question in the first place.

Sigh. Take a deep breath.

Yesterday through a twist of fate, I ended up taking a look at the DNA sequences produced by two different base calling programs from the same chromatogram file, from an ABI 3730 DNA sequencing instrument. I thought they would be the same, or at least similar.

tags: DNA sequencing, DNA , base-calling programs

One of the base-calling programs, phred, is well-established in the DNA sequencing community. Phred was used for base-calling much of the original human genome sequence. However, it's been about 6 years since phred was last updated and it hasn't been evaluated thoroughly with newer sequencing instruments and chemistries.

In recent years, ABI developed a new base calling program, called "KB" that they sell with their DNA sequencing instruments.

I really don't know which program is better or which program people prefer, I only know that the ABI program probably gets updated for new instruments and that it's been a long time since phred has been changed.

So, it came as a complete surprise to me to find that the quality data (and the DNA sequences) produced by the two programs, from the same chromatogram, could be so different.

A call for help
Luckily, there are many intelligent and thoughtful people who read this blog and also, many intellignent people who belong to the ABRF (Association of Biomedical Resource Facilities - i.e. core labs). So I called on my both readers and the ABRF listserve members for help. I got it. The commenters gave me some great advice and I took it.

A little background
I'm not going to describe everything about base-calling programs here. If you want to read more about quality values and what they mean, you can find a general description here. The point that I want to stress right now, is where the base calling programs work.

The KB base calling program works inside the DNA sequencing instrument to identify peaks and assign quality values. This is kind of a pain (I think) because you can't take chromatograms from other instruments and process them with KB to measure quality. Unfortunately, this kind of original chromatogram data can be hard to obtain. NCBI doesn't provide this kind of data in the trace archive and journals don't require that this kind of data be made available anywhere.

Phred, however, works with data after it's been stored in a chromatogram file by the sequencing instrument. This is nice, because you can base-call all your data with the same program and get comparable quality values. At least that's what I thought.

[The quality data and "traces" that you get from 454 flowgrams are a completely different thing and a subject for another day.]

What I learned by following advice and looking at my data:

1. I used FinchTV (free from Geospiza) to look at the traces from the chromatogram file called with KB and the SCF file that was produced by phred and found some interesting things.

First, as my commenter suggested, there are dye blobs at the beginning of the sequence. (Those are the very tall peaks at the beginning of the graph.)

2. Second, the overall signal strength might be a bit low - I had to increase the vertical height in FinchTV to see the peaks appear more clearly. The signal strengths were: A = 231, C = 194, G = 201, and T = 121.

Still both KB and phred were interpreting the same trace, although differently. If I compare the trace from KB and from phred, I can line up the peaks and see that the two plots are basically the same.

3. Third, phred appears to be making a lot of mistakes. For some reason, phred missed many of the A's and G's. I went through and counted all the base calls (in this stretch) that disagreed with the trace (light blue, below) and I found that 44 out of 75 (or 59%) of the base calls disagreed between the phred base call and the peak that was shown in the trace. The color of the peak corresponds should correspond to the base call (Green = A, Blue = C, Red = T, Black = G).

In contrast, the bases that are called by the KB basecaller are consistent with the peaks that I see in the trace. Wherever there's a black peak, for instance, I see a "g," wherever there's a green peak, I see an "a." This is how it should work.

Click to view a larger image

One thing is good though, at least phred knew that it wasn't doing a good job with identifying the bases. Phred gave all these bases low quality scores. (A low quality score indicates a higher probability of a mistake). You can see in the graph that all the quality scores are below the blue line. (The blue line at the top marks the Q20 point, where only 1 base in 100 would be expected to be miscalled).

So, even if phred failed to identify the bases, phred was correct in assessing it's ability to identify them. It said there was a higher probability of miscalling bases and it was right.

(Sorry about anthropomorphizing the algorithm, of course the program doesn't know that it's making a mistake. It's just easier for me to describe it that way.)

What's the take home lesson?
1. I'm certainly grateful that are smart people who read this blog!

2. I really, really, really wish I could get original chromatogram data from the NCBI trace archive, so that I could either look at the original base calls or mess with the base callers and work out a good way to process it myself.

3. In cases where I have data from a 3730 DNA sequencing instrument, I might be more inclined to use the KB base calls. At least until I look at some more data and have a better feeling for it.

4. And - yes, when in doubt, I will always take a closer look at the traces.

More like this

Very interesting, and very relevant to some of my day work.

One thing to check: are you using the right chemistry files with phred? We just have one set that we leave-and-forget, but I know there are such things and they make a difference?

Also, have you contacted Phil Green (phrap's author) about this? It would be interesting to see his take on it.

One thing to check: are you using the right chemistry files with phred?

It's not that simple. The last version of phred came out in 2002. The ABI 3730 DNA sequencing instrument and the first version of the KB basecaller came out in 2003, so phred has never been "trained" with 3730 data.

I ended up investigating this in 2003 because we were hearing from our customers that phred was miscalling G's every now and then and saying that there were two G's when the trace only showed one.

At that time, I did test using different chemistries with the phredpar.dat file and I found that we could fix the problem by adding a new setting.

We also modified our Finch software so that our customers could have the option to use chromatogram data that had been base-called by the KB base caller, rather than reprocessing everything phred.

What makes this all difficult, is that phred doesn't work on the original chromatogram data. Phred processes data after it's already been processed by the KB base caller. My solution worked for the 2003 version of the KB base caller, but since then there's been at least three different versions of KB. To really make phred work reliably, you would have to re-train Phred every time a different version of KB comes on-line. And then, you'd have to add some kind of logic to phred so that it used the correct mobility information for each version of KB.

We've decided training phred every year is not the effort.

For the experiment that I referenced here, I would have just used the original chromatogram and not even bothered comparing the two base callers, if I hadn't ended up comparing the two by accident and encountered such a dramatic difference.

As far as whether or not we've asked Phil about this, I don't think that there's much point. The last study that Brent (the author of the phred paper) did was in 2000. It seems likely to us that, with new sequencing technologies coming on-line, and the human genome complete, phred development and updates have come to an end.

"4. And - yes, when in doubt, I will always take a closer look at the traces."

In my opinion, the *first* thing you should do, before analyzing any sequences, is look at the chromatograms. This will tell you whether it is worth your time to analyze sequence, or whether something is wrong with the raw data. It just takes a few seconds to gut check a chromatogram with base calls.

I partly agree. However, when you're looking at thousands of chromatograms, there are better statistical tools for determining if the data are likely to be suitable or not.

I wrote a paper (in pdf form) about this that shows one example, although we have many, many more.

I should add, too, that phred is such a commonly-used tool that it simply never occurred to me that it no longer worked well.

If I had only looked at the trace from phred, I would have just decided that the phred-called chromatograms weren't any good (which would be consistent with the trace data) and I would have thrown them away.

I completely understand when you say: "I really, really, really wish I could get original chromatogram data from the NCBI trace archive, so that I could either look at the original base calls or mess with the base callers and work out a good way to process it myself." I'd like to suggest that a somewhat analogous exists with genotyping data. Consider SNP genotyping: most people seem to accept that a genotype is a genotype, period, end of story. We don't really fret much about the fact that many (most?) commercial SNP genotyping platforms are essentially a black box where DNA goes in and a genotype comes out and who knows what goes on in between. One pretty much just has to trust that the (often proprietary) vendor software did the normalization and summarization (if there are redundant probes) and genotype clustering/calling correctly, and that the lab set appropriate QC thresholds (it's not uncommon for water to have a genotype!). That's not even considering the lower-level tasks of finding the (extremely small and getting smaller) edges of the spot and calculating the intensity (and sometimes color) and so on.

But anybody who's spent time calling genotype clusters (or, more recently, trying to identify CNVs from the same dataset) knows how problematic that can be. And now as typical GWA experiments have grown to several thousand samples and 500K+ SNPs -- potentially plus thousands more non-polymorphic CNV probes -- it's no longer possible for a human to review even a fraction of the results and we're forced to rely even more on that vendor-supplied "black box." So it's even more important that (1) we put in multiple layers of QC controls and checks and filter out truly unreliable data and (2) that we retain the primary data so that we can go back and check/compare/recall etc. whenever questions arise or vendor software changes or new & better methods become available as they inevitably will.

Of course this begs the question: what is the primary data? If it's the raw image data then we're in big trouble because it's so large (often many gigabytes per sample) that nobody seems to have a good plan for how to properly archive it, or to archive it at all. I suspect that most labs are simply deleting it as soon as the derived data files are generated (e.g., CEL/Affy or IDAT/Illumina). Evidently even NCBI has no current plans to accept raw data as, for instance, part of dbGaP deposits. Given the experience of other science fields wherein major discoveries relied on going back to old primary data (e.g., the "ozone hole" story - it was thought to be nothing more than noise until new methods applied to data from years past proved otherwise), we'd be fools NOT to preserve this data. Especially since the experiments have gotten so expensive, usually hundreds of thousands or even millions of dollars each, and often are using up the very last dregs of precious, irreplaceable DNA samples.

I won't even start on what to do about the terabytes/run of short-read sequence data from the next-gen systems that's about to deluge us, which mostly has people terrified - at least the informatics types like myself!

Thanks Lee,

well said!

I agree 100% with what Lee says. Not keeping the raw data is almost criminal - millions of dollars of taxpayers dollars was spent collecting it and for a few thousand extra in storage costs we just throw it all away.

This is more than just a theoretically problem. Our company (Nucleics) sells a couple of DNA basecallers (LongTrace and PeakTrace) that can extract up to 30% more high quality bases from each read, but only by using the original raw trace data. If NCBI stored the original data then people would be able to go back into the database, extract the traces, and use our software to obtain more data from the reads and get better assemblies.

I got so angry about this waste I wrote a little program to convert the .ab1 trace files into the standard .ztr format, but which allows this new .ztr file to be converted back to the original .ab1 (basically all the raw data is stored in a custom field of the ztf file). This gives a compressed trace file (a 300kb .ab1 file is compressed down to a 80kb .ztr file) that would allow NCBI to store the original .ab1 file for very little extra in storage costs. I did contact NCBI about giving them this software (for free), but I never heard anything back from them. Lee has inspired me to contact them again :-)

Cheers

Daniel

Advertisment

Donate

ScienceBlogs is where scientists communicate directly with the public. We are part of Science 2.0, a science education nonprofit operating under Section 501(c)(3) of the Internal Revenue Code. Please make a tax-deductible donation if you value independent science communication, collaboration, participation, and open access.

You can also shop using Amazon Smile and though you pay nothing more we get a tiny something.

Science 2.0

Science Codex

Universities Can Agree On All Hate Speech Except Antisemitism

More by this author

New home for Discovering Biology in a Digital World

October 30, 2017

Sometime in the next day or two, Scienceblogs will shut down. We've enjoyed the opportunity to blog here for the past 10+ years. Not to worry, @digitalbio and @finchtalk will continue blogging, but more so from their own site at Digital World Biology. The Scienceblogs posts have been…

Synbiobeta: The Future is Now

October 12, 2017

@synbiobeta concluded it’s #sbbsf17 annual meeting on synthetic biology Oct 5, 2017. The progress companies are making in harnessing biology as a platform for manufacturing and problem solving is world changing. Locations of Synbio Companies What is Synthetic Biology? Synthetic biology is a term…

Understanding the CRISPR Cas9 system

September 18, 2016

On Sept. 30th, I'm going to be co-presenting a Bio-Link webinar on Genome Engineering with CRISPR-Cas9 with Dr. Thomas Tubon from Madison College. If you're interested, Register here. Since my part will be to help our audience understand the basics of this system, I prepared a…

Zika virus, drug discovery, and student projects

March 8, 2016

It's well understood in science education that students are more engaged when they work on problems that matter. Right now, Zika virus matters. Zika is a very scary problem that matters a great deal to anyone who might want to start a family and greatly concerns my students. I…

DNA: it's in your blood

February 28, 2016

Did you know small fragments of DNA are circulating in your blood stream? These short pieces of DNA are left behind after cells self-destruct. This self-destruction, or apoptosis, is a normal process. In the case of fetal development, certain cells in our hands die, leaving behind individual…