What is the truth in DNA sequencing?

By sporte on August 14, 2007.

What do you do when base-callers disagree?

Okay DNA sequencing community, I want your help with this one. One of these sequences was called by phred and the other by the ABI KB base calling program.

Which one should I believe?

tags: DNA sequencing, DNA, base-calling programs

Sometimes I open up files and do short experiments just because - well, I'm curious. And sometimes I immediately wish I hadn't done that because what I opened looks like a larger can of worms than I really want to see.

These graphs show the quality of each base in a DNA sequence on the y-axis and the position of that base on the x-axis. For phred, a quality value corresponds to the probability that a base was identified correctly. A quality value of 20 means a 1% chance that the base is wrong, 30 corresponds to a 0.1% chance that it has been called incorrectly, and 40 means a one in 10,000 chance that the base has been misidentified. People accept values around 20, but want values around 40. (read more about phred)
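The numbers above follow from the phred definition Q = -10 * log10(P), where P is the probability that the base call is wrong. Here is a minimal Python sketch of the conversion in both directions; the function names are my own, chosen for illustration.

```python
import math

def phred_to_error_prob(q):
    """Convert a phred quality value to the probability that the call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a phred quality value."""
    return -10 * math.log10(p)

# Print the error probability for the three quality values discussed above.
for q in (20, 30, 40):
    print(q, phred_to_error_prob(q))
```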

These graphs were generated from the same chromatogram file, but processed by different base calling programs.

I won't tell you which graph was produced by which base caller, but the chromatogram was obtained in 2006 from an ABI 3730 DNA sequencing instrument.

In theory, these graphs should be identical, or at least very similar. Unfortunately, I'm not sure which of these graphs is the one that I should believe. One of the base callers is considerably more optimistic, quality-wise, than the other.

So I'm asking you. What are you using for base callers these days? How are you checking the accuracy of your data?


do you know the answer? I want to know the answer.

I don't know the answer, although I suspect most groups are not comparing qualities at all and are either using what they've always used or using the base calling program that comes with the DNA sequencing instrument.

I'm hoping someone will chime in and tell us that they've done some experiments, so I don't have to do them.

I don't have a solution either. One way I have dealt with such quality problems is to sequence the clone twice in most cases, merge the two sequences using Phrap, and use the Phrap quality scores to determine the sequence quality.

I'd open up the chromatogram file and take a look myself. It's the most important sanity-check you can do.

It's clear to me that at least one of the programs has simply failed to call the sequence properly. I bet if you align the two sequences, you'll find little or no similarity. If so, that should be obvious when you eyeball the chromatogram with the base calls overlaid.
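For a quick first look before opening the trace, the two base-called sequences can be compared with Python's standard library. This is only a rough sketch: `difflib` is a general text-matching module, not a sequence aligner, and the `similarity` function name is made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(seq_a, seq_b):
    """Return a rough similarity ratio between 0.0 and 1.0 for two base-called reads.

    The ratio is 2 * (matched characters) / (total characters in both reads).
    autojunk is disabled so that frequent bases are not treated as noise
    in long reads.
    """
    matcher = SequenceMatcher(None, seq_a.upper(), seq_b.upper(), autojunk=False)
    return matcher.ratio()
```

A ratio close to 1.0 suggests the two programs mostly agree on the bases and differ mainly in the quality values; a ratio near 0.0 suggests that at least one of them failed to call the read.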

Interestingly, the top trace is about double the length of the bottom one, and the only bases with a reasonable quality score are right at the start. A general rule of thumb is that the first few bases of any sequence are never any good - it's where you get all the dye blobs and other crud. So I would be less inclined to trust the top sequence, but wouldn't go further than that without looking at the trace.

I've seen this happen before though. The general pattern seems to be that the peak-finding algorithm gets misled by the noisy dye-blobby area at the start, and sets the expected spacing between bases wrong. That means it's unable to call the majority of the sequence, as it's looking for the peaks in the wrong place. This is again consistent with the first trace being much longer than the second.

Sequence something you know the answer to. Compare your base-called results to that and count the 'errors'. The correct base caller should have the fewest errors, and the Q scores should match the observed error frequencies.
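A rough Python sketch of that check follows. It assumes the called read and the known reference have already been aligned position by position with no gaps, which a real comparison would have to handle; the function name and the bin width of 10 are arbitrary choices for illustration.

```python
from collections import defaultdict

def calibration_table(called, quals, reference, bin_width=10):
    """Count errors per quality bin for a read compared to a known reference.

    Assumes called, quals and reference have the same length and are
    aligned position by position (no insertions or deletions).
    Returns {bin_start: (bases_in_bin, errors_in_bin, observed_error_rate)}.
    """
    if not (len(called) == len(quals) == len(reference)):
        raise ValueError("called, quals and reference must have the same length")

    totals = defaultdict(int)
    errors = defaultdict(int)
    for base, q, ref in zip(called, quals, reference):
        bin_start = (q // bin_width) * bin_width
        totals[bin_start] += 1
        if base.upper() != ref.upper():
            errors[bin_start] += 1

    return {b: (totals[b], errors[b], errors[b] / totals[b]) for b in sorted(totals)}
```

If the quality values are well calibrated, the observed error rate in the bin starting at 20 should fall between 1% and 0.1%, the rate in the bin starting at 30 between 0.1% and 0.01%, and so on. A base caller whose observed rates are consistently higher than that is the optimistic one.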

The best advice I can give you is that both algorithms probably suffice for most applications. Peter was correct to state that sometimes the algorithm gets "started off crudely" because of the initial 10-20 bp. The best thing to do is to align the sequence with a "template" sequence from GenBank if you have one (make sure to delete the beginning and end of the read; it makes for a better alignment). Then you can actually see the differences and systematically check the errors. If you don't have a template to compare against, use either algorithm and check the ambiguous reads yourself.
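As a rough illustration of that workflow, the Python sketch below trims a fixed number of bases from each end of a read and then lists where it differs from a template. The helper names and the trim lengths are hypothetical, and `difflib` stands in for a real alignment program here, so treat this as a sketch rather than a replacement for a proper aligner.

```python
from difflib import SequenceMatcher

def trim_ends(seq, head, tail):
    """Drop `head` bases from the start and `tail` bases from the end of a read."""
    end = len(seq) - tail
    return seq[head:end] if end > head else ""

def differences(template, read):
    """List the places where the read differs from the template.

    Returns tuples of (operation, template_position, template_bases, read_bases),
    where operation is 'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = SequenceMatcher(None, template, read, autojunk=False)
    return [
        (tag, i1, template[i1:i2], read[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Each reported difference is a position worth checking by eye in the chromatogram.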

The algorithms differ in the way they choose the confidence to assign to each peak. The code takes into account the bases and confidence both before and after each base it's assigning a value to. But like I said, both should be fine.

Well, I can't sequence this sample again because this sequence comes from an environmental sequencing project and the bacterial colonies have long since been autoclaved and consigned to the rubbish heap.

But, I did take a closer look at the traces and did find some differences.

I'll post an update later today and put a link from here.

phred is easily thrown off by a large peak early in the chromatogram - really large signal, usually a contaminant. Also, it has difficulty with short reads of < 120 or so. There aren't many parameters you can tweak to help phred. Here is one option you can include when you run phred:

-nonorm

"Disable phred trace normalization. This
option is not recommended unless the base
caller fails due to huge noise peaks
extending over a large region at the start
of the trace, as is characteristic of some
dye terminator reactions."

This does help quite a bit, but isn't a cure. Use with caution or use KB. KB is more robust in these cases.

Sorry - I got snipped. -nonorm is an option you can use when you call phred. Should not be used in all phred calls. Just the problematic ones where it may help.

Thanks Brad, Peter, Loc, C, Vidhya! I posted screen shots from the chromatograms and my conclusion here.

I am not clear. Can anyone explain it clearly?
Thanks
