One of my colleagues has a two part series on FinchTalk (starting today) that discusses uncertainty in measurement and what that uncertainty means for the present and Next Generation DNA sequencing technologies.
I've been running into this uncertainty myself lately.
I have always known that DNA sequencing errors occur. This is why people build tools for measuring the error rate and why quality measurements are so useful for determining which data to use and which data to believe. But, some of the downstream consequences didn't really hit home for me until a recent project. This project involves having students clone and sequence uncharacterized genes from genomic DNA. My part of the project was to do some research and write the bioinformatics section of the student lab manual.
One of the steps in this process involves using shorter DNA sequences to reconstruct a longer sequence of DNA that we call a contig. We call this process "DNA sequence assembly" and we have to do it because of technical limitations.
This time, however, things are a bit different from my past experience in part because this time we have far less data. For many reasons, the quality of the student-generated chromatograms tends to be low, with only 25-50% of the files containing usable data. This means that each student or lab group only has about three to four reads that they can assemble to create their contig. In some cases, this also means that they might only get the sequence from a single strand.
Since I've been testing the project to find out how things will work for the students, I've been doing many of these assemblies with different small data sets and reviewing the results. It's been quite surprising to realize how frequently errors occur.
I'm finding the errors by two different methods. First, I can detect errors when I look at the assemblies. In the case below, I found a position where one read had a deletion relative to the other. When I reviewed the trace in FinchTV, I could see that the base-caller had missed that A. When I find errors like that, I edit the reads in FinchTV to fix the sequence of bases and save my changes back to the iFinch database.
The other place where I detect errors is the step where we compare our proposed genomic sequence to a set of reference mRNAs. In this case, when I look at the blastn results, I can sometimes see alignments that look like this:
In this case, you can see that all of the sequences below my query (shown at the top) have an extra T or C that my query is missing. Again, I go to FinchTV and review the trace to find out if there should be another base in my read that somehow got missed.
I know it's strange, but despite all the assemblies that I've done, it's working with these small assemblies that has really impressed upon me the need for lots of redundant data. Now, I know what people mean when they say that they minimize errors by collecting more data. I think one of the benefits of this project is that students are going to learn why many of us are excited about Next Generation sequencing technology. The more data we collect, the more we can confirm our results.
I'm certain, in the future, we won't be quite as uncertain.
From your description, it sounds like the students are sequencing plasmids. How are they isolating the DNA? DNA isolation has produced the biggest variability in sequencing results in my experience.
Back in the day when we were actually doing the sequencing ourselves, the senior graduate student could sequence his boiling mini-prep DNA and I couldn't get mine to work. Turned out that he carefully matched his isopropanol volumes to his supernatant volumes and I was ... less than careful. Once I made the adjustment, I could sequence my boiling preps, too.
The students are cloning genomic DNA via nested PCR, and then using PCR and doing Sanger sequencing from their clones.
I'm not worried about the quality of their data. We can identify poor quality chromatograms and discard the ones with too few Q20 bases.
I was just surprised to see how many errors there were in good quality sequences.
It does make sense: if a quality value of Q20 means that one base in 100 could be a mistake, then a read with over 800 bases, each with a quality value of 20, could easily contain 8 mistakes. It's just astonishing to see this in practice and not just in theory.
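That back-of-the-envelope calculation follows directly from the Phred quality scale, where a quality value Q corresponds to an error probability of 10^(-Q/10). A minimal sketch (the function names here are just illustrative, not from any particular toolkit):

```python
def error_probability(q):
    """Per-base error probability for a Phred quality value Q,
    using the standard Phred relationship p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given a list
    of per-base Phred quality values (sum of the per-base error
    probabilities)."""
    return sum(error_probability(q) for q in qualities)

# A Q20 base has a 1-in-100 chance of being wrong.
print(error_probability(20))

# An 800-base read where every base is Q20 carries about
# 8 expected errors, exactly as described above.
print(expected_errors([20] * 800))
```

The same arithmetic shows why higher quality values matter so much: at Q30 the per-base error probability drops tenfold, so that same 800-base read would carry fewer than one expected error.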
I run a DNA Sequencing core lab and this article was of particular interest to me. I think error rate is definitely something to keep in mind but, as Ron points out, the method (and quality) of isolation is probably the biggest factor in sequence quality. There is also the brand of instrument, age of reagents, competence of the technician, etc. to think about. You also want to watch what region of the read you're looking at. The beginnings and ends can contain a lot of miscalls. I often tell my customers that their sequence is like a stick of celery. You cut off the leafy part at the top and the part at the bottom that was stuck in the ground and you now have yourself a nice piece of celery. I don't see the error rate with our control that, according to this article, should be occurring. However, Sandra does well in encouraging the researcher to check the quality of the chromatograms and verify. Many people rely only on the text sequence.