The general steps in genome sequencing were presented in the earlier installments (there are links at the bottom of the page), but they're worth repeating, since each of the earlier steps has a bearing on the outcome of those that come later.
- Break the genome into lots of small pieces at random positions.
- Determine the sequence of each small piece of DNA.
- Use an assembly program to figure out which pieces fit together.
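To make the three steps concrete, here is a minimal Python sketch, not a real assembler. It assumes fragments can be represented as (start, end) coordinates on a genome of known length, and the function names `random_fragments` and `count_contigs` are hypothetical, invented for this illustration. It picks random start positions and then counts how many separate groups of overlapping fragments (contigs) result.

```python
import random

def random_fragments(genome_length, n_fragments, fragment_length, seed=0):
    """Return sorted (start, end) pairs with uniformly random start positions.

    Illustrative stand-in for a randomly sheared library; real fragments
    vary in length, which this sketch ignores.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_fragments):
        start = rng.randrange(0, genome_length - fragment_length + 1)
        fragments.append((start, start + fragment_length))
    return sorted(fragments)

def count_contigs(fragments, min_overlap=1):
    """Count groups of fragments that can be chained together by overlaps."""
    if not fragments:
        return 0
    ordered = sorted(fragments)
    contigs = 1
    current_end = ordered[0][1]
    for start, end in ordered[1:]:
        if current_end - start >= min_overlap:
            # This fragment overlaps the growing contig, so extend it.
            current_end = max(current_end, end)
        else:
            # No usable overlap: a new contig starts here.
            contigs += 1
            current_end = end
    return contigs

library = random_fragments(genome_length=10_000, n_fragments=200, fragment_length=500)
print(count_contigs(library, min_overlap=20))
```

Fragments that merely touch end to end, such as (0, 10) and (10, 20), share no bases, so the sketch counts them as two separate contigs.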
That first step, making a collection of DNA fragments (a library) with breakpoints at random positions, is of critical importance to the success of later steps. As you can see in the image below, if you're going to reconstruct a genome sequence from pieces of DNA, you want pieces that overlap at several different points. If you don't have clones that begin and end at random points, you can't put the genome back together again.
A few years ago, we had the opportunity to see someone test this idea and look at what happens with different methods of library preparation. We were hired as consultants to oversee some contract genome sequencing work and evaluate the quality of the sequencing operation. At that time, the broken bits of DNA for libraries were often prepared through sonication (1). Sonication involves bombarding something with sound waves. When a solution of DNA in a tube is sonicated, the DNA breaks at random positions. You can control the average size of the DNA pieces by changing the sonication time, but it's kind of a crude technique. So perhaps it's not surprising that every few years, someone will try other methods.
One other method for breaking DNA into pieces is to digest it with restriction enzymes. If you use restriction enzymes and limit the enzyme concentration or the digestion time, you can find conditions where the DNA gets cut at some sites and not at others. When only some of the potential cut sites are actually cut, we call this a "partial digest." It seemed likely to the sequencing company (this was their first contract) that the digestion sites would be random and that this method could be used for making a random library.
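A small Python sketch can show why a partial digest differs from random breakage. It is a simplified model under stated assumptions: each recognition site is cut independently with a fixed probability, and the cut is placed at the start of the site rather than at the enzyme's true cut position inside it. The function names and the `cut_probability` parameter are hypothetical. The recognition sequences used are TTTAAA for DraI and ATTAAT for AseI.

```python
import random
import re

# Recognition sequences for the two enzymes named in this post.
SITES = {"DraI": "TTTAAA", "AseI": "ATTAAT"}

def find_sites(seq, site):
    """Return the start position of every occurrence of a recognition site."""
    # The lookahead lets overlapping occurrences be reported too.
    return [m.start() for m in re.finditer(f"(?={site})", seq)]

def partial_digest(seq, site, cut_probability, seed=0):
    """Cut each site with the given probability; return (start, end) fragments.

    Simplification: the cut is placed at the start of the site.
    """
    rng = random.Random(seed)
    cuts = [p for p in find_sites(seq, site) if rng.random() < cut_probability]
    bounds = [0] + cuts + [len(seq)]
    return [(bounds[i], bounds[i + 1])
            for i in range(len(bounds) - 1)
            if bounds[i + 1] > bounds[i]]

seq = "GGGG" + "TTTAAA" + "CCCC" + "TTTAAA" + "GG"
print(find_sites(seq, SITES["DraI"]))
print(partial_digest(seq, SITES["DraI"], cut_probability=0.5))
```

However the probability is set, every fragment in this model begins and ends at a recognition site or at the end of the molecule. Which sites get cut varies from molecule to molecule, but the set of possible breakpoints is fixed by the sequence.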
What happens when you make a library by digesting DNA with restriction enzymes?
Two different libraries were prepared by using restriction digests: one used DraI and the other used AseI. DNA was isolated and sequenced from E. coli colonies that had been transformed with samples from each of the libraries. The chromatograms were loaded in the Finch® Server, processed through the standard analysis and assembly pipelines, and we looked at the results.
I'm going to present some of the results in a later post and for the moment concentrate on whether or not the fragments are random.
Did the libraries consist of random fragments?
Our past experience with partial restriction enzyme (RE) digests suggested that it might be difficult to control the extent of RE digestion, leading to bias in the start positions of reads relative to each other. On the basis of those past observations, we decided to test whether RE digest libraries in fact represented random or non-random subclones. To test this idea, we assembled the different libraries and looked at the positions where the reads aligned to the contigs.
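One simple way to put a number on this kind of test is to ask what fraction of reads share their start position with at least one other read. The Python sketch below is only an illustration and not the analysis performed by the Finch Suite. The function name `shared_start_fraction` and the example positions are made up.

```python
from collections import Counter

def shared_start_fraction(read_starts):
    """Fraction of reads whose start position is shared with another read.

    read_starts: contig coordinates where each read's alignment begins.
    """
    if not read_starts:
        return 0.0
    counts = Counter(read_starts)
    shared = sum(n for n in counts.values() if n > 1)
    return shared / len(read_starts)

# Made-up positions, for illustration only.
clustered_starts = [10, 10, 10, 50, 90]
scattered_starts = [3, 41, 77, 120, 158]
print(shared_start_fraction(clustered_starts))
print(shared_start_fraction(scattered_starts))
```

With random breakpoints and modest coverage, few reads would be expected to start at exactly the same base, so a high fraction hints at non-random fragmentation. A control library from the same organism gives a baseline to compare against.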
Figure 3 (from ref. 2) shows an example report, from the Geospiza Finch Suite, that identifies where reads align to a contig sequence. You can see that many reads begin and end at the same positions in the contig. We also graphed some of these results with DrawMap (3) so you can see where different reads from the AseI library align to the contig.
Alignment between reads from the AseI library and one of the contigs (2).
The genomic libraries that were prepared through partial digestion with restriction enzymes consisted of non-random fragments. But why would a library of short non-random fragments be a bad thing?
The image below, and also the DrawMap graph above, show that it will be difficult to assemble or reconstruct a DNA sequence with non-random fragments. If the fragments do not overlap each other, you can't join them together and you must do additional work and spend additional time to determine how the pieces fit together in the genome puzzle.
1. Green, E. 2001. Strategies for the systematic sequencing of complex genomes. Nature Reviews Genetics 2:573-583.
2. Porter, S., Slagel, J., and T. Smith. 2004. Analysis of Genomic DNA Library Quality with the Finch®-Server. Geospiza, Inc. You can download the paper as a pdf document from here: http://www.geospiza.com/research/white-papers.htm
Look in the middle of the page.
3. Smith, T.M., Lee, M.K., Szabo, C.I., Jerome, N., McEuen, M., Taylor, M., Hood, L., and M.C. King. 1996. Complete Genomic Sequence and Analysis of 117 kb of Human DNA Containing the Gene BRCA1. Genome Res. 6:1029-49.
You can see that many reads begin and end at the same positions in the contig.
I come across this a lot when blasting against trace files from the Drosophila sequencing project. But when I see the pattern, it's due to repetitive DNA. For example, if I blast a region containing a sequence that is present multiple times throughout the genome (either a duplicated region or a transposable element), I'll get traces that came from the region I'm searching with, along with traces from paralogous regions. The paralogous sequences will tend to have alignments that terminate at the same exact spot because that's where the duplicated region ends and unique sequence begins.
You're describing a different kind of thing and a different kind of experiment. I will get to repetitive DNA later in this series, but in this case, we were looking at a genome that doesn't contain repetitive DNA, so we knew that wasn't the problem.
Also, I forgot to mention that we had control experiments, where a library from this same organism was prepared by sonication. So, we knew the explanation for our results.
With what you're seeing, there are many different reasons why blast alignments will show reads beginning and ending at the same positions in contigs. Exons and introns are one reason, repeats are another. I will do more things with blast later on and go through some of these.