Shotgun Sequencing a Eukaryotic Genome

Shotgun sequencing refers to the process whereby a genome is sequenced and assembled with no prior information regarding the genomic location of any of the DNA we sequence. There are quite a few steps that you have to go through before you have an assembled genome sequence. We’re going to cover isolating DNA, putting the DNA in bacteria, sequencing the DNA, and assembling those sequences into a complete genome.

Sandy has been running a series on sequencing genomes (parts 1, 2, 3, 4, 5, 6, 7). You should go check it out even if you read this post; while I’m going to deal with some of the basics of shotgun sequencing, she goes over some things that I will not. This post will cover how genome sequencing projects go from organisms to assembled genomes, but there are certain details that I will be leaving out that Sandy has explained quite well.

Isolating and Cloning DNA
Before we can begin sequencing DNA, we must isolate it from dead organisms (or parts of those organisms). There are multiple ways to isolate DNA, but they all involve breaking down the tissues and cells to isolate the nuclei (the membrane-bound intracellular structures containing the genomic DNA), then breaking down the nuclei to release the DNA inside. Once we have the DNA, we can directly sequence it only if we have prior information regarding some of the sequence of the region we would like to sequence. If we want to sequence an entire genome, we will not have enough sequence information to directly sequence the genomic DNA.

Once we have isolated the DNA, we break it into fragments of different sizes (for reasons discussed below). Those fragments are inserted into plasmids, small circular DNA molecules that exist outside the bacterial genome, and the plasmids are then mixed with bacteria, some of which take them up. Because we know the sequences of those plasmids, we can easily sequence the fragments that are inserted into the plasmids (the fragment shown as a red block in the figure below). Each of those plasmids is known as a clone.


In order to sequence the DNA the old-fashioned way (there are some newfangled techniques we won’t deal with here), we use primers to initiate the sequencing reaction. Those primers are designed to match the known sequence of the plasmid flanking the region where our DNA of interest was inserted (shown as green arrows in the figure above). A single sequencing reaction can only read a few hundred nucleotides (DNA letters), and the genomic fragments are on the order of thousands of nucleotides. That means we don’t get the entire sequence of the fragment, but we do generate sequences of the ends of the fragments (squiggly lines in the figure). Furthermore, we can keep a record of which clone each end sequence came from, so we know that each pair of end reads should be located in the same genomic region.
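To make the end-read bookkeeping concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the function names, the clone ID, the 20-letter insert, and the 6-letter read length (real inserts run to thousands of nucleotides and real reads to a few hundred). It only shows the idea that each clone yields two short reads, one from each end, and that both are tagged with the clone they came from.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def end_reads(clone_id, insert, read_length):
    """Return the two end reads of a cloned fragment, tagged with its clone ID.

    The forward read starts at the left end of the insert. The reverse read
    starts at the right end and is read off the opposite strand, so it is the
    reverse complement of the last `read_length` bases of the insert.
    """
    forward = insert[:read_length]
    reverse = reverse_complement(insert[-read_length:])
    return [(clone_id, "fwd", forward), (clone_id, "rev", reverse)]


# Hypothetical toy insert; the middle of it is never sequenced.
insert = "ATGGCGTACCTTGAGCATCA"
reads = end_reads("clone_001", insert, read_length=6)
for clone_id, direction, sequence in reads:
    print(clone_id, direction, sequence)
```

Because both reads carry the same clone ID, we can later look up a read's partner, which is exactly the record keeping that pays off during assembly.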

Assembling Shotgun Reads
This aspect of shotgun sequencing will receive the brunt of my focus. Hopefully I’ve set this up properly by describing end sequencing of clones, because that is the secret to shotgun sequencing. That’s a hint to go back and read the previous paragraph and look at the previous figure if you skimmed them. The sequencing strategy is important. Real important.

Once the DNA sequencing is completed, the sequences are assembled like a puzzle. Ideally, the fragments overlap each other, so that any two sequences sharing the same stretch of DNA at their ends can be joined together to form a larger sequence. We also would like to have small, medium, and large fragments covering each region (see below). This process continues until all the overlapping sequences are assembled into a bunch of really long sequences known as contigs.
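The overlap-and-join idea can be sketched in a few lines of Python. This is a toy, not how any particular assembler works: I'm assuming error-free reads that all come from the same strand and a genome with no repeats, and I'm using a simple greedy rule (always merge the pair with the longest overlap). The names `overlap`, `greedy_assemble`, and `min_length`, and the reads themselves, are hypothetical.

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_length` letters exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=4):
    """Repeatedly merge the two sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap(left, right, min_length)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            # Nothing left to join: whatever remains are the contigs.
            return sorted(contigs)
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


# Hypothetical reads from two regions that do not overlap each other.
reads = ["ATGGCGTA", "CGTACCTT", "CCTTGAGC", "TTTAAACCGG", "ACCGGGATA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The five reads collapse into two contigs because the first three overlap in a chain, the last two overlap each other, and nothing connects the two groups.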


But the contigs only cover portions of each chromosome, and the goal is to have a single sequence that covers an entire chromosome. For various reasons (including repetitive DNA and lack of sequence for all genomic regions) there are genomic sequences that fail to assemble into the contigs. The next best thing we can do is try to fit the contigs together into a single sequence known as a scaffold.


In the figure shown above, three contigs are combined into a single scaffold. The arrows indicate paired end reads of clones: the red arrows are from one clone and the blue arrows are from a different clone. If multiple paired end reads are located at the ends of two contigs, we can join the contigs into a single scaffold. The red region of the scaffold is the sequence that came from the contigs, and the black region is the sequence inferred to be located between the contigs. We don’t know what those black sequences are, though, so we fill them with unknown nucleotides and refer to each such region as a gap. While it may seem problematic to introduce gaps into a genome assembly, the cost of the gaps is outweighed by the fact that they allow us to assemble contigs into scaffolds.
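Here is a rough Python sketch of that joining step. It assumes we have already worked out which contig each end read landed in, so the input is just a list of links, one per clone whose two end reads fall in different contigs. The estimated gap size attached to each link is an assumption on my part (in practice it would come from knowing roughly how long the cloned fragment is), and the function name, contig names, and `min_links` threshold are all hypothetical. The sketch also assumes the contigs form one simple linear chain.

```python
from collections import defaultdict


def build_scaffold(contigs, mate_links, min_links=2):
    """Join contigs into one scaffold, filling gaps with N (unknown letters).

    contigs:    dict mapping contig name -> sequence
    mate_links: list of (left_contig, right_contig, estimated_gap) tuples,
                one per clone whose end reads fall in two different contigs
    min_links:  number of supporting clones required before joining
    """
    gaps_by_join = defaultdict(list)
    for left, right, gap in mate_links:
        gaps_by_join[(left, right)].append(gap)

    # Keep only joins supported by enough clones; average their gap sizes.
    next_contig = {}
    for (left, right), gaps in gaps_by_join.items():
        if len(gaps) >= min_links:
            next_contig[left] = (right, round(sum(gaps) / len(gaps)))

    # The chain starts at the contig that is never on the right of a join.
    joined_on_right = {right for right, _ in next_contig.values()}
    starts = [name for name in next_contig if name not in joined_on_right]
    if len(starts) != 1:
        raise ValueError("this sketch expects exactly one linear chain")

    current = starts[0]
    scaffold = contigs[current]
    visited = {current}
    while current in next_contig:
        current, gap = next_contig[current]
        if current in visited:
            raise ValueError("links form a cycle")
        visited.add(current)
        scaffold += "N" * gap + contigs[current]
    return scaffold


# Hypothetical toy data: three contigs and five clone links.
contigs = {"c1": "ATGGCG", "c2": "TTGAGC", "c3": "GGATAC"}
mate_links = [
    ("c1", "c2", 3), ("c1", "c2", 5),  # two clones support c1 -> c2
    ("c2", "c3", 2), ("c2", "c3", 2),  # two clones support c2 -> c3
    ("c1", "c3", 9),                   # a single clone: not enough support
]
scaffold = build_scaffold(contigs, mate_links)
print(scaffold)
```

Requiring more than one supporting clone per join mirrors the "multiple paired end reads" condition above: a single stray link is ignored rather than trusted.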

The scaffolds are then assigned to chromosomes using a few different strategies. If we know something about the molecular genetics of the organism we’re studying, we can identify genes that have been previously mapped to chromosomes within our scaffolds. If a scaffold contains a gene (or, even better, multiple genes) that is known to be located on a particular chromosome, the scaffold most likely came from that chromosome. And if we know the order of the genes on the chromosome, we can assign an orientation to the scaffold and order multiple scaffolds on a chromosome.
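As a hedged illustration of that logic, here is a small Python sketch. The gene names, map positions, scaffold names, and the function `place_scaffold` are all invented for the example. I'm assuming we already know where each mapped gene sits on each scaffold, and the rule for picking a chromosome (a simple majority vote among the genes found) is my own simplification.

```python
from collections import Counter


def place_scaffold(marker_hits, genetic_map):
    """Assign a scaffold to a chromosome using previously mapped genes.

    marker_hits: list of (gene, position on the scaffold)
    genetic_map: dict mapping gene -> (chromosome, position on the map)
    Returns (chromosome, orientation, anchor map position), or None if the
    scaffold contains no mapped genes.
    """
    mapped = [
        (scaffold_pos,) + genetic_map[gene]
        for gene, scaffold_pos in marker_hits
        if gene in genetic_map
    ]
    if not mapped:
        return None

    # The chromosome supported by the most genes wins.
    chromosome = Counter(chrom for _, chrom, _ in mapped).most_common(1)[0][0]

    # Walk along the scaffold and see whether map positions rise or fall.
    on_chromosome = sorted(
        (pos, map_pos) for pos, chrom, map_pos in mapped if chrom == chromosome
    )
    first_map_pos = on_chromosome[0][1]
    last_map_pos = on_chromosome[-1][1]
    if last_map_pos > first_map_pos:
        orientation = "+"
    elif last_map_pos < first_map_pos:
        orientation = "-"
    else:
        orientation = "unknown"  # a single gene cannot orient a scaffold

    anchor = min(map_pos for _, map_pos in on_chromosome)
    return chromosome, orientation, anchor


# Hypothetical genetic map and gene locations on three scaffolds.
genetic_map = {
    "geneA": ("chr2", 10.0),
    "geneB": ("chr2", 25.0),
    "geneC": ("chr2", 40.0),
    "geneD": ("chr3", 5.0),
}
scaffold_hits = {
    "scaffold_1": [("geneB", 1000), ("geneA", 9000)],
    "scaffold_2": [("geneC", 500)],
    "scaffold_3": [("geneX", 100)],  # geneX has never been mapped
}
placements = {
    name: place_scaffold(hits, genetic_map)
    for name, hits in scaffold_hits.items()
}
chr2_order = sorted(
    (name for name, p in placements.items() if p and p[0] == "chr2"),
    key=lambda name: placements[name][2],
)
print(placements)
print(chr2_order)
```

Note that a scaffold with only one mapped gene can be assigned to a chromosome but not oriented, and a scaffold with no mapped genes cannot be placed at all this way.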

If we lack known genetic markers or the marker set is poor, we can take some of the clones that we sequenced and map them to chromosomes using in situ hybridization of clones to the actual chromosomes. This involves attaching a fluorescent tag to the cloned sequence and mixing the tagged clone with cells from the organism. The nuclei of the cells can be observed under a microscope and the chromosomes visualized. The clone will anneal to the chromosome from which it came, allowing you to map the clone to a particular chromosome. The contig and scaffold containing the sequences from that clone can then be assigned to that chromosome.

Once all of the sequences have been assembled into contigs, the contigs assembled into scaffolds, and the scaffolds assigned to chromosomes, we have a “draft” assembly of a genome. Not until we minimize the gaps to an acceptable standard can we call the assembly “finished”. Most sequenced genomes do not make it beyond the draft stage, as the finishing process is expensive and a draft sequence is usually good enough. The genome sequence then gets annotated, a process that involves finding genes and other sequences in the scaffolds. This is done using gene prediction algorithms, comparisons with other annotated genomes and known genes, and other computational techniques.