Shotgun Sequencing a Eukaryotic Genome

By evolgen on February 7, 2007.

Shotgun sequencing refers to the process whereby a genome is sequenced and assembled with no prior information regarding the genomic location of any of the DNA we sequence. There are quite a few steps that you have to go through before you have an assembled genome sequence. We're going to cover isolating DNA, putting the DNA in bacteria, sequencing the DNA, and assembling those sequences into a complete genome.

Sandy has been running a series on sequencing genomes (parts 1, 2, 3, 4, 5, 6, 7). You should go check it out even if you read this post; while I'm going to deal with some of the basics of shotgun sequencing, she goes over some things that I will not. This post will cover how genome sequencing projects go from organisms to assembled genomes, but there are certain details that I will be leaving out that Sandy has explained quite well.

Isolating and Cloning DNA
Before we can begin sequencing DNA, we must isolate it from dead organisms (or parts of those organisms). There are multiple ways to isolate DNA, but they all involve breaking down the tissues and cells to isolate the nuclei (the membrane-bound intracellular structures containing the genomic DNA), then breaking down the nuclei to remove the DNA inside. Once we have the DNA, we can directly sequence it only if we have prior information regarding some of the sequence of the region we would like to sequence. If we want to sequence an entire genome, we will not have enough sequence information to directly sequence the genomic DNA.

Once we have isolated the DNA, we break it into fragments of different sizes (for reasons discussed below). Those fragments are then inserted into plasmids, small circular DNA molecules that bacteria carry separately from their own genome, and the plasmids are mixed with bacteria, some of which take them up. Because we know the sequences of those plasmids, we can easily sequence the fragments that are inserted into the plasmids (the fragment shown as a red block in the figure below). Each plasmid carrying one of our fragments is known as a clone.

Sequencing
In order to sequence the DNA the old-fashioned way (there are some newfangled techniques we won't deal with here), we use primers to initiate the sequencing reaction. Those primers are designed to match the known sequence of the plasmid flanking the region containing our DNA of interest that was inserted into the plasmid (shown as green arrows in the figure above). A single sequencing reaction can only read a few hundred nucleotides (DNA letters), and the genomic fragments are on the order of thousands of nucleotides. That means we don't get the entire sequence of the fragment, but we do generate sequences of the ends of the fragments (squiggly lines in the figure). Furthermore, we can keep a record of which clone each end sequence came from, so we know that each pair of end reads should be located in the same genomic region.
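To make the idea of paired end reads concrete, here's a minimal Python sketch of reading inward from both ends of one cloned fragment. The function names, the clone label, and the toy sequence are all made up for illustration, and the read length is far shorter than a real read of a few hundred nucleotides.

```python
# Toy illustration of paired end reads from one clone (all names hypothetical).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(clone_id, insert, read_length):
    """Simulate one read from each end of a cloned insert.

    The forward read is the first `read_length` bases of the insert.
    The reverse read is primed from the other side of the plasmid, so it
    comes off the opposite strand: the reverse complement of the last
    `read_length` bases.
    """
    forward = insert[:read_length]
    reverse = reverse_complement(insert[-read_length:])
    # Keeping the clone ID with both reads is what lets us pair them later.
    return {"clone": clone_id, "forward": forward, "reverse": reverse}


insert = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
pair = paired_end_reads("clone_001", insert, read_length=8)
print(pair)
```

The middle of the insert is never read. All we keep is the two end sequences and the fact that they came from the same clone.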

Assembling Shotgun Reads
This aspect of shotgun sequencing will receive the brunt of my focus. Hopefully I've set this up properly by describing end sequencing of clones, because that is the secret to shotgun sequencing. That's a hint to go back and read the previous paragraph and look at the previous figure if you skimmed it over. The sequencing strategy is important. Real important.

Once the DNA sequencing is completed, the sequences are assembled like a puzzle. Ideally, the fragments overlap each other so that the sequences that partially overlap each other will be joined together to form larger sequences. We also would like to have small, medium, and large fragments covering each region (see below). This process continues until all the overlapping sequences are assembled into a bunch of really long sequences known as contigs.
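The puzzle-style joining of overlapping reads can be sketched with a simple greedy merge. This is a toy under strong assumptions (error-free reads, all on the same strand, no repeats), not how any particular assembler works, and the function names and sequences are made up for illustration.

```python
# Toy greedy overlap assembly (hypothetical names; assumes error-free,
# same-strand reads and no repeats).
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            # Nothing left overlaps: what remains are the contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGTAC", "GTACGTTAGC", "TAGCCTAGG", "CCCGGGAAATTT"]
contigs = greedy_assemble(reads)
print(contigs)
```

The first three reads chain together into one contig, while the fourth shares no overlap with anything and stays on its own. That is the situation the next step has to deal with.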

But the contigs only cover portions of each chromosome, and the goal is to have a single sequence that covers an entire chromosome. For various reasons (including repetitive DNA and lack of sequence for all genomic regions) there are genomic sequences that fail to assemble into the contigs. The next best thing we can do is try to fit the contigs together into a single sequence known as a scaffold.

In the figure shown above, three contigs are combined into a single scaffold. The arrows indicate paired end reads of clones -- the red arrows are from one clone and the blue arrows are from a different clone. If multiple paired end reads are located at the ends of two contigs, we can join the contigs into a single scaffold. The red region of the scaffold is the sequence that came from the contigs, and the black region is the sequence inferred to be located between the contigs. Only we don't know what those black sequences are, so we fill them with unknown nucleotides and refer to that region as a gap. While it may seem problematic to introduce gaps into a genome assembly, we make up for the cost of the gaps because they allow us to assemble contigs into scaffolds.
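Here's a rough Python sketch of that joining step. It assumes the contigs are already oriented the same way and that each clone spanning two contigs has been boiled down to an (upstream contig, downstream contig, estimated gap) record. The names, the two-clone threshold, and the gap sizes are illustrative assumptions rather than anything from a real pipeline.

```python
# Toy scaffolding from paired-end links (hypothetical names and thresholds).
from collections import defaultdict


def build_scaffolds(contigs, links, min_links=2):
    """Chain contigs joined by at least `min_links` paired-end clones.

    contigs: dict mapping contig name -> sequence
    links:   list of (upstream, downstream, estimated_gap) tuples, one per
             clone whose two end reads landed in different contigs
    """
    gap_estimates = defaultdict(list)
    for upstream, downstream, gap in links:
        gap_estimates[(upstream, downstream)].append(gap)

    # Keep only joins supported by enough independent clones.
    next_contig = {}
    for (upstream, downstream), gaps in gap_estimates.items():
        if len(gaps) >= min_links:
            next_contig[upstream] = (downstream, round(sum(gaps) / len(gaps)))

    # A scaffold starts at a contig that no supported join points into.
    downstream_names = {name for name, _ in next_contig.values()}
    scaffolds = []
    for start in (n for n in next_contig if n not in downstream_names):
        name, pieces, seen = start, [contigs[start]], {start}
        while name in next_contig and next_contig[name][0] not in seen:
            name, gap = next_contig[name]
            seen.add(name)
            pieces.append("N" * gap)  # unknown bases between the contigs
            pieces.append(contigs[name])
        scaffolds.append("".join(pieces))
    return scaffolds


contigs = {"contig1": "ATGGCGTA", "contig2": "CCTAGGAT", "contig3": "GATTACAG"}
links = [
    ("contig1", "contig2", 4), ("contig1", "contig2", 6),
    ("contig2", "contig3", 3), ("contig2", "contig3", 3),
    ("contig3", "contig1", 5),  # a lone link: not enough support, ignored
]
scaffolds = build_scaffolds(contigs, links)
print(scaffolds)
```

Contigs with no supported links are simply left out of the result in this sketch. The runs of N are the gaps: we know roughly how long they are from the clone sizes, but not what they say.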

The scaffolds are then assigned to chromosomes using a few different strategies. If we know something about the molecular genetics of the organism we're studying, we can identify genes that have been previously mapped to chromosomes within our scaffolds. If a scaffold contains a gene (or, even better, multiple genes) that is known to be located on a particular chromosome, the scaffold most likely came from that chromosome. And if we know the order of the genes on the chromosome, we can designate an orientation to the scaffold and order multiple scaffolds on a chromosome.
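The logic of placing scaffolds with mapped genes can be sketched in a few lines of Python. Everything here is hypothetical: the gene names, the map positions, and the simple rule of taking the chromosome with the most mapped genes and comparing the first and last genes to pick an orientation.

```python
# Toy scaffold placement from previously mapped genes (all data hypothetical).
from collections import Counter


def place_scaffold(genes_on_scaffold, gene_map):
    """Return (chromosome, orientation, anchor) for one scaffold, or None.

    genes_on_scaffold: list of (gene, position_in_scaffold)
    gene_map:          dict gene -> (chromosome, map_position)
    """
    hits = [(pos, *gene_map[gene]) for gene, pos in genes_on_scaffold if gene in gene_map]
    if not hits:
        return None
    # The chromosome supported by the most mapped genes wins.
    chromosome = Counter(chrom for _, chrom, _ in hits).most_common(1)[0][0]
    # Map positions of the supporting genes, read in scaffold order.
    map_positions = [m for _, chrom, m in sorted(hits) if chrom == chromosome]
    if len(map_positions) < 2 or map_positions[0] == map_positions[-1]:
        orientation = "?"  # one gene cannot tell us which way round it goes
    elif map_positions[0] < map_positions[-1]:
        orientation = "+"
    else:
        orientation = "-"
    return chromosome, orientation, min(map_positions)


gene_map = {
    "geneA": ("chr2", 10.0), "geneB": ("chr2", 25.0),
    "geneC": ("chr2", 40.0), "geneD": ("chr2", 55.0),
    "geneE": ("chrX", 5.0),
}
scaffold_genes = {
    "scaffold_1": [("geneD", 1200), ("geneC", 48000)],
    "scaffold_2": [("geneA", 500), ("geneB", 90000)],
    "scaffold_3": [("geneE", 3000)],
}
placements = {name: place_scaffold(genes, gene_map) for name, genes in scaffold_genes.items()}
chr2_order = sorted(
    (name for name, p in placements.items() if p and p[0] == "chr2"),
    key=lambda name: placements[name][2],
)
print(placements)
print(chr2_order)
```

A scaffold with a single mapped gene can be assigned to a chromosome but not oriented, which is why multiple genes per scaffold are so much better.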

If we lack known genetic markers or the marker set is poor, we can take some of the clones that we sequenced and map them to chromosomes using in situ hybridization of clones to the actual chromosomes. This involves attaching a fluorescent tag to the cloned sequence and mixing the tagged clone with cells from the organism. The nuclei of the cells can be observed under a microscope and the chromosomes visualized. The clone will anneal to the chromosome from which it came, allowing you to map the clone to a particular chromosome. The contig and scaffold containing the sequences from that clone can then be assigned to that chromosome.

Once all of the sequences have been assembled into contigs, the contigs assembled into scaffolds, and the scaffolds assigned to chromosomes, we have a "draft" assembly of a genome -- not until we minimize the gaps to an acceptable standard can we call the assembly "finished". Most sequenced genomes do not make it beyond the draft stage, as the finishing process is expensive; a draft sequence is usually good enough for most genomes. The genome sequence then gets annotated, a process that involves finding genes and other sequences in the scaffolds. This is done using gene prediction algorithms, comparisons with other annotated genomes and known genes, and other computational techniques.
