Basics: How do you sequence a genome, part II

Considering how many genomes have been sequenced in the past decade, it seems amazing, in retrospect, that the first complete bacterial genome sequence was published only 12 years ago (1). Now, the Genome database at the NCBI lists 450 complete microbial genomes (procaryotes and archaea), 1476 genomes from eucaryotes, 2145 viruses, and genome sequences from 407 phage.

Much of the methodology used for sequencing DNA is designed to confront one big technical hurdle.

That is, we can only determine the sequence of a small piece of DNA at a time. This means that you must break a larger piece of DNA into smaller pieces, determine the sequence of each piece, and then put the sequences back together.

[Image: DNA_seq_challenge.gif]

Mapping vs. Shotgun
When people were sequencing smaller pieces of DNA, in the 80's, it was common to map the DNA first using restriction enzymes, so that you knew how the pieces fit together. At first, many insisted that this same strategy should be applied to genomes as well: genomes should be broken apart and each piece carefully mapped before sequencing began. On the other hand, there was Craig Venter, arguing that genome sequencing would be much quicker with a shotgun approach.

Thinking along the lines of a traditional laboratory, where labor is cheap and reagents are expensive, the mapping approach seemed pretty logical. Each piece of DNA would be carefully mapped, so you would know where it fit into a larger piece, and then sequenced. The downside of mapping first is that there's a cost in terms of time and labor. Currently, you can obtain sequences that are about 900 bases long, using ABI instruments and chemistry. This would mean that to sequence a genome, like that of E. coli, that's 4,638,858 bp in length (2), by mapping it first, you would need at least 6000 fragments that were well mapped. The shotgun approach, where DNA is broken into many overlapping pieces, each piece is sequenced, and computer programs figure out how the pieces fit together, turned out to be much faster and less costly in terms of labor.
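As a rough check on the fragment estimate above, here is a minimal back-of-the-envelope sketch in Python. It treats the genome as a straight line tiled end to end by 900-base fragments. The function name `min_fragments` and the `overlap_bp` parameter are made up for illustration, and the 135-base overlap in the second call is an arbitrary assumption, not a figure from the post.

```python
GENOME_BP = 4_638_858  # E. coli genome length quoted in the post (2)
READ_BP = 900          # approximate read length quoted in the post


def min_fragments(genome_bp, read_bp, overlap_bp=0):
    """Fewest fragments of length read_bp that tile genome_bp end to end,
    with each fragment overlapping the previous one by overlap_bp bases."""
    if genome_bp <= read_bp:
        return 1
    step = read_bp - overlap_bp  # new bases contributed by each extra fragment
    if step <= 0:
        raise ValueError("overlap_bp must be smaller than read_bp")
    remaining = genome_bp - read_bp  # bases left after the first fragment
    return 1 + -(-remaining // step)  # integer ceiling division


print(min_fragments(GENOME_BP, READ_BP))                  # 5155
print(min_fragments(GENOME_BP, READ_BP, overlap_bp=135))  # 6064
```

With no overlap at all, the floor is a little over 5,000 fragments. Even a modest assumed overlap between neighbors pushes the count past 6,000, which is in line with the rough figure given above.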

Today, genome sequencing uses a combination of mapping and shotgun sequencing. Large pieces of DNA, on the order of 150,000 bp, are first cloned in BACs (Bacterial Artificial Chromosomes). The positions of the BACs are mapped, so it's known where they fit relative to each other and where they overlap. Then the sequence of each BAC is determined using a shotgun strategy.
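As a toy illustration of the shotgun idea, where a program works out how overlapping pieces fit together, here is a minimal sketch of greedy overlap merging in Python. Real assemblers are far more sophisticated. The function names, the example fragments, and the `min_len` cutoff are all made up for illustration, and the sketch ignores sequencing errors, repeats, and reads from the opposite strand.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b.
    Returns 0 if the best match is shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap
    until no pair overlaps by at least min_len bases."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; remaining pieces stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three made-up overlapping "reads" from a 15-base sequence
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

The reads carry no position information. The order is recovered purely from the overlaps, which is why the shotgun approach can skip the mapping step for the pieces inside each BAC.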

I'll write more on the shotgun approach in the next post.

Read part I.
Part III: Reads and chromats
Part IV: How many reads does it take?
Part V: checking out the library

References:
1. Fraser CM, et al. 1995. "The minimal gene complement of Mycoplasma genitalium." Science. Oct 20;270(5235):397-403.

2. Koonin, E. 1997. "Big Time for Small Genomes." Genome Research, 7:418-421.

This would mean that to sequence a genome, like that of E. coli, that's 4,638,858 bp in length (2), by mapping it first, you would need at least 6000 fragments that were well mapped.

That's a bit misleading. I don't think anyone was ever suggesting that each sequence read be mapped. A pure shotgun approach would involve no mapping of clones. A non-shotgun approach would map each clone, digest the clones, and sequence the fragments from each clone. Each read isn't mapped, but the clone from which they came is mapped.

Today, genome sequencing uses a combination of mapping and shotgun sequencing. Large pieces of DNA, on the order of 150,000 bp, are first cloned in BACs (Bacterial Artificial Chromosomes). The positions of the BACs are mapped, so it's known where they fit relative to each other and where they overlap. Then the sequence of each BAC is determined using a shotgun strategy.

The hybrid strategy maps a subset of clones, but the clones are not digested and sequenced. Instead, only paired end reads are generated for lots of clones of multiple sizes (from a few kb to large BACs). Some of the large clones are mapped to anchor the scaffolds.

I was actually planning to write about this topic soon (the queue of intended posts is getting too long...). I may pump out a complementary post to yours.

That's a bit misleading. I don't think anyone was ever suggesting that each sequence read be mapped.

I didn't say "each read." I wrote "each fragment" (although I guess I should have specified that the fragments were clones). My crude estimate is also probably an underestimate, since at the time when these issues were most strongly debated (early-mid 90's) the reads were much shorter (more like 300-500 bp). You would also want a large number of clones so that you could have clones that overlapped.

I think a complementary post is a good idea. I will be posting more on this subject as well.