Basics: How do you sequence a genome?

About a week ago, I offered to answer questions about subjects that I've either worked with, studied or taught.

I haven't had many questions yet, but I can certainly answer the ones I've had so far. Today, I'll answer the first question:

How do you sequence a genome?

Before we get into the technical details, there are some other genomic questions that you might like answered.

How much does it cost to sequence a genome?

I remember when, in 2002, we were at the O'Reilly bioinformatics conference and heard Lee Hood challenge the DNA sequencing community to lower the cost of sequencing a human genome to $1000. It was all pretty exciting!

We're not there yet. But we're getting closer. I've heard secondhand, from one of our customers, that it costs about $10,000 to sequence an average-sized bacterial genome, once you've purchased your sequencers, bought your software, and built your lab. Just for a bit of perspective, an average bacterial genome is about 750 times smaller than the human genome.

I'll leave you to do the math, but I imagine it scales pretty well. Ten million for a human genome seems about right, especially considering the original version was estimated to cost about 3 billion dollars.
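For anyone who'd rather not do the math by hand, here is a minimal sketch in Python. It uses only the two figures quoted above and assumes that cost grows linearly with genome size, which is a rough guess rather than a measured fact. The constant and function names are made up for illustration.

```python
# Figures quoted in the post above; the linear scaling is an assumption.
BACTERIAL_GENOME_COST_USD = 10_000    # secondhand estimate for one bacterial genome
HUMAN_TO_BACTERIAL_SIZE_RATIO = 750   # human genome is roughly 750 times larger


def scaled_cost(cost_small: float, size_ratio: float) -> float:
    """Estimate the cost of a larger genome, assuming cost scales linearly with size."""
    return cost_small * size_ratio


human_estimate = scaled_cost(BACTERIAL_GENOME_COST_USD, HUMAN_TO_BACTERIAL_SIZE_RATIO)
print(f"Estimated cost of a human genome: ${human_estimate:,.0f}")
```

Under those assumptions the estimate comes out to $7.5 million, which is in the same ballpark as ten million.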

What kind of infrastructure do you need to have?

You will need lots of robots for pipetting and preparing DNA, DNA sequencing instruments, computers, and software for tracking samples, evaluating sequence quality, and assembling the sequences at the end.

Some of the other types of equipment will depend on the methods that you're using. If you're using an older method, you'll need autoclaves and special incubators for growing bacteria. If you're using a newer method, like pyrosequencing, you need to have a special clean room where you can work with a lower risk of contamination.

Fine, so how do you go about doing it?

This used to be an easier question to answer. But now that pyrosequencing (from 454) has come along, this answer isn't as simple.

Still, I can divide the steps into three general parts, and then, since there are some nice movies and Flash® animations on the internet, I will send you out to go watch them.

Here are the steps:

  • Break the genome into lots of small pieces at random positions.
  • Determine the sequence of each small piece of DNA.
  • Use an assembly program to figure out which pieces fit together.

The last two steps are a lot like determining what was written in the Dead Sea Scrolls.
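To make the three steps a little more concrete, here is a toy sketch in Python. It is not how a real assembly program works: real assemblers have to cope with sequencing errors, repeats, and both DNA strands, and this one ignores all of that. The "sequencing" step is faked by copying substrings out of a known string, and the function names (shred, overlap, greedy_assemble) are hypothetical. The assembly step is a simple greedy merge: repeatedly join the two pieces with the longest suffix-to-prefix overlap.

```python
import random


def shred(genome, read_len, n_reads, seed=0):
    """Steps 1 and 2 (faked): take reads of fixed length from random positions."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Step 3: merge the best-overlapping pair until no overlaps of min_len remain."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        # Discard pieces that are already contained inside a longer piece.
        contigs = [c for c in contigs
                   if not any(c != d and c in d for d in contigs)]
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Three hand-picked overlapping reads from a made-up 16-base "genome".
    reads = ["CTGAAGTC", "ATGCGTAC", "GTACCTGA"]
    print(greedy_assemble(reads))

    # Random reads may leave gaps, so the result can be several contigs.
    random_reads = shred("ATGCGTACCTGAAGTC", read_len=8, n_reads=10)
    print(greedy_assemble(random_reads))
```

With the three hand-picked reads, the pieces join into the single 16-base sequence they were cut from. With random reads, you may get more than one contig if some region happened not to be covered, which is one reason real projects sequence each region many times over.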

Stay tuned, there will be more.

And there is:
Part II: Sequencing strategies
Part III: Reads and chromats
Part IV: How many reads does it take?
Part V: checking out the library


I'm not sure if this is the right place to ask, but what is shotgun sequencing? I've always wondered how they're able to sequence AND differentiate different species...

This is a fine place to ask.

Shotgun sequencing is a strategy for determining a DNA sequence that involves breaking a DNA molecule into many smaller pieces, then determining the sequence of DNA in each piece, and finally using software to put the smaller pieces together into a longer piece.

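It's called "shotgun sequencing" because the DNA is broken at random positions, like the scatter of pellets from a shotgun, rather than being mapped first and sequenced in order.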

As far as differentiating between species, this is pretty easy to do. You know where you got your DNA sample, so you only need to distinguish between the DNA pieces that you're trying to sequence and DNA from the vector or from E. coli. That's pretty easy to do using standard sequence comparison programs like BLAST or cross_match.

I'll discuss shotgun sequencing in more detail in future posts on this subject.

Ten million for a human genome seems about right, especially considering the original version was estimated to cost about 3 billion dollars.

According to a current press release from Solexa:

Solexa expects its first-generation instrument, the 1G Genome Analyzer, to generate over a billion bases of DNA sequence per run and to enable human genome resequencing below $100,000 per sample, making it the first platform to reach this important milestone.

Their 1G machine can sequence 1 billion base pairs per run. It is a chip-based, massively parallel, modified Sanger sequencing method. The principle is depicted here:
http://www.solexa.com/technology/sbs.html
and
http://www.solexa.com/technology/demo.html

I'm not sure if this is the right place to ask, but what is shotgun sequencing? I've always wondered how they're able to sequence AND differentiate different species...

maybe this is a reference to metagenomics?

like in this paper:
http://www.sciencemag.org/cgi/content/abstract/304/5667/66

If you sequence DNA from a microbial community, there's a certain stretch of DNA that acts as a sort of tag for a bacterial species. The number of times you see the tag and each of its variants acts as a count of the abundance of different species, and you can make a phylogeny from them.

Sequences can still be assembled as normal; it's just difficult to know when you have a complete genome from a given species. In metagenomics, however, that isn't the goal. Instead, you want to look at which genes are present, which sequences are already in databases, which are novel, etc.

If there are only a couple of species, you can distinguish them by GC content or some other measure of base composition.

P-ter

Good point about the possible metagenomics slant to that question. You're right, in those instances you are not sequencing genomes, you're taking a sample and looking to find out what's present in that sample. Usually, people identify bacteria by looking at the genes for ribosomal RNA, but GC content is helpful, too.
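Since GC content came up twice, here is a minimal sketch in Python of what that measure is: the fraction of bases in a sequence that are G or C. The function names, the example reads, and the 0.5 cutoff are all made up for illustration. In practice, a sensible cutoff would depend on the species you expect in the sample, and it only helps when their base compositions differ clearly.

```python
def gc_content(seq: str) -> float:
    """Fraction of A/C/G/T bases in seq that are G or C (ambiguous bases such as N are ignored)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def split_by_gc(reads, cutoff=0.5):
    """Sort reads into low-GC and high-GC bins; the cutoff is a hypothetical choice."""
    low = [r for r in reads if gc_content(r) < cutoff]
    high = [r for r in reads if gc_content(r) >= cutoff]
    return low, high


if __name__ == "__main__":
    example_reads = ["ATATATTA", "GCGCGGCC", "ATTAAATG", "GGCCGCGT"]  # invented reads
    low_gc, high_gc = split_by_gc(example_reads)
    print("low GC: ", low_gc)
    print("high GC:", high_gc)
```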