Basics: How do you sequence a genome? Part IV. How many reads does it take?

By sporte on January 30, 2007.

"How much do I love you?
I'll tell you no lie.
How deep is the ocean?
How high is the sky?"
- Irving Berlin

The other installments are here:
Part I: Introduction
Part II: Sequencing strategies
Part III: Reads and chromats
Part V: checking out the library

We all know that sequencing a genome must be a lot of work. But unlike love, it is something we can measure. In fact, an important part of genome sequencing is estimating just how much work needs to be done. This is especially important if you're the one paying for it or the one writing the grant proposal.

Coverage depth: or why do we sequence the same bit of DNA several times?
In the earlier installments, I described a bit about sequencing strategies and shotgun sequencing (for a nice review see Green, 1). In 1988, Lander and Waterman determined that if we were working with DNA that was fragmented at random positions (i.e., the stuff we use in shotgun sequencing), the number of fragments covering any given position would follow a Poisson distribution.

They derived and published a simple formula to serve as a starting point for estimating the coverage depth that would be needed in order to sequence a piece of DNA of a set length.

The probability that a base is not sequenced is given by:

$P_0 = e^{-c}$

where c = fold sequence coverage (c = LN/G), LN = number of bases sequenced (L = read length, N = number of reads), G = length of the target DNA, and e is the constant 2.718 (e = 2.718281828459).

If we use this formula, we can create a table that shows how much of a piece of DNA we can expect to have sequenced (1 - P₀) at a given average coverage depth (fold coverage).

Fold coverage	Percent of clone sequenced
0.25 x	22%
0.50 x	39%
0.75 x	53%
1 x	63%
2 x	86%
3 x	95%
4 x	98%
5 x	99.3%
6 x	99.75%
7 x	99.91%
8 x	99.97%
9 x	99.99%
10 x	99.995%

From Lander & Waterman (2).
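If you'd like to check the table yourself, here is a minimal Python sketch. It assumes only the formula above (fraction sequenced = 1 - e^-c); the function name `fraction_sequenced` is mine, not something from the Lander-Waterman paper. The printed values are unrounded, so they can differ in the last digit from a rounded table.

```python
import math


def fraction_sequenced(c):
    """Expected fraction of bases covered at least once at fold coverage c.

    P0 = e^-c is the chance that a base is missed, so 1 - P0 is the
    chance that it is sequenced.
    """
    return 1.0 - math.exp(-c)


# Same fold-coverage values as the table in the post.
for c in [0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    print(f"{c:>5} x  {100 * fraction_sequenced(c):.3f}%")
```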

You can see from the table above that with shotgun sequencing, we need an average coverage depth of about nine-fold before we can expect 99.99% of the bases in a DNA target (or genome) of a given size to be sequenced at least once (2).
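You can also run the calculation in the other direction: pick the fraction of the target you want covered, solve P₀ = e^-c for c, and then turn c into a number of reads with c = LN/G. The sketch below does that under ideal Lander-Waterman conditions. The function names are my own, and the 545-base read length is just the Celera average quoted later in this post, used here as an example value.

```python
import math


def coverage_for_fraction(target_fraction):
    """Fold coverage c needed so that 1 - e^-c equals target_fraction."""
    if not 0 <= target_fraction < 1:
        raise ValueError("target_fraction must be in [0, 1)")
    return -math.log(1.0 - target_fraction)


def reads_for_coverage(c, genome_length, read_length):
    """Number of reads N from c = L*N/G, rounded up to a whole read."""
    return math.ceil(c * genome_length / read_length)


c = coverage_for_fraction(0.9999)
print(f"Coverage for 99.99%: {c:.2f} x")
print("Reads at 9 x for a 4,000,000 bp target with 545-base reads:",
      reads_for_coverage(9, 4_000_000, 545))
```

This ignores everything that goes wrong in a real library, so treat the read count as a floor rather than a plan.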

These calculations represent ideal conditions. What happens in a real lab?

Your mileage may vary
We published a white paper a few years ago, in which we modified the Lander-Waterman calculations so that we could predict the number of reads that would be required under real laboratory conditions (3).

In a lab, when you make libraries for cloning, you don't get libraries of perfectly sized, perfect clones. You can get collections of clones that can look like this:

And of course, you don't know which clones are which until after you've started sequencing them. Some labs will sample different libraries at the beginning of a sequencing project, so that they can estimate the number of clones that are vector, E. coli, or contain short inserts.

In our paper (3), we derived a new equation, starting from the Lander-Waterman calculations, to take into account common artifacts such as short inserts, contamination with vector or E. coli DNA, or reads with poor quality data. This gives a more realistic estimate of the number of reads necessary to complete a sequencing project.

The formula that we derived was this:

Here are the pieces:

Rn represents the number of reads required to complete a project.
C is the coverage depth (from the Lander-Waterman table).
T is the length in nucleotides of the target DNA.
rL is defined as the average length of a read as measured by the number of bases with Phred quality values greater than 20 (learn about Phred).
Pf represents the pass rate, or the fraction of reads that pass a defined set of quality standards. We use Pf to exclude subclones that correspond to E. coli, vector sequences, and short inserts (defined here as reads that include both the 5' and 3' ends of a clone).
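To see how these pieces could fit together, here is a hedged Python sketch. It assumes the simplest relationship consistent with the definitions above, Rn = (C × T) / (rL × Pf): the total bases needed (coverage times target length) divided by the usable bases each attempted read delivers on average. That form is my assumption for illustration and may not match the white paper's equation exactly. The function name and the example inputs are also hypothetical.

```python
import math


def reads_required(coverage, target_length, read_length, pass_rate):
    """Estimate Rn under the ASSUMED form Rn = C*T / (rL*Pf).

    coverage      -- C, fold coverage taken from the Lander-Waterman table
    target_length -- T, length of the target DNA in nucleotides
    read_length   -- rL, average count of bases per read with Phred > 20
    pass_rate     -- Pf, fraction of reads passing quality standards (0-1]
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return math.ceil(coverage * target_length / (read_length * pass_rate))


# Hypothetical inputs: 9 x coverage of a 4,000,000 bp target,
# 600 good bases per read, 80% of reads passing.
print(reads_required(9, 4_000_000, 600, 0.8))
```

With a pass rate of 1 and rL equal to the full read length, this reduces to the ideal N = cG/L from the Lander-Waterman formula.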

Since we built all of these statistical measurements into the Finch® System (my company, Geospiza, has spent ten years working on it, so I have to write about it somewhere), we thought it would be interesting to look at the effect of read length and pass rate on the number of reads needed to complete a project and the project cost.

We determined the number of reads and the cost of sequencing a 4,000,000 base pair genome, with 9-fold coverage and an average cost of $2 per read (this might be cheaper today, I don't know, but it probably depends quite a bit on where you're getting the sequencing done).
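For a rough feel of how read length and pass rate move the bottom line, here is a small Python sketch. The genome size, coverage, and $2 per read come from the paragraph above. Everything else is an assumption: the relationship reads = C × T / (rL × Pf) is the simplest form consistent with the definitions given earlier, and the read lengths and pass rates in the loop are made-up example values, not the numbers from our white paper.

```python
import math

GENOME_BP = 4_000_000   # target size used in the post
COVERAGE = 9            # fold coverage used in the post
COST_PER_READ = 2.00    # dollars per read, as quoted in the post


def project_cost(read_length, pass_rate):
    """Return (reads, cost in dollars) under the ASSUMED C*T/(rL*Pf) form."""
    reads = math.ceil(COVERAGE * GENOME_BP / (read_length * pass_rate))
    return reads, reads * COST_PER_READ


print(f"{'rL':>6} {'Pf':>6} {'reads':>9} {'cost':>13}")
# Hypothetical read lengths (Phred > 20 bases) and pass rates.
for read_length in (400, 600, 800):
    for pass_rate in (0.6, 0.8, 1.0):
        reads, cost = project_cost(read_length, pass_rate)
        print(f"{read_length:>6} {pass_rate:>6.1f} {reads:>9,} ${cost:>12,.2f}")
```

Shorter reads or a lower pass rate both push the read count, and so the cost, up in inverse proportion.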

For some perspective on genome size, the reference E. coli genome (NC_008253) at the NCBI is 4,938,920 base pairs in size. Typical read lengths have gone up, but when the human genome was sequenced, the average read length at Celera was 545 bases (4), and the human genome is about 3 billion bases (the haploid size).

So, there you have it. The cost of sequencing is going down, but it's still a bit of work. If you're going to sequence E. coli, you should plan to generate at least 50,000 reads. And, if you should need software to keep track of them all, well, just like Tiggers, that's what we do best.

References:
1. Green, E. 2001. Strategies for the systematic sequencing of complex genomes. Nature Reviews Genetics 2:573-583.

2. Lander, E., Waterman, M. 1988. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 3:231-239.

3. Porter, S., Slagel, J., and Smith, T. 2004. Analysis of Genomic DNA Library Quality with the Finch®-Server. Geospiza, Inc. You can download the paper as a pdf document from here: http://www.geospiza.com/research/white-papers.htm
Look in the middle of the page.

4. Venter, J., et al. 2001. The Sequence of the Human Genome. Science 291:1304-1351.


$2 is a bit high per read. You could probably drop it under $1 per read with a high volume. (I haven't priced high volume sequencing, but I know my university offers sequencing from 96 well plates for under $2 per read. I'd imagine a higher volume would drop below $1 per read.)

But how small are those cloned regions? When shotgun sequencing eukaryotic genomes, the smallest fragments have a mean size of a couple kb -- very few will be small enough to sequence across in a single read (and that's what I'm assuming you mean by "small fragment").

We got the $2 per read value about 3 years ago from an ABRF core lab survey. I agree that it's probably cheaper now. I just decided it was easier to quote from our paper than do a thorough investigation of current pricing. It's also a little tricky to measure price per read in some ways. University labs are highly subsidized and funded off of grants, so it's my impression (and this may not be correct) that their customers don't really know the true costs.

On the size of the cloned regions, I am so glad you mentioned that. You'll just have to read the next post (or maybe our white paper).
