# Basics: How do you sequence a genome? Part IV. How many reads does it take?

"How much do I love you?
I'll tell you no lie.
How deep is the ocean?
How high is the sky?"
- Irving Berlin

The other installments are here:
Part I: Introduction
Part II: Sequencing strategies
Part V: checking out the library

We all know that sequencing a genome must be a lot of work. But unlike love, it is something we can measure. In fact, an important part of genome sequencing is estimating just how much work needs to be done. This is especially important if you're the one paying for it or the one writing the grant proposal.

## Coverage depth: or why do we sequence the same bit of DNA several times?
In the earlier installments, I described a bit about sequencing strategies and shotgun sequencing (for a nice review, see Green, 1). In 1988, Lander and Waterman determined that if we were working with DNA that was fragmented at random positions (i.e., the stuff we use in shotgun sequencing), the distribution of fragments would follow a Poisson distribution (2).

They derived and published a simple formula to serve as a starting point for estimating the coverage depth that would be needed in order to sequence a piece of DNA of a set length.

The probability that a base is not sequenced is given by:

• $P_0 = e^{-c}$, where $c$ = fold sequence coverage ($c = LN/G$),
• $LN$ = number of bases sequenced, i.e. $L$ = read length and $N$ = number of reads; $G$ = length of the target DNA; and the constant $e$ = 2.718281828459
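To make the formula concrete, here is a minimal Python sketch. The function names are mine, not Lander and Waterman's, and the code assumes the idealized case the formula describes: reads of uniform length landing at random positions.

```python
import math


def fold_coverage(read_length, num_reads, target_length):
    """c = L * N / G: the average number of times each base is sequenced."""
    return read_length * num_reads / target_length


def fraction_unsequenced(read_length, num_reads, target_length):
    """P0 = e^(-c): the probability that a given base is never sequenced."""
    return math.exp(-fold_coverage(read_length, num_reads, target_length))


def fraction_sequenced(read_length, num_reads, target_length):
    """1 - P0: the expected fraction of the target covered by at least one read."""
    return 1.0 - fraction_unsequenced(read_length, num_reads, target_length)
```

For example, 8,000 reads of 500 bases each over a 4,000,000 base target give one-fold coverage, so `fraction_sequenced(500, 8000, 4_000_000)` comes out at about 0.63.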

If we use this formula, we can create a table that shows how likely it is that we will obtain the complete sequence for a piece of DNA vs. the average coverage depth (reads per kb).

| Fold coverage | Percent of clone sequenced |
|---|---|
| 0.25 x | 22% |
| 0.50 x | 39% |
| 0.75 x | 53% |
| 1 x | 63% |
| 2 x | 88% |
| 3 x | 95% |
| 4 x | 98% |
| 5 x | 99.4% |
| 6 x | 99.75% |
| 7 x | 99.91% |
| 8 x | 99.97% |
| 9 x | 99.99% |
| 10 x | 99.995% |

From Lander & Waterman (2).

You can see from the table above that, with shotgun sequencing, we need an average coverage depth of nine-fold in order to be 99.99% certain of obtaining the complete sequence for a DNA target (or genome) of a given size (2).
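You can also run the formula backwards: pick the fraction of the target you want covered and solve $P_0 = e^{-c}$ for $c$. The sketch below does that in Python. The function name is hypothetical, and the result only holds under the same ideal, random-fragment assumptions as the formula itself.

```python
import math


def coverage_for_completeness(target_fraction):
    """Solve 1 - e^(-c) = target_fraction for c, the fold coverage required."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    return -math.log(1.0 - target_fraction)
```

Asking for 99.99% with `coverage_for_completeness(0.9999)` returns a little over nine, which is where the nine-fold rule of thumb comes from.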

These calculations represent ideal conditions. What happens in a real lab?

We published a white paper a few years ago in which we modified the Lander-Waterman calculations so that we could predict the number of reads that would be required under real laboratory conditions (3).

In a lab, when you make libraries for cloning, you don't get libraries of perfectly sized, perfect clones. You get mixed collections: good clones alongside clones that are vector, E. coli, or short inserts. And of course, you don't know which clones are which until after you've started sequencing them. Some labs will sample different libraries at the beginning of a sequencing project, so that they can estimate the number of clones that are vector, E. coli, or contain short inserts.

In our paper (3), we derived a new equation, starting from the Lander-Waterman calculations, to take into account common artifacts such as short inserts, contamination with vector or E. coli DNA, or reads with poor quality data. This gives a more realistic estimate of the number of reads necessary to complete a sequencing project. Here are the pieces:

• Rn represents the number of reads required to complete a project.
• C is the coverage depth (from the Lander-Waterman table).
• T is the length in nucleotides of the target DNA.
• rL is defined as the average length of a read as measured by the number of bases with Phred quality values greater than 20 (learn about Phred).
• Pf represents the pass rate, or the fraction of reads that pass a defined set of quality standards. We use Pf to exclude subclones that correspond to E. coli, vector sequences, and short inserts (defined here as reads that include both the 5' and 3' ends of a clone).
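The equation itself is not reproduced in the text here, so the sketch below assumes the simplest form consistent with the definitions above: the total bases needed (C × T) divided by the usable bases each attempted read delivers (rL × Pf), that is, Rn = (C × T) / (rL × Pf). Treat this as my reading of the pieces, not necessarily the exact equation in the white paper. The function name and the example numbers are hypothetical.

```python
import math


def reads_required(coverage, target_length, read_length, pass_rate):
    """Assumed form: Rn = (C * T) / (rL * Pf), rounded up to a whole read.

    coverage      -- C, fold coverage depth taken from the Lander-Waterman table
    target_length -- T, length of the target DNA in nucleotides
    read_length   -- rL, average number of bases per read with Phred quality > 20
    pass_rate     -- Pf, fraction of reads that pass quality standards (0 < Pf <= 1)
    """
    if read_length <= 0 or not 0.0 < pass_rate <= 1.0:
        raise ValueError("read_length must be positive and pass_rate in (0, 1]")
    return math.ceil(coverage * target_length / (read_length * pass_rate))
```

With a pass rate of 1.0 this reduces to the ideal Lander-Waterman case. Lower the pass rate to 0.8 and you need 25% more reads for the same coverage.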

Since we built all of these statistical measurements into the Finch® System (my company, Geospiza, has spent ten years working on it, so I have to write about it somewhere), we thought it would be interesting to look at the effect of read length and pass rate on the number of reads needed to complete a project and the project cost.

We determined the number of reads and the cost of sequencing a 4,000,000 base pair genome with nine-fold coverage and an average cost of \$2 per read (this might be cheaper today, I don't know, but it probably depends quite a bit on where you're getting the sequencing done).
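Here is a rough Python sketch of that kind of comparison. It uses the genome size, coverage, and cost per read from the scenario above, but the read lengths and pass rates in the loop are illustrative values I picked, not figures from the white paper, and the read-count formula is the assumed form Rn = (C × T) / (rL × Pf).

```python
import math


def project_cost(target_length, coverage, read_length, pass_rate, cost_per_read):
    """Return (reads needed, total cost) under the assumed read-count formula."""
    reads = math.ceil(coverage * target_length / (read_length * pass_rate))
    return reads, reads * cost_per_read


if __name__ == "__main__":
    # 4,000,000 bp genome, nine-fold coverage, $2 per read (from the text).
    # Read lengths and pass rates below are hypothetical examples.
    for read_length in (400, 550, 700):
        for pass_rate in (0.6, 0.8, 1.0):
            reads, cost = project_cost(4_000_000, 9, read_length, pass_rate, 2.0)
            print(f"{read_length:>4} bases, pass rate {pass_rate:.0%}: "
                  f"{reads:>7,} reads, ${cost:,.0f}")
```

The thing to notice is that read length and pass rate both sit in the denominator, so a library with short reads and a poor pass rate gets expensive quickly.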

For some perspective on genome size, the reference E. coli genome (NC_008253) at the NCBI is 4,938,920 base pairs in size. Typical read lengths have gone up, but when the human genome was sequenced, the average read length at Celera was 545 bases (4), and the human genome is about 3 billion bases (the haploid size).

So, there you have it. The cost of sequencing is going down, but it's still a bit of work. If you're going to sequence E. coli, you should plan to generate at least 50,000 reads. And, if you should need software to keep track of them all, well, just like Tiggers, that's what we do best.

References:
1. E. Green. 2001. Strategies for the systematic sequencing of complex genomes. Nature Reviews Genetics 2:573-583.

2. Lander, E., Waterman, M. 1988. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 3:231-239.

3. Porter, S., Slagel, J., and T. Smith. 2004. Analysis of Genomic DNA Library Quality with the Finch®-Server. Geospiza, Inc. You can download the paper as a pdf document from here: http://www.geospiza.com/research/white-papers.htm
Look in the middle of the page.

4. Venter, J., et al. 2001. The Sequence of the Human Genome. Science 291:1304-1351.