How to win the X PRIZE in genomics
In October 2006, the X PRIZE Foundation announced that the second X PRIZE would focus on genomics. The first team to successfully sequence 100 human genomes in 10 days will win $10 million.
I would also venture to guess that the winning team will win in the IP (intellectual property) game and the genetic testing market, since they will gain an unprecedented look at genetic variation.
But when is done really done?
The first trick is defining what it means to be done. My husband says that “a sequencing project is done when the people who are doing it say that it is done.”
How very true.
The human genome project was completed when the National Human Genome Research Institute (NHGRI) announced that it was complete.
Does this mean that every base in the human genome was identified?
It meant that the NHGRI decided that a significant fraction of the parts that they said they were going to sequence had been sequenced. It was good enough.
I’m not sure how the X PRIZE Foundation is defining “done,” but anyone competing for this prize will need to know that definition in order to calculate the depth of read coverage that they will need. I covered this in a previous installment, but for a quick number, they will need to sequence the same region at least 7 times, on average, to expect to have sequenced 99.9% of the genome. To put this into perspective, even then, out of 3 billion (3 x 10^9) bases, roughly 3 million (3 x 10^6) bases would NOT be “done.”
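To see where a number like 7 comes from, here is a minimal Python sketch. It assumes reads land at random positions, so that the chance a base is missed at average coverage C is e^-C (the usual Poisson, or Lander-Waterman, approximation). That model and the function names are my assumptions for illustration, and real libraries are never perfectly random.

```python
import math

GENOME_SIZE = 3_000_000_000  # roughly 3 billion bases in a human genome


def fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read.

    Assumes randomly placed reads (Poisson model): P(missed) = e^-coverage.
    """
    return 1.0 - math.exp(-coverage)


def expected_missed_bases(genome_size, coverage):
    """Expected number of bases not covered by any read under the same model."""
    return genome_size * math.exp(-coverage)


if __name__ == "__main__":
    for c in (1, 3, 5, 7, 10):
        covered = fraction_covered(c)
        missed = expected_missed_bases(GENOME_SIZE, c)
        print(f"{c:>2}x coverage: {covered:.5f} covered, about {missed:,.0f} bases missed")
```

Under this model, 7x is the first whole-number coverage that crosses 99.9%, and it still leaves a few million bases untouched.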
What do the X PRIZE contestants have to consider if they are to win?
Once they’ve defined what it means to be “done,” the contestants have to consider the variables that affect the number of reads that need to be sequenced. Unless they have a really cheap technology or unlimited funds, they will need to know how to reduce the number of reads.
The formula that we derived (1) for calculating the number of reads (Rn) is this:
Rn = (T x C) / (rL x Pf)
Once a contestant knows what it means to be done, the value of the numerator is fixed. T (the size of the genome) is the same in every human (well, at least within the same sex), and C is the coverage depth.
In order to reduce the number of reads, you must either get longer reads (increase rL) or increase the fraction of reads that are high quality (Pf = passing fraction). In previous installments, I wrote about some types of reads that wouldn’t pass muster (non-random reads, chimeras, E. coli, vector).
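The relationship described above can be sketched in a few lines of Python. This assumes the formula has the form Rn = (T x C) / (rL x Pf), which is what the description of the numerator and denominator implies. The function name and the example values for read length and passing fraction are hypothetical, chosen only to show how the terms trade off.

```python
import math


def reads_needed(genome_size, coverage, read_length, passing_fraction):
    """Estimate the number of reads, Rn = (T x C) / (rL x Pf).

    genome_size      -- T, bases in the target genome
    coverage         -- C, desired average depth of coverage
    read_length      -- rL, average usable read length in bases
    passing_fraction -- Pf, fraction of reads that pass quality checks (0 to 1]
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 < passing_fraction <= 1:
        raise ValueError("passing_fraction must be in (0, 1]")
    return math.ceil(genome_size * coverage / (read_length * passing_fraction))


if __name__ == "__main__":
    T = 3_000_000_000  # human genome, roughly
    C = 7              # coverage depth discussed in the text
    # Hypothetical combinations of read length and passing fraction
    for read_length, passing in ((750, 0.5), (750, 1.0), (300, 0.5), (300, 1.0)):
        n = reads_needed(T, C, read_length, passing)
        print(f"rL={read_length:>3}  Pf={passing:.1f}  ->  {n:,} reads")
```

Because rL and Pf sit together in the denominator, halving the read length costs exactly as many extra reads as halving the passing fraction.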
Today, I want to show you what happens when reads are short.
When reads are short, much of the information that’s generated from sequencing is useless. The data might confirm other data, but it doesn’t help us put the larger sequence together. This can be seen in the image below.
[Figure: Trying to assemble sequences from short reads]
Restriction enzymes make short reads.
If the data that I presented in earlier installments haven’t yet convinced you that making a genomic library with restriction enzymes is a bad idea, I have more data to show that RE libraries produce clones with, gasp, SHORT READS!
We used the Finch® Suite to look at the sizes of clones from our two RE libraries. One had been made by digesting genomic DNA with AseI, and the other had been made with DraI. In the Finch Suite, we have an algorithm that identifies DNA sequences that match those from common vectors. We use the positions of the vector sequences at the 5′ and 3′ ends of a read to determine the length of an insert.
It turned out that approximately half of the clones from the RE libraries contained fragments with vector sequences on both the 5′ and 3′ ends of the insert (48% for AseI and 50% for DraI). In other words, a single read ran all the way through the insert and out the other side. This might not have been a problem if long reads were obtained, but our data (graphed below) showed that none of the reads were longer than 750 bases.
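To make the idea of measuring an insert from its vector flanks concrete, here is a toy Python sketch. It is not the Finch Suite algorithm, which I am not reproducing here. It assumes exact string matches to two short, made-up flank sequences, whereas a real tool would use alignment that tolerates sequencing errors. The function name and the sequences are hypothetical.

```python
def insert_length(read, left_vector, right_vector):
    """Return the insert length if vector is found on both sides, else None.

    read         -- the base calls for one read, 5' to 3'
    left_vector  -- vector sequence expected upstream of the insert
    right_vector -- vector sequence expected downstream of the insert
    Uses exact matching only; this is a simplification for illustration.
    """
    left = read.find(left_vector)
    if left == -1:
        return None  # no upstream vector found
    insert_start = left + len(left_vector)
    right = read.find(right_vector, insert_start)
    if right == -1:
        return None  # read ended before reaching the downstream vector
    return right - insert_start


if __name__ == "__main__":
    # Made-up flanks and insert, for illustration only
    left_flank, right_flank = "GGATCC", "GAATTC"
    short_clone = left_flank + "ACGTACGTAA" + right_flank + "TTTT"
    long_clone = left_flank + "ACGT" * 50  # read ends inside the insert

    print(insert_length(short_clone, left_flank, right_flank))
    print(insert_length(long_clone, left_flank, right_flank))
```

A read that returns a length has been read clear through, so the clone holds no more sequence than that. A read that returns None ended before reaching the far vector, so the insert is at least as long as the read.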
[Figure: Making genomic libraries from restriction enzymes makes lots of short reads]
But what about 454? Aren’t they one of the contestants? And don’t they get really short reads?
454 is one of the contestants in the X PRIZE race. Their technology is described in a very nice Flash animation (Pyrosequencing from 454).
But their sequencing instruments only get “reads” that are about 300 bases long. How do they address this issue of read length?
I can think of a few things that they do that help them out. First, they use a nebulizer to break the DNA up at random positions. Second, their method of diluting the sample until they are sequencing single molecules enables them to obtain high-quality sequences. Third, since they don’t need to clone DNA, they don’t have to cope with reads that are all vector, or E. coli, or chimeras. All of those steps increase the fraction of passing reads (Pf) and help compensate for a shorter read length (rL).
I’m not sure, though, how their technology can get around the last challenge that we will discuss with DNA sequencing: repetitive DNA.
But then, I don’t think the X PRIZE foundation has a religious view of technology. You can probably use multiple strategies, as long as you get the genome sequences done.
1. Porter, S., Slagel, J., and T. Smith. 2004. Analysis of Genomic DNA Library Quality with the Finch®-Server. Geospiza, Inc. Available as a PDF from http://www.geospiza.com/research/white-papers.htm (look in the middle of the page).
Read the whole series:
Part I: Introduction
Part II: Sequencing strategies
Part III: Reads and chromats
Part IV: How many reads does it take?
Part V: Checking out the library
Part VI: Chimeras are not just funny-looking animals