A couple of weeks ago I suggested that the National Human Genome Research Institute (NHGRI) would no longer be funding de novo genome sequencing projects via white papers. They appear to be shifting their focus to resequencing projects to study variation (eg, this) and to taking a closer look at well studied organisms (eg, ENCODE, which now has Drosophila and Caenorhabditis versions). But the distribution of genomic resources is extremely biased towards a few species. What should those researchers who work on organisms without genome sequences do if they can't solicit funds from the NHGRI?
As pointed out in the comments of my previous post, de novo sequencing of microbial genomes is still going strong. This research is heavily funded by the Joint Genome Institute (JGI), so a change in focus at the NHGRI won't affect microbiologists. In fact, this is pretty much true for anyone interested in sequencing a small genome (bacterial, archaeal, or eukaryotic). A moderate research grant from JGI or the National Science Foundation (NSF) would fund a de novo whole genome sequencing project for many (most?) species (sorry for the bias toward US funding agencies, but those are the ones with which I am familiar). One recent example (from outside the US) is reported in this paper from Ken Wolfe's group in which they analyze gene loss following whole genome duplication in a yeast species that diverged from S. cerevisiae soon after a whole genome duplication event. The authors were able to sequence and analyze a new yeast genome (Kluyveromyces polysporus) in such a matter-of-fact way because of the small size of these yeast genomes and the availability of multiple other yeast genomes for comparison. In fact, it's not obvious from the title of the paper that they are reporting the whole genome sequence of a new species -- that's how ho-hum de novo sequencing projects have become in some taxa.
Resequencing projects and moderately funded de novo whole genome sequencing projects are only possible for species from well studied taxa or those with small genomes. How should one go about soliciting funds to sequence the genome of a poorly studied species or one with a large genome (presumably due to a large amount of repetitive DNA)? Some may argue that next generation sequencing technologies (eg, 454 or Solexa) will allow for de novo sequencing of whole eukaryotic genomes in the near future. I disagree. These new technologies can be used for resequencing of large genomes (for instance, James Watson) or sequencing genomes from well studied taxa for which there are closely related species with available whole genome sequences to aid in the assembly (the Neanderthal projects come to mind). But the read lengths generated from 454 and Solexa simply aren't long enough to go after a large genome from a species for which there is not available sequence from a close relative to aid in assembly.
Given that the funds are not available for de novo shotgun sequencing of large genomes via the Sanger technique, and we're not quite to a point where the new technologies can replace the Sanger method, what can we do to study the genomes of these "non-sequencable" species? My solution: use 454 to sequence cDNA libraries. First of all, this isn't really my solution, but one that I have been hearing from a few different labs that work on "non-sequencable" species. Second of all, yes, I realize that this is not a great substitute for a whole genome project. A sequenced cDNA library will only provide information about the transcribed sequences in a genome (and the annotation will be biased toward protein coding sequences), but this is far superior to no sequence. Furthermore, funding for a sequenced cDNA library for a single species can be incorporated in a modest grant proposal, and said proposal could even include funds to sequence multiple individuals from a species or multiple closely related species.
With a large fraction of protein coding sequences, a fair bit of analysis can be performed. The scope of a study can be greatly improved if polymorphism data or sequences from closely related species are generated. From what I have heard, the cost of sequencing a cDNA library using 454 is not much more than $10,000, so comparative transcriptomics is quite feasible without much of an investment. Additionally, these sequences would provide a valuable resource for researchers whose study organisms lack good sequences from which to develop molecular markers. A moderate front end investment to sequence an entire cDNA library for a poorly studied organism could yield substantial benefits for future molecular work in that species.
Keep in mind, I do not propose that 454 sequencing of cDNA libraries is the finish line for "non-sequencable" species. Rather, we are currently caught in a holding pattern during which the funding and technology are not available for de novo whole genome sequencing of these species. Once the cost and quality of next generation sequencing technologies reach the appropriate levels to allow for sequencing of whole genomes from any and all species, then we can make the switch to those technologies. Until then, sequencing cDNA libraries offers an affordable alternative that lies within the budgetary constraints of many research labs.
While I agree that cDNA library sequencing could be a useful low-cost technique, I disagree with your assessment of the next-gen technologies.
454 reads are already pushing 250 base pairs, and I've heard estimates that the technology will top out at about 500 bp read lengths. If so, that would likely be sufficient length for new genome assembly. If I remember right, Sanger sequencing was just north of 500 bp read lengths during much of the Human Genome Project.
Even if Sanger sequencing can't be completely replaced, there's no reason why a hybrid method can't bring the cost down considerably. Instead of doing 3-6x Sanger coverage, you'd shoot for 1x Sanger coverage, which would be just enough to provide the scaffolds for anchoring next-gen reads, which provide your sequencing depth. If the genome requires finishing, you'd then do some targeted sequencing to close the gaps (just like they do now).
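To put rough numbers on the coverage levels mentioned above, here is a minimal back-of-the-envelope sketch. It assumes the idealized Lander-Waterman (Poisson) model, in which reads land uniformly at random and repeats are ignored, so it is optimistic for large repetitive genomes. The 100 Mb genome size and the function names are hypothetical, chosen only for illustration; the 500 bp read length is the approximate Sanger figure quoted in this thread.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Reads required to reach a target average coverage (coverage = N * L / G)."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_fraction_uncovered(coverage):
    """Poisson estimate of the fraction of bases not hit by any read."""
    return math.exp(-coverage)


GENOME_SIZE_BP = 100_000_000  # hypothetical 100 Mb genome, not from the post
SANGER_READ_BP = 500          # approximate Sanger read length cited in the thread

for coverage in (1, 3, 6):
    n_reads = reads_needed(GENOME_SIZE_BP, SANGER_READ_BP, coverage)
    missed = expected_fraction_uncovered(coverage)
    print(f"{coverage}x Sanger: {n_reads:,} reads, "
          f"{missed:.1%} of bases expected to be unsampled")
```

Under this model, 1x coverage leaves roughly 37% of bases unsampled by Sanger reads, compared with about 5% at 3x, so whether 1x is enough for scaffolding would depend on how well the next-gen reads fill in the remainder.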
It's not just the short read length, Chris -- it's the lack of paired reads that really sinks 454 as a practical de novo technique. That being said, yes, there is hope for hybrid Sanger/454 projects (I've read several grant proposals which planned to do such a thing, although none went so far as only 1x Sanger coverage).
Chris, even if you use a hybrid method, how many individuals can you sequence on a single grant? Will the entire grant be dedicated to sequencing the genome of a single species? What would the hybrid method cost? The simple fact that you can sequence a cDNA library so cheaply makes that type of project very enticing from a return-on-investment angle. You can do some really interesting science (and improve the study of a particular taxon greatly) using just cDNA.
At the American Society for Microbiology meeting I discussed the 454 with a sales rep. He said they were routinely getting reads of 300 bp, and that for microbes (bacteria at least) you could do two different species simultaneously for a cost of about $10,000 in reagents (the machine, however, costs half a million). For microbiologists, at least, the days of writing a grant to sequence your bug are over. It's a minor expense.
It won't be long before the same will be true of larger genomes.
Well, the rep was doing what a rep does best -- hype his product. In no way is pure 454 ready for general de novo sequencing even of bacterial genomes, and the grant proposals I've read reflect that -- hybrid 454/Sanger projects are the "in thing" this year, and of course 454 is great for resequencing another strain of anthrax or something, but as far as I can tell, nobody outside the 454 sales department is seriously promoting de novo genome sequencing without any help from traditional Sanger methods.
How about this?
They got considerably shorter reads than what the sales rep told me (only around 100 bp) but managed to generate the bulk of the chromosome de novo. I've only skimmed the paper, but it sounds like they got stuck on the rDNA repeats.
What do you think, Jonathan?
Can't read the article (we don't subscribe and the paper isn't Open Access), but if they really assembled it without help from the existing A. baumannii sequences (not clear from the abstract), it is indeed impressive.
Personally I think 454/Roche sequencing is going to lower the cost of sequencing a microbial genome, but if you want to actually end up with a finished, high quality genome sequence, you still need Sanger/ABI sequencing to get the job done. It is not just the paired end issue but also sequencing quality. But nevertheless, with all the new methods, the cost of de novo sequencing a bacterial/archaeal genome will come down substantially in the next year or two.
In addition, certainly, some type of genome scanning where one has a reference genome and one is trying to identify major differences in a close relative is already reasonably cheap for small genomes. But this notion that we have done the important organisms and now we must move on to comparative sequencing is completely silly. For bacteria and archaea, we have not come close to sampling diversity in terms of genome sequences. It is why we are starting a Genomic Encyclopedia project at JGI (I will write more about this on my blog but in short - we will sequence 100 bacterial/archaeal genomes this year selected primarily by their phylogenetic distance from other genomes). And for eukaryotes this is desperately needed too. The diversity of euks is barely touched by genome sequencing. If NHGRI wants to move on to other things, fine, but there is still great value in getting high quality genomes from 100s if not 1000s of species of euks. And I am not convinced that the new methods really help much in this regard. So maybe when all the NHGRI centers ditch their ABI machines, we can make a new center to do de novo sequencing only using the cast off machines.
I can't comment on the sequencing technology, but I do have a soft spot for cDNAs. I think there's still a lot more to learn from even the ones we have sequenced.
One issue, though, is spatio-temporal expression. Maybe not so much in a few-celled organism, but for anything with tissue types and a bunch of developmental or stress/wounding stages there's so much you could miss. And I think there's a lot to learn there still.