A couple of weeks ago I suggested that the National Human Genome Research Institute (NHGRI) would no longer be funding de novo genome sequencing projects via white papers. They appear to be shifting their focus to resequencing projects to study variation (eg, this) and to taking a closer look at well-studied organisms (eg, ENCODE, which now has Drosophila and Caenorhabditis versions). But the distribution of genomic resources is extremely biased toward a few species. What should researchers who work on organisms without genome sequences do if they can’t solicit funds from the NHGRI?
As pointed out in the comments of my previous post, de novo sequencing of microbial genomes is still going strong. This research is heavily funded by the Joint Genome Institute (JGI), so a change in focus at the NHGRI won’t affect microbiologists. In fact, this is pretty much true for anyone interested in sequencing a small genome (bacterial, archaeal, or eukaryotic). A moderate research grant from JGI or the National Science Foundation (NSF) would fund a de novo whole genome sequencing project for many (most?) species (sorry for the bias toward US funding agencies, but those are the ones with which I am familiar). One recent example (from outside the US) is reported in this paper from Ken Wolfe’s group, in which they analyze gene loss in a yeast species that diverged from S. cerevisiae soon after a whole genome duplication event. The authors were able to sequence and analyze a new yeast genome (Kluyveromyces polysporus) in such a matter-of-fact way because of the small size of these yeast genomes and the availability of multiple other yeast genomes for comparison. In fact, it’s not obvious from the title of the paper that they are reporting the whole genome sequence of a new species; that’s how ho-hum de novo sequencing projects have become in some taxa.
Resequencing projects and moderately funded de novo whole genome sequencing projects are only possible for species from well-studied taxa or those with small genomes. How should one go about soliciting funds to sequence the genome of a poorly studied species or one with a large genome (presumably due to a large amount of repetitive DNA)? Some may argue that next-generation sequencing technologies (eg, 454 or Solexa) will allow for de novo sequencing of whole eukaryotic genomes in the near future. I disagree. These new technologies can be used for resequencing large genomes (for instance, James Watson’s) or for sequencing genomes from well-studied taxa in which closely related species with whole genome sequences are available to aid in the assembly (the Neanderthal projects come to mind). But the reads generated by 454 and Solexa simply aren’t long enough to go after a large genome from a species for which no sequence from a close relative is available to aid in assembly.
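To make the read length point a bit more concrete, here is a toy sketch in Python. It is not an assembler and says nothing about real 454 or Solexa data; the function names, the 50-base repeat, the three copies, and the random spacer sequence are all made up for illustration. It simply counts what fraction of error-free reads of a given length occur at more than one place in a small repeat-containing "genome", and so cannot be placed without outside help such as a reference from a close relative.

```python
import random
from collections import Counter


def ambiguous_read_fraction(genome: str, read_len: int) -> float:
    """Fraction of error-free reads of length read_len whose sequence
    occurs at more than one position in the genome."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] > 1) / len(reads)


def random_dna(rng: random.Random, length: int) -> str:
    """Random sequence standing in for unique, non-repetitive DNA."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def toy_genome(repeat_len: int = 50, copies: int = 3,
               spacer_len: int = 100, seed: int = 0) -> str:
    """Hypothetical genome: identical repeat copies separated by random spacers."""
    rng = random.Random(seed)
    repeat = random_dna(rng, repeat_len)
    parts = [random_dna(rng, spacer_len)]
    for _ in range(copies):
        parts.append(repeat)
        parts.append(random_dna(rng, spacer_len))
    return "".join(parts)


genome = toy_genome()
for read_len in (25, 40, 60):
    # Reads shorter than the repeat can fall entirely inside it and become ambiguous.
    print(read_len, round(ambiguous_read_fraction(genome, read_len), 3))
```

In this toy setup, reads shorter than the repeat can land entirely inside it and match every copy, while reads longer than the repeat always carry some flanking sequence and should almost always be unique. Real large genomes have repeats far longer and more numerous than this, which is the heart of the problem.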
Given that the funds are not available for de novo shotgun sequencing of large genomes via the Sanger technique, and that we’re not quite at the point where the new technologies can replace the Sanger method, what can we do to study the genomes of these “non-sequencable” species? My solution: use 454 to sequence cDNA libraries. First of all, this isn’t really my solution, but one that I have been hearing from a few different labs that work on “non-sequencable” species. Second of all, yes, I realize that this is not a great substitute for a whole genome project. A sequenced cDNA library will only provide information about the transcribed sequences in a genome (and the annotation will be biased toward protein-coding sequences), but this is far superior to no sequence at all. Furthermore, funding to sequence a cDNA library from a single species can be incorporated into a modest grant proposal, and that proposal could even include funds to sequence multiple individuals from a species or multiple closely related species.
With a large fraction of a species’ protein-coding sequences in hand, a fair bit of analysis can be performed. The scope of a study can be greatly expanded if polymorphism data or sequences from closely related species are generated as well. From what I have heard, the cost of sequencing a cDNA library using 454 is not much more than $10,000, so comparative transcriptomics is quite feasible without much of an investment. Additionally, these sequences would provide a valuable resource for researchers whose study organisms lack good sequences from which to develop molecular markers. A moderate up-front investment to sequence an entire cDNA library for a poorly studied organism could yield substantial benefits for future molecular work in that species.
Keep in mind that I am not proposing 454 sequencing of cDNA libraries as the finish line for “non-sequencable” species. Rather, we are currently caught in a holding pattern in which neither the funding nor the technology is available for de novo whole genome sequencing of these species. Once the cost and quality of next-generation sequencing technologies reach the levels needed to sequence whole genomes from any and all species, we can make the switch to those technologies. Until then, sequencing cDNA libraries offers an affordable alternative that fits within the budgetary constraints of many research labs.