One of the exciting things about bacterial genomics is that, within a year, we’ll definitely be in the era of the $1,500 bacterial genome, although that’s probably an overestimate. This cost includes everything: labor, sequencing, genome assembly, and genome annotation. While sequencing is highly automated and has been turned into a factory-like production process, high-quality genome annotation, until very recently, has not. Automated software gets about 95% of gene calls right, but the other 5% differ depending on the algorithm used.
Unfortunately, this means that, instead of using computers, we need to use humans. Humans suck: we’re very slow, we’re sloppy, and we’re subjective.
This sorry state of affairs exists, in large part, because there hasn’t been a need to fix it–until about a year ago, annotation (figuring out which parts of the genome encode genes and what those genes might be) wasn’t the rate-limiting step. Now it is.
This has led to various groups using automated gene prediction software. Unfortunately, not all groups have procedures in place to catch erroneous gene calls, and so bad gene calls are made. I’m not talking about subtle differences here, but really obvious stuff that doesn’t pass the interocular test* (E. coli genomes chock full of 12-amino-acid-long proteins). This is bad for people who want to pull these genomes out of GenBank and study them. But it also screws up gene identification in new genomes, because high-quality gene prediction processes rely on both ab initio rules and previously identified genes from other genomes.
So, garbage in, garbage out. It’s reached the point, for at least one project I’m involved in, where we’ve had to create an ‘embargoed’ list of genes that has been locked down, since there are a lot of crap gene calls out there. That might work for some species where we have a lot of genomes and other information, but it will be less successful for organisms that have relatively few genomes. With the flood of data coming (present and future), we have to move to systems where human oversight is limited (unless we want to have years of annotation backlog). We really need to clean out GenBank, but I’m not sure how to go about doing that.
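For what it’s worth, the crudest version of an automated interocular test is easy to sketch. The Python below is only an illustration, not anything we actually run: the GeneCall record, the function names, and the 30 amino acid cutoff are all made up for the example. It flags predicted proteins that are suspiciously short and reports what fraction of a genome’s gene calls they make up.

```python
from typing import Iterable, List, NamedTuple


class GeneCall(NamedTuple):
    """One predicted gene (hypothetical record type for this sketch)."""
    locus_tag: str
    protein: str  # translated amino acid sequence


# Assumed cutoff for illustration only; genuine short proteins exist,
# so anything below it is flagged for review, not deleted.
MIN_PROTEIN_LENGTH = 30


def flag_short_calls(calls: Iterable[GeneCall],
                     min_length: int = MIN_PROTEIN_LENGTH) -> List[GeneCall]:
    """Return the gene calls whose protein is shorter than min_length."""
    # Strip a trailing stop-codon symbol so it does not count toward length.
    return [c for c in calls if len(c.protein.rstrip("*")) < min_length]


def short_call_fraction(calls: Iterable[GeneCall],
                        min_length: int = MIN_PROTEIN_LENGTH) -> float:
    """Fraction of gene calls in a genome that fall below min_length."""
    calls = list(calls)
    if not calls:
        return 0.0
    return len(flag_short_calls(calls, min_length)) / len(calls)


if __name__ == "__main__":
    genome = [
        GeneCall("gene_0001", "M" + "A" * 11),   # 12 amino acids: suspicious
        GeneCall("gene_0002", "M" + "A" * 299),  # 300 amino acids: plausible
    ]
    for call in flag_short_calls(genome):
        print(call.locus_tag, len(call.protein))
    print(short_call_fraction(genome))
```

A genome where a large fraction of calls gets flagged is probably worth a human look before anyone uses it as a reference. This catches only the really obvious stuff; it does nothing about the subtle differences between algorithms.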
An aside: An analogous problem exists with 16S rRNA databases: unless they’re manually curated, a significant fraction of sequences are PCR artifacts.
*Interocular test–hits you right between the eyes…