Genome Annotation: There's Too Much Crap in GenBank

One of the exciting things about bacterial genomics is that, within a year, we'll be in the era of the $1,500 bacterial genome, and even that is probably an overestimate. That cost includes everything: labor, sequencing, genome assembly, and genome annotation. While sequencing is highly automated and has been turned into a production process, akin to a factory, high-quality genome annotation, until very recently, has not. Automated software gets about 95% of gene calls right, but the other 5% differs depending on the algorithm used.

Unfortunately, this means that, instead of using computers, we need to use humans. Humans suck: we're very slow, we're sloppy, and we're subjective.

This sorry state of affairs exists, in large part, because there hasn't been a need to fix it--until about a year ago, annotation (figuring out which parts of the genome encode genes and what those genes might be) wasn't the rate-limiting step. Now it is.

This has led to various groups using automated gene prediction software. Unfortunately, not all groups have procedures in place to catch erroneous gene calls, and so bad gene calls are made--I'm not talking about subtle differences here, but really obvious stuff that doesn't pass the interocular test* (E. coli genomes chock full of 12-amino-acid-long proteins). This is bad for people who want to pull these genomes out of GenBank and study them, but it also screws up gene identification in new genomes: high-quality gene prediction pipelines rely both on ab initio rules and on previously identified genes from other genomes.
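As a rough illustration of the kind of sanity check I mean, here's a minimal sketch (using Biopython; the 30-residue cutoff and the file name "genome.gbk" are made up for illustration, not a community standard) that flags suspiciously short protein calls in a GenBank file:

```python
# Minimal sketch: flag suspiciously short CDS calls in a GenBank file.
# Assumes Biopython is installed; cutoff and file name are illustrative.
from Bio import SeqIO

MIN_PROTEIN_LENGTH = 30  # anything shorter is worth a human look


def flag_short_cds(genbank_path):
    """Yield (locus_tag, length) for CDS features with tiny translations."""
    for record in SeqIO.parse(genbank_path, "genbank"):
        for feature in record.features:
            if feature.type != "CDS":
                continue
            translation = feature.qualifiers.get("translation", [""])[0]
            if translation and len(translation) < MIN_PROTEIN_LENGTH:
                locus = feature.qualifiers.get("locus_tag", ["unknown"])[0]
                yield locus, len(translation)


if __name__ == "__main__":
    for locus, length in flag_short_cds("genome.gbk"):
        print(f"{locus}: only {length} aa -- check this call")
```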

So, garbage in, garbage out. It's reached the point, for at least one project I'm involved in, where we've had to create an 'embargoed' list of genes that has been locked down, since there are so many crap gene calls. That might work for some species where we have a lot of genomic and other information, but it will be less successful for organisms that have relatively few genomes. With the flood of data coming (present and future), we have to move to systems where human oversight is limited (unless we want years of annotation backlog). We really need to clean out GenBank, but I'm not sure how to go about doing that.
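The embargo itself is conceptually simple; something like this sketch (the file layout and function names are made up for illustration) captures the idea of never letting a new automated run overwrite a locked-down call:

```python
# Minimal sketch of the 'embargoed list' idea: calls on the locked-down list
# are never overwritten by a new automated annotation run. File format and
# names are assumptions for illustration only.

def load_embargoed(path):
    """Return the set of locus tags whose annotations are locked."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def merge_annotations(old_calls, new_calls, embargoed):
    """Prefer new automated calls, except for embargoed loci."""
    merged = dict(new_calls)
    for locus, annotation in old_calls.items():
        if locus in embargoed:
            merged[locus] = annotation  # keep the curated call
    return merged
```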

An aside: An analogous problem exists with 16S rRNA databases: unless they're manually curated, a significant fraction of sequences are PCR artifacts.

*Interocular test--hits you right between the eyes...


I don't know if this is any consolation, but the student annotation projects, where students do functional and structural annotation of bacterial genomes, have a pretty good track record.

I wrote about some of these in Genome Technology. I think having larger groups of people involved, and having faculty and database curators focus on quality control and error-checking, would help move things along much more quickly.

In my youth L. J. Savage told me about the "best statistical test in the world, the Interocular Traumatic Method." ;) It's nice to see the term still in use.

I think manual annotation is becoming obsolete. Except for a few model organisms, I think we should try to move to (1) evidence-based annotation based on RNA sequencing and/or mass spec for a significant subset of genomes, and (2) for other genomes, throw out all the old annotations and continuously replace them with new automated ones as the tools improve. I think RefSeq is actually doing some of #2 but I'm not sure how comprehensive it is. I haven't heard of any systematic efforts to do #1 but I think the costs are becoming reasonable.
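For #1, the check itself doesn't have to be fancy. Something along the lines of this sketch (assuming pysam, a sorted and indexed RNA-seq BAM aligned to the same assembly, and an arbitrary read-count threshold; the file name and coordinates are made up) would flag predicted genes with no transcriptional support:

```python
# Minimal sketch of an RNA-seq evidence check: does a predicted gene have any
# read coverage? Assumes pysam and a sorted, indexed BAM against the same
# assembly; the threshold is arbitrary.
import pysam

MIN_READS = 10  # below this, treat the prediction as unsupported


def unsupported_genes(bam_path, gene_coords):
    """gene_coords: iterable of (locus_tag, contig, start, end), 0-based half-open."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    for locus, contig, start, end in gene_coords:
        if bam.count(contig, start, end) < MIN_READS:
            yield locus
    bam.close()


# Illustrative usage with made-up coordinates:
# for locus in unsupported_genes("rnaseq.bam", [("gene0001", "chr", 100, 500)]):
#     print(locus, "has little or no transcriptional evidence")
```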

In this view of the future, manual intervention will mostly be useful for reporting errors when a biologist stumbles across one, so that the tools improve, and for improvements to databases with putative gene functions (e.g., KEGG, SEED, MetaCyc).

So on the previous post we were worrying about the lack of new jobs, and on this one we worry about not having enough folks out there annotating... We just have to convince whoever has the gold that this is something they should support.

By MissouriMule on 07 Jan 2010

My own take on this is that GenBank (as opposed to RefSeq) has always been this way, essentially by design in many ways. (I'm referring to the principle that responsibility for maintaining entries lies with whoever submitted them.) Sifting out the rubbish (garbage) has always been part of using GenBank.

Grant, comment 6, has it right. GenBank is an archival resource. Thus, the general idea at NCBI is not to change the archive; each entry is a snapshot of the data at the time it was entered. There is a growing challenge, though, in that it is not clear when items in the archive should change because they are either completely incorrect or in need of updating. In part this is related to the historical view that the scientists submitting the data were its authors, and thus its owners, who had to approve any changes. This view gets blurry as that world shifts toward higher data production, community-developed knowledge, and automated annotation.

To Morgan's point (comment 4): evidence-based annotation costs money. But I'd ask which costs more: manual, labor-based curation, or automated systems that can update annotations with the latest evidence? The real challenge is updating the annotations. In the human genome world it took UCSC about 8 months to move from build 36 to build 37 (hg18 to hg19) because of the work involved in re-annotating all of the tracks to the new coordinates.

What about a wiki-based model? RNA sequencing has also become an effective way of producing evidence-based annotation. What if the annotation had different tracks, each track based on a different method: ab initio algorithms, RNA sequencing, transcription factor binding, ...
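As a rough sketch of what I mean by tracks (the field names, track labels, and example values are just illustrative, not an existing schema):

```python
# Minimal sketch of the "tracks" idea: each gene call carries the evidence
# tracks that support it, so downstream users can filter by method.
from dataclasses import dataclass, field


@dataclass
class GeneCall:
    locus_tag: str
    start: int
    end: int
    strand: str
    evidence: dict = field(default_factory=dict)  # track name -> details


call = GeneCall("gene0001", 1200, 2100, "+")
call.evidence["ab_initio"] = {"tool": "Prodigal", "score": 112.4}
call.evidence["rna_seq"] = {"mean_coverage": 85}
call.evidence["homology"] = {"best_hit": "P0A7G6", "identity": 0.92}

# A consumer could then keep only calls backed by experimental evidence:
experimental = [c for c in [call] if "rna_seq" in c.evidence]
```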

The community should be using automated data profiling tools, the same tools and methods that are used in commercial data warehousing projects. We introduced them to one of our PIs recently and his reaction was - WOW that's easy - saves 90% of the time required. Now we just need to assemble the team of students (easy)...and get the money to buy the software (4K/seat - hard).

The annotation problem with GenBank goes beyond short sequences and wrong predictions. In the case of virus sequences, 60% cannot be traced to a particular geographic location, and 465 virus species do not agree with the taxonomic standards generated by the International Committee on Taxonomy of Viruses.
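That geographic gap is easy to quantify, at least roughly. A minimal sketch along these lines (assuming Biopython, a local GenBank flat file of virus records with a made-up name, and the standard INSDC /country source qualifier) would count the untraceable records:

```python
# Minimal sketch: count records whose source feature lacks a /country
# qualifier. Assumes Biopython and a local GenBank flat file of virus records;
# the file name is illustrative.
from Bio import SeqIO


def missing_location(genbank_path):
    """Yield accessions with no country qualifier on any source feature."""
    for record in SeqIO.parse(genbank_path, "genbank"):
        sources = [f for f in record.features if f.type == "source"]
        has_location = any("country" in f.qualifiers for f in sources)
        if not has_location:
            yield record.id


if __name__ == "__main__":
    untraceable = list(missing_location("viruses.gbk"))
    print(f"{len(untraceable)} records lack a geographic origin")
```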
