Genome Annotation: There's Too Much Crap in GenBank

By mikethemadbiologist on January 7, 2010.

One of the exciting things about bacterial genomics is that, within a year, we'll definitely be in the era of the $1,500 bacterial genome, although that's probably an overestimate. This cost includes everything: labor, sequencing, genome assembly, and genome annotation. While sequencing is highly automated, and has been turned into a production process, akin to a factory, high quality genome annotation, until very recently, has not. Automated software gets about 95% of gene calls right, but the other five percent differs based on the algorithm used.
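To make the "differs based on the algorithm used" point concrete, here is a minimal sketch of how one might measure disagreement between two gene callers. It assumes each call has already been reduced to a (start, end, strand) tuple and counts only exact matches as agreement; the function name, the tuple layout, and the coordinates are all hypothetical, not output from any real tool.

```python
def compare_calls(calls_a, calls_b):
    """Compare two sets of gene calls by exact coordinate match.

    Each call is a hypothetical (start, end, strand) tuple. Exact matching
    is a deliberately strict assumption: calls that differ by a single
    base count as disagreements.
    """
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Fraction of all distinct calls that both tools agree on.
        "agreement": len(shared) / len(union) if union else 1.0,
    }


# Made-up coordinates for illustration only.
caller_a = [(1, 100, "+"), (200, 500, "-"), (600, 900, "+")]
caller_b = [(1, 100, "+"), (200, 500, "-"), (650, 900, "+")]
summary = compare_calls(caller_a, caller_b)
```

The calls that land in `only_a` or `only_b` are the ones a human (or a smarter pipeline) would have to adjudicate.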

Unfortunately, this means that, instead of using computers, we need to use humans. Humans suck: we're very slow, we're sloppy, and we're subjective.

This sorry state of affairs exists, in large part, because there hasn't been a need to fix it--until about a year ago, annotation (figuring out which parts of the genome encode genes and what those genes might be) wasn't the rate-limiting step. Now it is.

This has led to various groups using automated gene prediction software. Unfortunately, not all groups have procedures in place to catch erroneous gene calls and so bad gene calls are made--I'm not talking about subtle differences here, but really obvious stuff that doesn't pass the interocular test* (E. coli genomes chock full of 12 amino acid long proteins). This is bad for people who want to pull these genomes out of GenBank and study them. But this also screws up gene identification of new genomes. High quality gene prediction processes rely on both ab initio rules and previously identified genes from other genomes.
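The kind of obvious junk described above can be caught with a very crude check. The sketch below reads protein sequences in FASTA format and flags any that fall under a length cutoff. The 30 amino acid default is an arbitrary assumption for illustration, not an established standard, and the function names are hypothetical. Genuine small proteins do exist, so this is a "flag for review" filter, not a "delete" filter.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def flag_short_proteins(lines, min_length=30):
    """Return (header, length) for proteins shorter than min_length.

    min_length is an assumed cutoff; tune it to the organism and project.
    A trailing '*' (stop symbol) is not counted toward the length.
    """
    flagged = []
    for header, seq in read_fasta(lines):
        length = len(seq.rstrip("*"))
        if length < min_length:
            flagged.append((header, length))
    return flagged


# Made-up records for illustration only.
example = [
    ">geneA",
    "MKT" * 20,       # 60 amino acids
    ">geneB",
    "MKTLLVAGAWSR",   # 12 amino acids
]
flagged = flag_short_proteins(example)
```

To run it on a real file, pass an open file handle instead of the list: `flag_short_proteins(open("proteins.faa"))`, where `proteins.faa` is a placeholder name.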

So, garbage in, garbage out. It's reached the point, for at least one project I'm involved in, where we've had to create an 'embargoed' list of genes that has been locked down, since there are a lot of crap gene calls. That might work for some species where we have a lot of genomes and other information, but it will be less successful for those organisms that have relatively few genomes. With the flood of data coming (present and future), we have to move to systems where human oversight is limited (unless we want to have years of annotation backlog). We really need to clean out GenBank, but I'm not sure how to go about doing that.

An aside: An analogous problem exists with 16S rRNA databases: unless they're manually curated, a significant fraction of sequences are PCR artifacts.

*Interocular test--hits you right between the eyes...

I don't know if this is any consolation, but the student annotation projects, where students do functional and structural annotation of bacterial genomes, have a pretty good track record.

I wrote about some of these in Genome Technology. I think having larger groups of people involved, and having faculty and database curators focus on quality control and error-checking would help move things along much quicker.

In my youth L. J. Savage told me about the "best statistical test in the world, the Interocular Traumatic Method." ;) It's nice to see the term still in use.

I think manual annotation is becoming obsolete. Except for a few model organisms, I think we should try to move to (1) evidence-based annotation based on RNA sequencing and/or mass spec for a significant subset of genomes, and (2) for other genomes, throw out all the old annotations and continuously replace them with new automated ones as the tools improve. I think RefSeq is actually doing some of #2 but I'm not sure how comprehensive it is. I haven't heard of any systematic efforts to do #1 but I think the costs are becoming reasonable.

In this view of the future, manual intervention will mostly be useful for reporting errors when a biologist stumbles across one, so that the tools improve, and for improvements to databases with putative gene functions (e.g., KEGG, SEED, MetaCyc).

Morgan,

I agree, evidence-based annotation is the way to go, but, of course, that costs money...

So on the previous post we were worrying about the lack of new jobs, and on this one we worry about not having enough folks out there annotating... We just have to convince those who have the gold that this is something they should support.

My own take on this is that GenBank (as opposed to RefSeq) has always been this way, essentially by design in many ways. (I'm referring to the concept that responsibility for maintaining an entry lies with whoever submitted it.) Sifting out the rubbish (garbage) has always been part of using GenBank.

Ditto on using RefSeq - have you tried restricting to it?

Of course, it is not perfect either, but it is a more reasonable task to clean up.

Grant, comment 6, has it right. GenBank is an archival resource. Thus, the general idea at NCBI is not to change the archive; each entry is a view at the time it was entered. There continues to be a growing challenge in that it is not clear when items in the archive change because they are either completely incorrect or updated. In part this is related to the historical view that the scientists submitting the data were the authors of the data, and thus the data's owners, who have to approve any changes. This view gets blurry as that world changes to higher data production, community-developed knowledge, and automated annotation.

To Morgan's point (comment 4), evidence-based annotation costs money. But I'd ask: which costs more, manual labor-based curation or automated systems that can update annotations with the latest evidence? The real challenge here becomes updating the annotations. In the human genome world it took UCSC about 8 months to move from build 36 to build 37 (hg18 to hg19) because of the work involved in re-annotating all of the tracks to the most recent coordinates.

What about a wiki based model? RNA sequencing has also become an effective way of producing evidence based annotation. If the annotation had different tracks, each track based on the different method used - algorithms, RNA sequencing, transcription factor binding, ...

The community should be using automated data profiling tools, the same tools and methods that are used in commercial data warehousing projects. We introduced them to one of our PIs recently and his reaction was - WOW that's easy - saves 90% of the time required. Now we just need to assemble the team of students (easy)...and get the money to buy the software (4K/seat - hard).

The annotation problem with GenBank goes beyond short sequences and wrong predictions. In the case of virus sequences, 60% cannot be traced to a particular geographical location, and 465 species of viruses do not agree with the taxonomic standards set by the International Committee on Taxonomy of Viruses.

Maybe it would be better if we scrapped all the old data and made new data from scratch.
