It’s pretty common these days to pick up an issue of Science or Nature and see people ranting about GenBank (1). Many of the rants are triggered, at least in part, by a widespread misunderstanding of what GenBank is and how it works. Perhaps this can be solved through education, but I don’t think that’s likely. People from the NCBI can explain over and over again that some of the sequence databases at the NCBI, GenBank included, are meant to be archival resources (2), and they can define the term “archive,” but that’s not going to help.
Confusion about database content and oversight is widespread in this community with good reason.
Why are researchers confused?
Let’s begin with GenBank. GenBank is the main database of nucleotide sequences at the NCBI. Sequence data are submitted to GenBank by researchers or sequencing centers. If mistakes are found, the records can be updated by the submitters, or by third parties if the corrected versions are published. This correction doesn’t always happen, though, and the requirement that third-party annotations be published makes it pretty unlikely that anyone will submit small corrections to a sequence.
This is why we see these kinds of quotes from Steven Salzberg (3):
“So you think that gene you just retrieved from GenBank is correct? Are you certain? If it is a eukaryotic gene, and especially if it is from an unfinished genome, there is a pretty good chance that the amino acid sequence is wrong. And depending on when the genome was sequenced and annotated, there is a chance that the description of its function is wrong too.”
Okay, so for the moment we’ll accept that GenBank is an archival resource and some of the data may be wrong. Aren’t some of the databases, like RefSeq, curated?
Right. Some databases are curated; some are not.
Finding out whether a database is curated requires a bit of work. Most of the people I know who use BLAST, for example, don’t know which database they’re querying, much less whether it’s curated.
But wait, there’s a question mark next to the database selection on the blastn search page.
Would this give us the answer?
Well, this is what we get. Does this look like an answer? Does this tell me if a database is a “library of archival information” or an entity that gets updated?
Not to me. In most definitions, all we get is a longer name instead of the abbreviation.
That’s not to say the information about curation levels doesn’t exist. If you look thoroughly and read the chapters in the handbook, you’ll find out that the Reference Sequence Project does contain sequences that get updated and reviewed. You can find the information if you know to seek it out and you know where to look.
But it’s still not easy.
If we group some of the nucleotide and protein sequence databases together, there are at least 35 different databases at the NCBI. Each of them operates somewhat differently, is curated (or not) to a different extent, is updated at a different rate, and has its own idiosyncrasies. Even the Entrez search engine looks somewhat different depending on which database you’re using. It’s no wonder that people get confused about which rules apply to which one.
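One small clue about what kind of record you have is the accession itself. RefSeq accessions use a two-letter prefix followed by an underscore, while submitter records in GenBank do not. Here is a minimal Python sketch of that check. The function name, the regular expression, and the short prefix table are my own illustrative choices, not anything the NCBI provides, and the table is deliberately incomplete.

```python
import re

# Assumption: a simplified pattern for RefSeq-style accessions, i.e. two
# letters, an underscore, letters/digits, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")

# A partial table of common RefSeq prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC": "genomic, complete molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA model",
    "XR": "predicted non-coding RNA model",
    "XP": "predicted protein model",
}


def describe_accession(accession: str) -> str:
    """Hypothetical helper: say whether an accession looks like RefSeq."""
    acc = accession.strip().upper()
    if REFSEQ_PATTERN.match(acc):
        kind = REFSEQ_PREFIXES.get(acc[:2], "other record type")
        return f"RefSeq ({kind})"
    return "not a RefSeq-style accession"


if __name__ == "__main__":
    # Example accessions, used here only to show the two formats.
    for acc in ["NM_000795.4", "XP_123456.1", "U00001"]:
        print(acc, "->", describe_accession(acc))
```

Keep in mind that the accession format only tells you which collection a record belongs to. It doesn’t tell you how much review a particular RefSeq record has had; for that you still have to open the record and read its status.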
Now, let’s consider the rest of the world. The 2008 Database issue of Nucleic Acids Research lists more than 1078 databases, each with its own rules and level of curation (4). It makes my head spin.
How do those databases get updated, if they do at all?
The other day I found some outdated references in the Gene database at the NCBI. (Some GeneRIF citations were listed in the references for the wrong gene.) I would expect to find this kind of problem if the annotations were added by computers, since computers, and most researchers, would assume that if a mutation is called “TaqI DRD2,” it maps within DRD2.
I did find some information on how to fix the problem and submit corrections to the Gene database. But with 31 mistaken entries, I realized that this project would take way too long, and since I don’t work in this area, I don’t have much incentive to volunteer.
It’s a community problem
So what do we do? Do we care if the database information is up-to-date? If so, who should be responsible for the updates?
I’m sure some people would like the NCBI to be the final authority and just fix everything, but I don’t think that’s very realistic.
Other people have proposed that wikis are the answer. Maybe they’re right, but I really wonder if researchers would be any better at updating wikis than they are at updating information in places like the NCBI.
Well, dear readers, what do you think? Does GenBank need to be fixed? Do we just need more alternatives? Does it even matter?
Update: Deepak also has some interesting ideas in “Thinking about biological resources.”
1. Bidartondo, M. “Preserving Accuracy in GenBank.” Science 21 March 2008: Vol. 319, no. 5870, p. 1616.
2. Pennisi, E. “Proposal to ‘Wikify’ GenBank Meets Stiff Resistance.” Science 21 March 2008: Vol. 319, no. 5870, pp. 1598-1599.
3. Salzberg, S. “Genome re-annotation: a wiki solution?” Genome Biology 2007, 8:102.
4. Galperin, M. “The Molecular Biology Database Collection: 2008 Update.” Nucleic Acids Research 2008, Vol. 36, Database issue, pp. D2-D4.