Biologists vs. the Age of Information

It's pretty common these days to pick up an issue of Science or Nature and see people ranting about GenBank (1). Many of the rants are triggered, at least in part, by a widespread misunderstanding of what GenBank is and how it works. Perhaps this could be solved through education, but I don't think that's likely. People from the NCBI can explain over and over again that GenBank is meant to be an archival resource (2), and define the term "archive," but that's not going to help.

Confusion about database content and oversight is widespread in this community, and with good reason.

Why are researchers confused?

Let's begin with GenBank. GenBank is the main database of nucleotide sequences at the NCBI. Sequence data are submitted to GenBank by researchers or sequencing centers. If mistakes are found, the information in the records can be updated by the submitters, or by third parties if the corrected versions are published. These corrections don't always happen, though, and the requirement that third-party annotations be published makes it pretty unlikely that anyone will submit small corrections to a sequence.
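
One mechanical detail is worth knowing here: GenBank and RefSeq accessions carry a version suffix that gets bumped whenever the sequence itself is revised, so you can at least tell whether a record has ever been updated. Here's a minimal sketch using Biopython's Entrez utilities (this assumes you have Biopython and network access; the accession is just an illustrative RefSeq ID):

```python
# Minimal sketch: fetch a GenBank-format record and inspect its version.
# Assumes Biopython is installed; the accession below is illustrative.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # NCBI asks for a contact address

handle = Entrez.efetch(db="nucleotide", id="NM_000546",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

# record.id is the versioned accession, e.g. "NM_000546.4"; the number
# after the dot increments each time the sequence is updated.
print(record.id)
print(record.annotations.get("sequence_version"))
```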

This is why we see these kinds of quotes from Steven Salzberg (3):

So you think that gene you just retrieved from GenBank [1] is correct? Are you certain? If it is a eukaryotic gene, and especially if it is from an unfinished genome, there is a pretty good chance that the amino acid sequence is wrong. And depending on when the genome was sequenced and annotated, there is a chance that the description of its function is wrong too.

Okay, so for the moment we'll accept that GenBank is an archival resource and some of the data may be wrong. Aren't some of the databases, like RefSeq, curated?

Right. Some databases are curated; some are not.

Finding out whether a database is curated requires a bit of work. Most of the people I know who use BLAST, for example, don't know which database they're querying anyway, much less whether it's curated.

But wait, there's a question mark next to the database selection on the blastn search page.

Would this give us the answer?

Well, this is what we get. Does this look like an answer? Does it tell me whether a database is a "library of archival information" or an entity that gets updated?

[Screenshot: database descriptions from the blastn search page]


Not to me. In most definitions, all we get is a longer name instead of the abbreviation.

That's not to say the information about curation levels doesn't exist. If you look thoroughly and read the chapters in the NCBI Handbook, you'll find that the Reference Sequence (RefSeq) project does contain sequences that get updated and reviewed. You can find the information if you know to seek it out and you know how to look.

But it's still not easy.

If we group some of the nucleotide and protein sequence databases together, there are at least 35 different databases at the NCBI. Every one of them operates somewhat differently, is curated (or not) to a different extent, is updated at a different rate, and has its own idiosyncratic nuances. Even the Entrez search engine behaves somewhat differently depending on which database you're using. It's no wonder that people get confused about which rules apply to which database.
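
You can see the scale of the problem for yourself: NCBI's EInfo utility lists every Entrez database along with a one-line description and a last-update date, and, tellingly, nothing about curation. A rough sketch with Biopython (again assuming network access, and being polite about request rates):

```python
# Rough sketch: list every Entrez database with its description and
# last-update date. Note that nothing in this metadata tells you
# whether a database is curated.
from Bio import Entrez

Entrez.email = "you@example.com"

handle = Entrez.einfo()  # no db argument = list all databases
databases = Entrez.read(handle)["DbList"]
handle.close()

for db in databases:
    handle = Entrez.einfo(db=db)
    info = Entrez.read(handle)["DbInfo"]
    handle.close()
    print(f"{db}: {info['Description']} (last updated {info['LastUpdate']})")
```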

Now, let's consider the rest of the world. The 2008 Database issue of Nucleic Acids Research lists 1,078 databases, each with its own rules and levels of curation (4). It makes my head spin.

Do those databases get updated, and if so, how?

The other day I found some outdated references in the Gene database at the NCBI (some GeneRIF citations were attached to the wrong gene). I would expect to find this kind of problem if the annotations were added by computers, since a computer, like most researchers, would assume that a mutation called "TaqI DRD2" actually maps within DRD2.

I did find some information on how to fix the problem and submit corrections to the Gene database. But with 31 mistaken entries, I realized that this project would take way too long, and since I don't work in this area, I don't have much incentive to volunteer.
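
For anyone who wants to audit GeneRIF citations themselves, Entrez's ELink utility can return the GeneRIF-linked PubMed IDs for a gene, which you can then check against the actual papers. A sketch with Biopython; I'm taking the linkname "gene_pubmed_rif" from the ELink documentation, and gene ID 1813 (human DRD2) is just an illustrative example, so verify both before relying on this:

```python
# Sketch: pull the PubMed IDs cited in GeneRIFs for one gene.
# Gene ID 1813 is human DRD2 (illustrative). The linkname
# "gene_pubmed_rif" restricts the links to GeneRIF citations; if it
# fails, check the available linknames via Entrez.einfo(db="gene").
from Bio import Entrez

Entrez.email = "you@example.com"

handle = Entrez.elink(dbfrom="gene", db="pubmed", id="1813",
                      linkname="gene_pubmed_rif")
result = Entrez.read(handle)
handle.close()

for linkset in result[0]["LinkSetDb"]:
    pubmed_ids = [link["Id"] for link in linkset["Link"]]
    print(linkset["LinkName"], pubmed_ids)
```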

[Image: biology databases]

It's a community problem

So what do we do? Do we care if the database information is up-to-date? If so, who should be responsible for the updates?

I'm sure some people would like the NCBI to be the final authority and just fix everything, but I don't think that's very realistic.

Other people have proposed that wikis are the answer. Maybe they're right, but I really wonder if researchers would be any better at updating wikis than they are at updating information in places like the NCBI.

Well, dear readers, what do you think? Does GenBank need to be fixed? Do we just need more alternatives? Does it even matter?

Update: Deepak also has some interesting ideas in "Thinking about biological resources."

References:

1. Bidartondo, M. "Preserving Accuracy in GenBank." Science, 21 March 2008: Vol. 319, No. 5870, p. 1616.

2. Pennisi, E. "Proposal to 'Wikify' GenBank Meets Stiff Resistance." Science, 21 March 2008: Vol. 319, No. 5870, pp. 1598-1599.

3. Salzberg, S. "Genome re-annotation: a wiki solution?" Genome Biology, 2007, 8:102.

4. Galperin, M. "The Molecular Biology Database Collection: 2008 Update." Nucleic Acids Research, 2008, Vol. 36, Database issue, pp. D2-D4.

Comments

From my limited experience of bioinformatics databases, they strike me as being like a kind of elaborate Wikipedia, but without the standards. I mean, this has to be bad science, if you don't even know how valid your data is. Garbage in, garbage out...

Two things in particular strike me.

1. The lack of cohesion in the field - as evidenced by the bewildering array of different DBs with wildly differing standards - is ridiculous.

2. The vast majority of biologists are far more limited in their computer skills than they think they are, and this applies especially to many of those starting and maintaining databases.

Does GenBank need to be fixed? Yes, at least in the sense that sequences which are clearly proven to be erroneous (such as chimeric 16S sequences) need to be marked as such. I wouldn't necessarily remove them, because they can still serve useful purposes (such as being test samples for chimeric sequence detectors), but they need to be re-annotated.

My recommendation? Have GenBank editors, members of the scientific community, who field reports on sequences from the scientific community and make determinations as to the "final call" on a particular sequence. This process should be non-anonymous. Someone spots a problem, they file a report. The report is handled by an editor who contacts the submitter. The submitter can then defend their submission. The editor gets to make the final call. This way it doesn't fall on the GenBank staff to handle everything, and it also gives people in the community a chance to put something else on their CV.

Frankly, every time I see a wet-lab researcher complain about the state of these databases, I get a little pissed. The problem isn't the databases, it's the biology-wide lack of commitment to funding solid bioinformatics programs.

I mean, have you ever tried to recreate someone else's analysis work from a paper? With all the different formats, sloppy and unpublished code, and the parameters they leave out, it's damn near impossible. If people were as sloppy when describing their bench work, the paper would be laughed at.

Right now, a huge proportion of computational work is done as side work by grad students or post-docs with little training in computer science. This is because the NIH and other funding agencies won't commit money to building the infrastructure to support bioinformatics endeavors.

I think that for every 5 biology grants the NIH puts out, they should reserve one more for computational groups who can build platforms and tools for analysis and data management, and inject some sanity into this field.

TomJoe: I like that idea. Temporary editor positions could be analogous to being a program officer at the NSF or serving on a study section to review grant proposals.

Chris: I don't think this is a problem that computer science can address. It doesn't seem likely to me that classes in computer science will make people more conscientious about updating information.

I agree though about the sloppy way that computer work gets described in publications, but that's another blog post. :-)

Perhaps you're right that some kind of computer course could be helpful for biologists. We would certainly have more realistic expectations of what can and can't be done by computers.

This is a subject that I've been concerned about for 20 years. Check out my blog posting on Sandwalk where I describe my experience working with GenBank as a curator.

Let's face facts. Curation ain't gonna happen. It's way too expensive.

I'm trying to type while laughing, it may not come out right...

Have y'all ever done curation? I have. I was actually entering my own paper into a database that I was developing as part of the bioinformatics team. I was horrified by the key points I had left out of the paper--stuff that I actually probably didn't even know, but that made databasing it hell. Mouse strain? I don't know--we got them donated from a company that was finished with them. I might argue that for the particular northern blot it didn't matter much--I was just showing a 9kb transcript. But it is still a missing point. Other things that were important to the database weren't in the paper either.

And then I read other people's stuff. It was even worse with missing information.

I blame the biologists :) And it was ME.

No, seriously--there's plenty of blame to go around. And there's plenty of funding tension between people on big expensive database projects and small labs. As funding gets tighter both hurt.

But then people will try creative ways to get more/better information into the databases--but that could be fraught with danger as well. http://www.openhelix.com/blog/?p=237

I'm looking forward to the outcome of this TAIR effort, but I'm not expecting much. Community annotation in general is not widely adopted, and the quality varies so much.

I think you need professionals--trained curators with backgrounds in the appropriate bench science, with institutional memories for the database development and issues--to do it right.

I was horrified by the key points I had left out of the paper--stuff that I actually probably didn't even know, but that made databasing it hell. Mouse strain? I don't know--we got them donated from a company that was finished with them.

The results from some experiments will be entirely strain dependent. I'm surprised the reviewers didn't ask for or demand this information.

I've done curation before, and it's a demanding job ... which is why it's important that the individuals overseeing the curation process be experts in their field. The same thing works for journal editing and grant reviewing. It should work just as well in this scenario. Obviously, the entire process should be documented ... in my example above, the challenge to the information, the response, and the final decision should all be documented publicly so people are aware of what has preceded (and can add any new insights).

It's apparent that I didn't make my point clear. I don't advocate hiring more reviewers; I think there should be a better way to get the community involved.

No one is as much an expert in a field as the people who work in it. They should feel some sense of ownership and have some kind of vested interest in seeing that the databases are accurate and up-to-date.

This isn't a computational problem. It's a problem of getting community participants involved in sharing and contributing to the knowledge base of the greater community.

Speaking as a microbial genome annotator/curator, it's all about the $$$$. U.S. funding agencies consider automated annotation with, at best, one round of human annotation, 'good enough' for a genome project. They won't typically fund a grant solely for updating old genome annotation; and if periodic re-annotation is included as a budget line item in a grant, that line won't typically make it through the review process. I'd love to see this sad state of affairs change and I'm glad it's getting some spotlight attention in scienceblogs.

By Steven Sullivan on 23 Jun 2008

Community curation is a lovely thought. It has almost always been a failure in actual projects. Sometimes there are a few dedicated folks who stick it out. More often the founding grad student or postdoc on the project moves on.... But nobody has time for this with all the other things pulling at them--teaching, the money chase, life, oh--and the actual research.

A decade ago, on a project where we were tossing the idea around, it was already clear that it doesn't work. There is no credit given to "community curation" by tenure committees. It does not have a spot on the NIH application.

I wish it wasn't that way, but it is.

Well, at one time, it was difficult even to get people to submit their data. Now, with the journals requiring accession numbers, sequence submission has become a given.

Are there carrots (or sticks) that could make community curation more rewarding?

Well, at one time, it was difficult even to get people to submit their data. Now, with the journals requiring accession numbers, sequence submission has become a given.

Yeah--but didn't that get us GenBank? Isn't that the issue?

GenBank mostly needs better documentation so that its users can adjust their expectations accordingly.

Yeah, everybody reads documentation :)

And I do think it is the issue. Incorrect information (or information that was found later to be incorrect--I don't mean it was intentional) is propagated all over the place based on these community submissions. I actually worked with a guy who accidentally misspelled the protein name in a field in his submission. He specifically didn't want it corrected because he knew that he could do a search with the wrong spelling and get back the record he wanted.

It will be impossible to fine-tooth-comb every submission, especially relying on people who aren't trained and don't understand Gene Ontology or any other standards that databases rely upon.

I think the community can do quite a bit if they want to and if there's a community ethic around the notion of being responsible and participating in the process.

Communities of people have built some amazing things - the Oxford English Dictionary was compiled through the work of untrained amateurs - and Wikipedia, while not perfect, is still a pretty useful resource. Surely scientists with Ph.D.s can do as well, or almost as well, as amateur curators - if there's some support and encouragement for them to do so.

I'm sure they are capable. I just don't think it will happen. And I have seen it fail over and over. The number of dead wikis and entries on topics that you would *think* would be hot is rather depressing. Multiple people have flat-out told me that they tried a wiki, but nobody contributed....

It requires leadership. It requires management. It requires software. It requires training. It requires commitment. It requires effort. It requires time. It would be great to see that come on a volunteer basis. But I don't think it is likely. I think the quality, consistency, and maintenance that we yearn for only comes from a funded project with trained professionals.

It also requires nagging. And after a while, that pisses some nag-ees off. ;>

That said, the Gene Ontology consortium is an annotation-related 'community effort' of scientists that works (probably because there is actual funding for GO work)

http://www.geneontology.org/GO.consortiumlist.shtml

By Steven Sullivan on 24 Jun 2008

Community projects are great if you want to create an exhaustive database of plot devices in Doctor Who episodes. For a task like this one, though, you'd end up with a result that combined the management of uncurated databases with the accuracy of Wikipedia.

It needs money, pure and simple. It needs dedicated staff who can verify data, phone up researchers, respond to feedback, ensure standards are being met and so on and so forth.

Your post starts "It's pretty common these days to pick up an issue of Science or Nature and see people ranting about GenBank (1)." You cite Science but not Nature. When did Nature do this? I am one of the editors and I can't recall. Thanks!