Bio Databases 2016

Someone missed the memo. Over the past year, news and presentations by NIH leaders like Philip Bourne have communicated that the proliferation of biologically focused databases is unsustainable. However, unlike last year, when the number of databases tracked by Nucleic Acids Research (NAR) dropped by three, 2015's net growth was 136.

[Figure: number of dbs 2016]

Counting databases is hard

As summarized in the database issue's introduction, Rigden, Fernández-Suárez, and Galperin tell us this year's issue (the 23rd annual) has 178 papers: 62 describe new databases, 95 provide updates, and 17 are updates of databases that were published elsewhere (as an aside, later in the paper they say 15 are updates of databases published elsewhere). Together, these (62+15) make 77 new entries in the NAR online Molecular Biology Database Collection. With the loss of 23 databases, removed because they were obsolete, the net gain is 54 new databases. How did I calculate 136 new databases? In the abstract the authors indicate that the current total tracked is 1685; last year I calculated that there were 1549 in the archive. Perhaps the difference (62 vs. 136) results from adding databases that were not described in a publication.
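The bookkeeping above can be laid out as a few lines of Python. This is only a sketch of the arithmetic using the figures quoted in this post; the variable names are mine, and it assumes the "15 published elsewhere" figure rather than 17, since that is the one that produces the 77 new entries.

```python
# Figures quoted from the NAR database issue introduction, as reported in this post.
new_in_issue = 62          # papers describing new databases
published_elsewhere = 15   # assumption: the text gives both 15 and 17; 15 yields 77
removed_obsolete = 23      # databases dropped from the collection as obsolete

new_entries = new_in_issue + published_elsewhere
net_gain_reported = new_entries - removed_obsolete

# Totals: 1685 is from the abstract; 1549 is my own count from last year.
total_this_year = 1685
total_last_year = 1549
net_growth_from_totals = total_this_year - total_last_year

# Databases added to the collection that the paper counts do not account for.
unexplained = net_growth_from_totals - net_gain_reported

print("new entries:", new_entries)
print("net gain from paper counts:", net_gain_reported)
print("net growth from totals:", net_growth_from_totals)
print("unexplained difference:", unexplained)
```

If last year's count of 1549 is right, the two approaches disagree by the final number printed, which is the gap the paragraph above is puzzling over.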

Bio Databases have an uncertain future

As noted above, a conversation is emerging about the value and role of bio databases in research. Many in the community acknowledge that the specialized information contained in these databases is valuable, as each resource captures a unique aspect of biology. Just by browsing the database names one gets an amazing overview of molecular biology. For example, this year's new resources included databases on recently discovered non-coding RNA, CRISPR sequences for gene editing in zebrafish, a super enhancer archive, and many others.

However, once these databases are created, they need to be maintained, and funds directed toward new research are often out of alignment with that goal. Indeed, NAR noted that 121 databases were non-responsive in curation checks, prompting the removal of 23 resources that were deemed obsolete.

For a funding agency, long-term database maintenance has real costs. According to a recent news report (additional reading, below), nine databases (model organisms, OMIM, and others) supported by the National Human Genome Research Institute (NHGRI) cost nearly $30 million per year to operate, and NHGRI has stated that they need to have new funding models in four years.

When the 50 largest NIH-supported resources (not counting GenBank and other National Library of Medicine [NLM]-supported databases) are considered, this expenditure increases to $110 million. While these costs are seemingly high, the real cost needs to be expressed in the context of the overall budget, otherwise it's just political theater.

So how much does NIH spend on databases? Presently, NIH's budget is $30+ billion, with $33 billion requested for 2016. $110MM is less than 0.4% of the total. If we include all of NLM's ~$394MM budget request, which includes $54MM for the National Center for Biotechnology Information (NCBI, home of GenBank), the total expenditure on information resources as a percentage of the total is still less than 2%. The portion of NHGRI's ~$515MM requested budget dedicated to the nine databases is higher, at nearly 6%.
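For anyone who wants to check the percentages, here is a rough sketch in Python. It uses only the rounded figures quoted above; the function name and variable names are mine, and the $30 billion denominator is an assumption standing in for "$30+ billion" (using the $33 billion request instead makes every NIH-wide share smaller).

```python
def share_percent(part_mm: float, total_mm: float) -> float:
    """Return part as a percentage of total (both in millions of dollars)."""
    return 100.0 * part_mm / total_mm

# All amounts in millions of dollars, rounded as quoted in this post.
nih_budget = 30_000        # assumption: "$30+ billion" taken as a flat $30 billion
top_50_resources = 110     # 50 largest NIH-supported resources, excluding NLM
nlm_request = 394          # approximate NLM budget request, includes NCBI
nhgri_request = 515        # approximate NHGRI budget request
nhgri_databases = 30       # "nearly $30 million" for the nine NHGRI databases

resources_share = share_percent(top_50_resources, nih_budget)
with_nlm_share = share_percent(top_50_resources + nlm_request, nih_budget)
nhgri_share = share_percent(nhgri_databases, nhgri_request)

print(f"50 largest resources vs. NIH budget: {resources_share:.2f}%")
print(f"Resources plus all of NLM vs. NIH budget: {with_nlm_share:.2f}%")
print(f"Nine databases vs. NHGRI request: {nhgri_share:.1f}%")
```

Because the inputs are approximate, the outputs are too; they are only meant to show that the "less than 0.4%", "less than 2%", and "nearly 6%" statements hold up under these assumptions.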

Evaluating the value of databases and their funding models is not a simple problem. Questions about who supports a resource, be it NIH, the consumers, or some other business model, need to consider a diverse group of stakeholders including the public (now a major consumer of information at NCBI). We also need to remember that the databases contain data supported with public funds, and many business models, like subscriptions, will make these resources closed, which runs counter to NIH's stated view on public access to government-funded science.

Biology has become an information science. The days of single gels, scattered observations, limited insights, and eureka moments are largely over. Hence, databases form a core element of research and public knowledge. Those Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) that are transforming molecular biology were only discoverable because we had easily accessible data in databases.

NIH continues to spend significantly on new endeavors focused on data production and use, but how much of that expenditure should go into managing resources is a legitimate topic of conversation and debate.

Is two cents too much to pay?


Additional Reading:

Funding for key data resources in jeopardy

http://www.slideshare.net/pebourne/the-nih-as-a-digital-enterprise-impl…
