Bio Databases 2016

By finchtalk on February 16, 2016.

Someone missed the memo. Over the past year, news and presentations by NIH leaders like Philip Bourne have communicated that the proliferation of biologically focused databases is unsustainable. However, unlike last year, when the number of databases tracked by Nucleic Acids Research (NAR) dropped by three, 2015's net growth was 136.

Counting databases is hard

As summarized in the database issue's introduction, Rigden, Fernández-Suarez, and Galperin tell us this year's issue (the 23rd annual) has 178 papers: 62 papers describe new databases, 95 provide updates, and 17 are updates of databases that were published elsewhere (as an aside, later in the paper they say 15 are updates of databases published elsewhere). Together, these (62+15) make 77 new entries in the NAR online Molecular Biology Database Collection. Against that, 23 databases were removed because they were obsolete, for a net gain of 54 new databases. How did I calculate 136 new databases? In the abstract the authors indicated that the current total tracked is 1685; last year I calculated that there were 1549 in the archive. Perhaps the difference (62 vs. 136) results from adding databases that were not described in a publication.
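The bookkeeping above is easy to lose track of, so here is a minimal sketch that recomputes it from the figures quoted in this post. The variable names are mine, not NAR's, and the last step assumes the gap should be measured against the reported net gain of 54 rather than the 62 new-database papers.

```python
# Figures quoted in the post (from the NAR database issue introduction).
papers_total = 178
new_db_papers = 62
update_papers = 95
elsewhere_counts = (17, 15)  # the two values the paper gives

# Sum of the paper categories under each "published elsewhere" count.
category_sums = [new_db_papers + update_papers + n for n in elsewhere_counts]

# Net gain as reported: new entries minus removed obsolete databases.
new_entries = new_db_papers + 15
removed = 23
net_reported = new_entries - removed

# Net gain implied by the collection totals (this year vs. last year).
total_now = 1685
total_last_year = 1549
net_from_totals = total_now - total_last_year

# Assumption: the gap is measured against the reported net gain.
unexplained = net_from_totals - net_reported

print("category sums:", category_sums, "vs", papers_total, "papers")
print("net gain reported:", net_reported)
print("net gain from totals:", net_from_totals)
print("unexplained difference:", unexplained)
```

Neither category sum reaches the 178 papers quoted, which is one more reason counting databases is hard.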

Bio Databases have an uncertain future

As noted above, a conversation is emerging about the value and role of bio databases in research. Many in the community acknowledge that the specialized information contained in these databases is valuable, as each resource captures a unique aspect of biology. Just by browsing the database names one gets an amazing overview of molecular biology. For example, this year's new resources included databases on recently discovered non-coding RNA, CRISPR sequences for gene editing in zebrafish, a super enhancer archive, and many others.

However, once these databases are created, they need to be maintained, and funds directed to new research are often out of alignment with that goal. Indeed, NAR noted that 121 databases were non-responsive in curation checks, prompting the removal of 23 resources that were deemed obsolete.

For a funding agency, long-term database maintenance has real costs. According to a recent news report (additional reading, below), nine databases (model organisms, OMIM, and others) supported by the National Human Genome Research Institute (NHGRI) cost nearly $30 million per year to operate, and NHGRI has stated that it needs to have new funding models in four years.

When the 50 largest NIH-supported resources (not counting GenBank and other National Library of Medicine [NLM]-supported databases) are considered, this expenditure increases to $110 million. While these costs are seemingly high, the real cost needs to be expressed in the context of the overall budget; otherwise it's just political theater.

So how much does NIH spend on databases? Presently, NIH's budget is $30+ billion, with $33 billion requested for 2016. $110MM is less than 0.4% of the total. If we include all of NLM's ~$394MM budget request, which includes $54MM for the National Center for Biotechnology Information (NCBI, home of GenBank), the total expenditure on information resources as a percentage of the total is still less than 2%. NHGRI's portion of its ~$515MM requested budget dedicated to the nine databases is higher at nearly 6%.
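For anyone who wants to check the percentages, here is a rough sketch using only the figures quoted above. It assumes a $30 billion NIH budget (the low end of "$30+ billion", so the shares are upper bounds), treats "nearly $30 million" as $30 million, and assumes the NLM request is simply added to the $110 million for the 50 largest resources. The `pct` helper is mine.

```python
# All amounts in millions of dollars, as quoted in the post.
NIH_BUDGET = 30_000        # assumption: low end of "$30+ billion"
TOP_50_RESOURCES = 110     # 50 largest NIH-supported resources, excluding NLM
NLM_REQUEST = 394          # approximate NLM budget request
NHGRI_REQUEST = 515        # approximate NHGRI budget request
NHGRI_NINE_DATABASES = 30  # assumption: "nearly $30 million" taken as 30


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


top50_share = pct(TOP_50_RESOURCES, NIH_BUDGET)
with_nlm_share = pct(TOP_50_RESOURCES + NLM_REQUEST, NIH_BUDGET)
nhgri_share = pct(NHGRI_NINE_DATABASES, NHGRI_REQUEST)

print(f"50 largest resources: {top50_share:.2f}% of NIH budget")
print(f"plus all of NLM:      {with_nlm_share:.2f}% of NIH budget")
print(f"nine NHGRI databases: {nhgri_share:.2f}% of NHGRI request")
```

Using the $33 billion request as the denominator instead would only make the first two shares smaller.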

Evaluating the value of databases and their funding models is not a simple problem. Questions about who supports a resource, be it NIH, the consumers, or some other business model, need to consider a diverse group of stakeholders including the public (now a major consumer of information at NCBI). We also need to remember that the databases contain data supported with public funds, and many business models, like subscriptions, will make these resources closed, which is at odds with NIH's stated view on public access to government-funded science.

Biology has become an information science. The days of single gels, scattered observations, limited insights, and eureka moments are largely over. Hence, databases form a core element of research and public knowledge. Those Clustered-Regularly-Interspaced-Short-Palindromic-Repeats that are transforming molecular biology were only discoverable because we had easily accessible data in databases.

NIH continues to spend significantly on new endeavors focused on data production and use, but how much of that expenditure should go into managing resources is a legitimate topic of conversation and debate.

Is two cents too much to pay?

Additional Reading:

Funding for key data resources in jeopardy

http://www.slideshare.net/pebourne/the-nih-as-a-digital-enterprise-impl…


