BioDatabases 2017 - What's out there?

By finchtalk on January 12, 2017.

It's time for my annual post about the Nucleic Acids Research (NAR) database issue. This is the 24th database issue for NAR and the seventh post for @finchtalk. As in most years, I had no idea what I was going to write about until I started reading the new issue. Something always inspires me.

This year's inspiration came from missing data.

In 2017, NAR lists 1662 databases, 23 fewer than last year.

In the database issue's introduction, Galperin, Fernández-Suárez, and Rigden tell us this year's issue has 152 papers. Of those, 54 describe new databases and 98 provide updates, including 16 updates of databases that had been published elsewhere. In addition, 18 duplicate entries and 30 obsolete databases have been removed from the catalog. But we are not told how many databases the catalog now holds. That is an exercise for the reader.

Given that last year the authors stated that there were 1685 databases, one would assume that this year's total would be 1685 + 54 + 16 - 18 - 30 = 1707, or 1691 if the 16 updated databases were already in the catalog and had just been described somewhere else. But, since we are not told that, we need to figure it out on our own.
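The two estimates can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable and function names are made up for illustration, and the figures are the ones quoted in this post.

```python
# Figures as quoted in this post from the issue's introduction.
LAST_YEAR_TOTAL = 1685
NEW_DATABASES = 54
UPDATES_PUBLISHED_ELSEWHERE = 16
DUPLICATES_REMOVED = 18
OBSOLETE_REMOVED = 30


def expected_total(count_elsewhere_as_new: bool) -> int:
    """Estimate this year's catalog size from last year's total.

    count_elsewhere_as_new: True if the 16 databases previously published
    elsewhere are assumed to be new to the catalog, False if they are
    assumed to be in it already.
    """
    added = NEW_DATABASES
    if count_elsewhere_as_new:
        added += UPDATES_PUBLISHED_ELSEWHERE
    removed = DUPLICATES_REMOVED + OBSOLETE_REMOVED
    return LAST_YEAR_TOTAL + added - removed


if __name__ == "__main__":
    print(expected_total(True))   # 1707
    print(expected_total(False))  # 1691
```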

Fortunately, the entire list of databases is available, so all you have to do is visit the page and count the entries. OK, that would be tedious and take forever, because you'd have to check your work and would likely get lost several times doing so. Instead, one can capture the text and write a Perl script to count the entries. When I did this, I got 1662. This is neither 1707 nor 1691. Because the catalog is maintained throughout the year, more databases have likely been removed than were reported in the article.
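For readers who want to try the count themselves, here is a minimal Python sketch. It is not the Perl script used for this post. It assumes the captured text has one entry per block, with blocks separated by one or more empty lines; the file name in the usage comment is hypothetical.

```python
import re


def count_entries(text: str) -> int:
    """Count blank-line-separated blocks in the captured catalog text."""
    # Split wherever a line break is followed by at least one empty line.
    blocks = re.split(r"\n\s*\n", text)
    # Ignore blocks that contain only whitespace (e.g. leading or trailing gaps).
    return sum(1 for block in blocks if block.strip())


# Usage (hypothetical file name):
#   with open("nar_catalog.txt", encoding="utf-8") as handle:
#       print(count_entries(handle.read()))
```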

As I counted the entries, I also looked at the titles and descriptions and thought about what we could learn from this information. After all, these 1662 databases are used to develop scientific knowledge. Can we use this data to learn about the kinds of things scientists are interested in?

Now my simple Perl script grew from a command line that counted empty lines into a script that had to grab the second line of each entry, parse that line, and count the words. A small state machine does the work: an empty line signals that a new entry follows, and an initialization step picks up the first entry. For students interested in bioinformatics, this is a common exercise with data.
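The same idea can be sketched in Python. This is an illustration, not the original Perl. It assumes each entry starts with a name line followed by a description line, and that entries are separated by empty lines; the function names are invented for this example.

```python
import re
from collections import Counter


def second_lines(lines):
    """Yield the second line of each blank-line-separated entry."""
    # State: how many non-empty lines of the current entry have been seen.
    # Starting at 0 is the initialization that lets the first entry work
    # even though no empty line precedes it.
    line_in_entry = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            line_in_entry = 0  # an empty line means a new entry follows
            continue
        line_in_entry += 1
        if line_in_entry == 2:
            yield line


def count_words(descriptions):
    """Count lower-cased words across all description lines."""
    counts = Counter()
    for description in descriptions:
        counts.update(re.findall(r"[a-z][a-z0-9'-]*", description.lower()))
    return counts
```

Calling `count_words(second_lines(text.splitlines()))` on the captured text gives a word-frequency table that can then be cleaned up.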

Once that was done, a review of the words indicated that some cleanup was in order. Common words that added little value were removed. Also, plurals were converted to singular forms to avoid duplication of terms. The last step was to use Wordle™ to create a tag cloud of terms found in the database descriptions.
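A rough Python sketch of this kind of cleanup follows. The stop-word list is a tiny illustrative sample, not the list used for this post, and the plural rule is a deliberately cautious assumption: a plural is merged into a singular only when that singular also appears in the data, so words like "analysis" or "species" are left alone.

```python
from collections import Counter

# Illustrative only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def singular_form(word, vocabulary):
    """Return a singular form found in the vocabulary, or the word unchanged."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")  # families -> family
    if word.endswith("s"):
        candidates.append(word[:-1])        # genes -> gene
    if word.endswith("es"):
        candidates.append(word[:-2])        # viruses -> virus
    for candidate in candidates:
        if candidate in vocabulary:
            return candidate
    return word


def clean_counts(counts):
    """Drop stop words and fold plural counts into their singular forms."""
    vocabulary = set(counts) - STOP_WORDS
    cleaned = Counter()
    for word, n in counts.items():
        if word in STOP_WORDS:
            continue
        cleaned[singular_form(word, vocabulary)] += n
    return cleaned
```

A rule like this still misses irregular plurals and can merge unrelated words by accident, so a manual review of the result is worth the time.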

So, what did I learn?

First, database is the most common term. Nearly 25% of the descriptions use that term. The next most frequent term is protein, which is followed by gene, genome, human, sequence, and data. The term structure, something we're interested in at Digital World Biology, is the eighth most frequent term. It is followed by genomic, interaction, and expression.

While DNA sequencing captures attention in the news, understanding how genotypes impact phenotypes requires that we deeply understand the relationship between sequence, structure, and function. Thus, it is not surprising that the most common terms describing biological databases would include words that describe this relationship.

The other interesting finding is the sheer number of unique words. The tag cloud summarizes 150 of the 2370 unique words. To be listed in the tag cloud, a word had to be used at least nine times. More than 1500 words were used only once. These are interesting and instructive too. A few of the words indicate that there are databases that include information on waterfleas, mites, exosomes, leptospira, paramecium, amoebazoa, honey, plexipus, bananas, and many others. The list of words used once also includes misspellings, word fragments, and words that add context to descriptions; many of these are chemical and biochemical terms.
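Given a table of word counts, these summary numbers are easy to pull out. The Python sketch below is an illustration with invented names; the threshold of nine is the one mentioned above.

```python
from collections import Counter


def summarize(counts, min_count=9):
    """Split a word-count table into tag-cloud words and one-off words."""
    # Words frequent enough for the tag cloud, most common first.
    frequent = [word for word, n in counts.most_common() if n >= min_count]
    # Words that appear exactly once.
    singletons = [word for word, n in counts.items() if n == 1]
    return {
        "unique": len(counts),
        "frequent": frequent,
        "singletons": singletons,
    }
```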

The real importance of the number and variety of words used to describe the databases, however, is that biological databases store and organize data and information about biology, and the complexity of biology cannot be stored in a single source.
