Bio Databases 2015

Something interesting happened in 2014. The total number of databases that Nucleic Acids Research (NAR) tracks dropped by three databases!

What happened? Did people quit making databases? No. This year, the "dead" databases (links no longer valid) outnumber the new ones. To celebrate Digital World Biology's release of Molecule World, I'll discuss some of the new structure databases below. But first, the numbers.

In the database issue's introduction, Galperin, Rigden, and Fernández-Suárez tell us this year's issue has 172 papers. Of those, 56 describe new databases, 98 provide updates, and 17 are updates of databases that have been published elsewhere. Together the 56 + 17 + 1 other make 74 new entries in the NAR online Molecular Biology Database Collection. Removing 77 obsolete databases made this year's growth -3.
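The bookkeeping behind that -3 is easy to check. Here is a quick tally in Python that uses only the counts quoted from the introduction; the variable names are mine.

```python
# Counts quoted from the introduction to the 2015 NAR database issue.
new_databases = 56          # papers describing new databases
updates = 98                # papers providing updates
published_elsewhere = 17    # updates of databases published elsewhere
other = 1                   # the one remaining paper

total_papers = new_databases + updates + published_elsewhere + other
new_entries = new_databases + published_elsewhere + other
obsolete_removed = 77

net_growth = new_entries - obsolete_removed

print(total_papers)   # 172
print(new_entries)    # 74
print(net_growth)     # -3
```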

The despair of riches

The introduction paints an exciting picture of database development. We have updates of existing resources and new resources that can be used to advance multiple kinds of research. I share the view that new knowledge is created from new kinds of databases and extensions of existing databases, and I am always excited to peruse the NAR database issue. But you do not need to peel the onion very far before you begin to cry.

The challenge any user, other than a virtuoso data-miner of the resource at hand, encounters when trying to quickly assess the value of a resource is finding out whether it is alive, is classified correctly, and lets you do things with the data other than browse information within the database's web site. A common tool to help someone evaluate a database's usefulness is an example of the data and a simple demonstration showing how the resource can be used. A short description of the resource's value on the web site also helps.

Structure PDB: 2ZJP - Thiopeptide antibiotic Nosiheptide bound to the large ribosomal subunit. The RNA and protein residues in the 23S rRNA and protein L11 are highlighted by residue coloring and ball & stick rendering. Nosiheptide is highlighted with element coloring and space fill rendering. The L11 backbone is shown in magenta and the structurally relevant portion of the 23S rRNA is shown in element coloring. This image was created in Molecule World.

As noted above, this year I wanted to find some data from a database, other than the RCSB Protein Data Bank (aka PDB) or the NCBI's MMDB (Molecular Modeling DataBase), that I could download and visualize in Molecule World. I excluded PDB and MMDB because I wanted to try something new.

What did I learn?

I followed the link from the introduction to the NAR Molecular Biology Database collection.  In this collection, databases can be accessed alphabetically, by category, or by other mechanisms. This is a browse and click experience.  Unlike the databases it collects, the issue doesn't allow you to search the collection. Since I wanted to get some structures and look at them in Molecule World, I started with the Structure Database collection.

Indeed, there are many databases in this collection. Structure databases are categorized as Small Molecule, Carbohydrate, Nucleic Acid Structure, and Protein. At the top, there is the BARD (BioAssay Research Database from the NIH Molecular Libraries Program) database. It contains structures for 39 million chemicals (perhaps a subset of the 50 million chemicals in PubChem?). The Small Molecule, Carbohydrate, Nucleic Acid, and Protein groups hold 24, 12, 22, and 116 databases, respectively. All together, counting BARD, that's 175 databases, or more than 10% of the entire collection.

Let's go digging and see if we can find something cool.

The search for interesting structures led me to structures with nucleic acids, proteins, and complexes. One of the databases that caught my eye was SCOR (Structural Classification of RNA). Unfortunately, the URL (http://scor.lbl.gov) takes you to a page you're not allowed to access. Maybe it's secret work that's been published for people to not use? How did the reviewers access this?

Another possibly cool one would be Quadbase (G-quadruplex motifs in promoters), but its URL (http://quadbase.igib.res.in) would not load a page. So far I'm 0 out of 2 just by clicking titles that look neat. Next, I tried NCIR (Non-canonical interactions in RNA, http://prion.bchs.uh.edu/bp_type/). This isn't non-canonical, it's a loop. Every link takes you back to the original page with no obvious way to get to the database. We're 0 for 3. Did we strike out? Finding broken database links was NOT our goal.
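Clicking titles one at a time is a slow way to find dead links. What follows is a minimal sketch of how the same check could be scripted with Python's standard library. The names `check_link` and `report` and the `opener` parameter are my own choices for illustration, and the URL list is just the three from this post. It is a sketch under those assumptions, not a description of how NAR checks its collection.

```python
import urllib.error
import urllib.request

# The three URLs tried above.
URLS = [
    "http://scor.lbl.gov",
    "http://quadbase.igib.res.in",
    "http://prion.bchs.uh.edu/bp_type/",
]


def check_link(url, timeout=10, opener=urllib.request.urlopen):
    """Return a short status string for one URL.

    `opener` is injectable so the function can be exercised without
    touching the network; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return f"answered (HTTP {response.status})"
    except urllib.error.HTTPError as err:
        # The server responded, but with an error such as 403 or 404.
        return f"refused (HTTP {err.code})"
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        return "unreachable"


def report(urls, **kwargs):
    """Check every URL and return a {url: status} dictionary."""
    return {url: check_link(url, **kwargs) for url in urls}


# Example (performs real network requests):
#     for url, status in report(URLS).items():
#         print(url, status)
```

An "answered" result only means the server replied. A page that loops back on itself, like the third example, would still count as answered, so a script like this can catch dead hosts but not dead databases.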

Structure PDB:1K0Z. Restriction enzyme PvuII showing that it is a dimer of two identical protein chains. The protein backbones are shown with rainbow (amino to carboxyl, red to blue) coloring. The Pr atoms are shown as balls. Within each chain, the interacting residues are colored by element.

MetalPDB (http://metalweb.cerm.unifi.it), categorized under Nucleic Acids, is, as the title suggests, a database of metal-binding sites in biological macromolecules. Although we found this listed in the Nucleic Acids category, the structures are mostly proteins.

A great feature in this database is a very cool search tool. It's a periodic table with a radio button under each metal. Metals without corresponding PDB structures have white symbols.

The white symbols provide a quick way to tell which metals are found in structures and which are not (29 out of 84). Our own data suggest this is a little on the low side, but that's another story.

I selected Pr (praseodymium) because, why not. 32 structures were returned, with the first in the list PDB:1K0Z. This protein is the restriction enzyme PvuII from Proteus vulgaris.

But here's where the experience moves from great to just OK. You can only work with structures within the website. Downloads? Make a small collection for a class? Sorry. You can't get there from here. Luckily, if I have a PDB ID, I know how to use it. With a quick search inside of Molecule World, I can get a structure from either the MMDB or PDB databases and make a fun picture. Ok, this database issue gets 1 point for mission accomplished (I found data, yeah!), -0.1 for being misclassified under nucleic acids, and -0.3 for not making the data easy to pull out. They get 0.6. That makes our cumulative score for today's adventure 0.6 out of 4.
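For anyone who would rather script the "small collection for a class" step, here is a hedged sketch of one way to do it from a list of PDB IDs. It assumes the RCSB file server's `https://files.rcsb.org/download/<ID>.pdb` URL pattern. The function names and the `structures` folder are hypothetical, and this is not how Molecule World fetches its data.

```python
import re
import urllib.request
from pathlib import Path

# Assumed URL pattern for PDB-format files on the RCSB file server.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id):
    """Build the RCSB download URL for a four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    # Classic PDB IDs are a digit followed by three letters or digits.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id)


def download_structures(pdb_ids, dest_dir="structures"):
    """Download each structure into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for pdb_id in pdb_ids:
        url = pdb_download_url(pdb_id)
        target = dest / url.rsplit("/", 1)[-1]   # e.g. structures/1K0Z.pdb
        urllib.request.urlretrieve(url, target)  # real network request
        saved.append(target)
    return saved


# Example (performs a real download):
#     download_structures(["1K0Z"])
```

The ID check is deliberately strict so that a protein name pasted in by mistake fails early instead of producing a confusing download error.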

What's in the group of 116 databases under proteins? A lot of specialized things. My first try was 3D-Genomics. The link, http://www.sbg.bio.ic.ac.uk/~3dgenomics/, returns "The 3DGenomics server is no longer available." Again, seriously? It was then that I looked closely at the title of the NAR Molecular Biology Database Collection (category list page). It says "2014 NAR Database Summary Paper Category List." Did I get lost in the bowels of the NAR webpages? Going back to the top and rechecking the first link takes you to the top of the collection, where the title says "2014 NAR Database Summary Paper Alphabetic List." That's right, I got to the 2014 list from the 2015 introduction.

Maybe it was just random bad luck that led me to stumble on 4 of the 77 obsolete databases in my first 5 attempts to find cool things other people are doing. Or maybe you really can't judge a book by its cover.

Further Reading

http://finchtalk.blogspot.com/2011/01/databases-of-databases.html

http://finchtalk.blogspot.com/2012/01/bio-databases-2012.html

http://finchtalk.blogspot.com/2013/01/bio-databases-2013.html

http://scienceblogs.com/digitalbio/2014/01/09/bio-databases-2014


I love it

By Muawiyya Yakubu (not verified) on 02 Feb 2015

Thank you very much for your comments. While checking the database list in Nov. 2014, we noticed that SCOR was dead, but its authors promised to bring it back to life in the near future, so it was left in the list. We were not aware of the problems with the other three databases. We'll contact their authors to make sure they are no longer maintained. The problem is that these databases were last featured in the 2002, 2004 and 2008 NAR Database Issues. We ask the authors to promise to maintain their database for 5 years after the initial publication; after that it is largely up to them.
Please send your comments and reports of obsolete databases in the NAR database list to xose.m.fernandez@gmail.com and nardatabase@gmail.com

By Michael Galperin (not verified) on 04 Feb 2015

Thanks for the update and clarification, Michael. I will send a note to the emails as you suggest. I always enjoy the database issue and seeing what people are doing. As noted, my investigation was not systematic; I picked the DBs by their names and my interest for this year's post.