The Book of Trogool

E-research, cyberinfrastructure, data curation... an academic librarian confronts the way computers are changing academic research.

Profile

I'm Dorothea Salo, an academic librarian exploring the practices, processes, and praxis of e-research.

The basic carrot: usage statistics

Category: Tactics
Posted on: November 16, 2009

BMC Bioinformatics published this article describing a "data publishing framework" for biodiversity data.

Stripped to its essentials, this article is about carrots for data sharing. Acknowledging that cultural inertia (some of it well-founded) militates against spontaneous data sharing, the authors suggest a way forward.

I'm calling this one out because it has implications for storage-system design. The authors want three things for their public data: persistent identifiers, citation mechanisms, and data usage information.

(For once, I feel good about institutional repositories: they swing two out of three at the minimum, and some manage all three!)

Persistent identifiers seem simple but aren't, necessarily. For example, does a constantly changing dataset get a persistent identifier? How does that identifier know what it's identifying, in that case? Should a persistent identifier be just a URL? What if the domain name goes away or changes? (This is not an idle concern; the University of Illinois, for example, just changed its domain name, and the institutional repository I run is eventually going to lose its separate domain entirely.) What, exactly, gets a persistent identifier? The entire dataset? Files within it? Should a query performable on that dataset also be persistently identifiable? How does that work, exactly? And when does something get its persistent identifier? As soon as it hits the system? Or after it's done and blessed, if it ever is?

Anyway. All of this needs to be hashed out (so to speak). It's not optional, system designers.
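To make the domain-change worry concrete, here is a minimal sketch of one common approach: keep the identifier opaque and look up its current location in a table, so a move means updating the table rather than breaking citations. Everything named here (the `Resolver` class, the `id:dataset-123` identifier, the `example` hostnames) is hypothetical, and the sketch deliberately leaves the harder questions above (what gets an identifier, and when) unanswered.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit, urlunsplit


@dataclass
class Resolver:
    """Maps opaque, stable identifiers to whatever URL currently serves the item."""

    locations: dict = field(default_factory=dict)

    def register(self, identifier: str, url: str) -> None:
        # The identifier itself carries no hostname, so it survives a move.
        self.locations[identifier] = url

    def resolve(self, identifier: str) -> str:
        # Raises KeyError for identifiers that were never registered.
        return self.locations[identifier]

    def move_host(self, old_host: str, new_host: str) -> int:
        """Repoint every record on old_host to new_host; return how many changed."""
        changed = 0
        for identifier, url in list(self.locations.items()):
            parts = urlsplit(url)
            if parts.netloc == old_host:
                self.locations[identifier] = urlunsplit(parts._replace(netloc=new_host))
                changed += 1
        return changed


# Hypothetical usage: the repository loses its separate domain.
resolver = Resolver()
resolver.register("id:dataset-123", "https://repository.old.example/items/123")
moved = resolver.move_host("repository.old.example", "library.new.example")
current = resolver.resolve("id:dataset-123")
```

After the move, `current` points at the new host while the identifier that people cited is unchanged.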

Once that's sorted out, citation isn't actually a huge hurdle from where I'm sitting. It's not a technical problem; it's kicking the style manuals into acknowledging data and making citation formats for it.

Usage, now, that's a hurdle. It, too, is utterly necessary for cultural reasons, however. The culture of academia looks kindly on impact measurements, even hopelessly faulty ones. Somehow or other, research impact has to be measured for researchers' careers to advance. Data are no exception.

(In my professional neck of the woods, systems designers ignored the need for usage documentation for entirely too long, which has made my life as an IR manager extraordinarily difficult. I make this post in hopes of avoiding the same mistake in this new arena.)

What counts as a "use," exactly? How does "use" get harmonized over different kinds of access schemes? How does an API "use" compare with an entire download?

I don't know. I encourage systems designers not to get too hung up on such questions. Record all accesses and make the best decisions you can right now about how to present them. Yes, you'll have to rewrite the event-analysis code, probably more than once, so comment it well.

Do not, however, wait until you have all the answers to write an analyzer. If you do that, you're strangling the open-data movement in its crib. BMC Bioinformatics explains why.
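For the systems designers in the audience, a minimal sketch of "record everything, decide later" might look like the following. It assumes an append-only log of raw events plus a separate, disposable analyzer; the function names, the channel labels ("download", "api"), and the item identifier are all made up for illustration, and the definition of a "use" is exactly the kind of provisional guess described above.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def record_access(log, item_id, channel, detail=""):
    """Append one raw access event; nothing is interpreted at write time."""
    event = {
        "time": datetime.now(timezone.utc).isoformat(),
        "item": item_id,
        "channel": channel,  # hypothetical labels such as "download" or "api"
        "detail": detail,
    }
    log.append(json.dumps(event))


def summarize(log, counted_channels=("download", "api")):
    """Provisional definition of a 'use': any event on a counted channel.

    Expect to rewrite this function. Because the raw log is never modified,
    a later definition can be applied to old events too.
    """
    counts = Counter()
    for line in log:
        event = json.loads(line)
        if event["channel"] in counted_channels:
            counts[(event["item"], event["channel"])] += 1
    return counts


# Hypothetical usage: one full download and two API calls on the same dataset.
log = []
record_access(log, "dataset-42", "download")
record_access(log, "dataset-42", "api", "query=example")
record_access(log, "dataset-42", "api")
usage = summarize(log)
downloads_only = summarize(log, counted_channels=("download",))
```

The point of the split is that `summarize` can be thrown away and rewritten as often as needed, while the raw events stay comparable across every rewrite.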


Comments

1

You're very right that persistent identifiers for data will not be simple. A lot of work needs to be done to address some of the issues you identify, and others. I've been invited to the launch of something called DataCite at the BL in December (not sure I can go), which I hope will take this seriously.

On the more detailed questions of URL persistence you mention, there is no real substitute for institutional commitment. There is just no other way. Any reorganisation is likely to break existing URLs; many website redesigns do that. It MUST be part of the spec that this be dealt with, up-front, as part of the design process. The actual solution is dead simple if put in place beforehand, and sometimes really messy if left until later (effectively you have to build a redirect table, or a redirect algorithm). It can be done; the http://www.ukoln.ac.uk/elib URL has been working since 1995, despite being moved around half a dozen times by my UKOLN colleagues! It can be ignored, as OCLC did when taking in the RLG resources, to their shame, which meant that all links to http://www.rlg.org/ broke, including all links to RLG Diginews, the publication addressing digital preservation! Likewise when CURL became RLUK. All these could have been cheaply and simply avoided beforehand, but become expensive and difficult (and embarrassing) later.

grump!

Posted by: Chris Rusbridge | November 18, 2009

2

Absolutely agreed on all counts, Chris. Thanks for the tip about DataCite; I'll keep my eye on it!

I'm not worried at all about my upcoming IR move, because we have had a redirection infrastructure in place all along. All I have to do when the IR's subdomain goes away is redirect a bunch of handles, which should pose no difficulty whatever.

As with many issues surrounding data management, the sooner this is addressed, the easier it becomes!

Posted by: Dorothea Salo | November 18, 2009
