The Book of Trogool

The basic carrot: usage statistics

BMC Bioinformatics published this article describing a “data publishing framework” for biodiversity data.

Stripped to its essentials, this article is about carrots for data sharing. Acknowledging that cultural inertia (some of it well-founded) militates against spontaneous data sharing, the authors suggest a way forward.

I’m calling this one out because it has implications for storage-system design. The authors want three things for their public data: persistent identifiers, citation mechanisms, and data usage information.

(For once, I feel good about institutional repositories: they swing two out of three at the minimum, and some manage all three!)

Persistent identifiers seem simple but aren’t, necessarily. For example, does a constantly changing dataset get a persistent identifier? How does that identifier know what it’s identifying, in that case? Should a persistent identifier be just a URL? What if the domain name goes away or changes? (This is not an idle concern; the University of Illinois, for example, just changed its domain name, and the institutional repository I run is eventually going to lose its separate domain entirely.) What, exactly, gets a persistent identifier? The entire dataset? Files within it? Should a query performable on that dataset also be persistently identifiable? How does that work, exactly? And when does something get its persistent identifier? As soon as it hits the system? Or after it’s done and blessed, if it ever is?

Anyway. All of this needs to be hashed out (so to speak). It’s not optional, system designers.
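
To make the "should it be just a URL?" question concrete, here is a minimal sketch of the alternative: an identifier that stays fixed while the location behind it is allowed to change. Everything in it is an assumption for illustration (the Resolver class, the "dataset/1234" identifier, the example.edu URLs); it is not how any particular repository or identifier scheme works, and it answers none of the harder questions about granularity or timing.

```python
from dataclasses import dataclass, field


@dataclass
class Resolver:
    """Hypothetical lookup table: stable identifier -> current URL."""
    locations: dict = field(default_factory=dict)

    def register(self, identifier: str, url: str) -> None:
        # An identifier is assigned once and never reused for something else.
        if identifier in self.locations:
            raise ValueError(f"{identifier} is already assigned")
        self.locations[identifier] = url

    def move(self, identifier: str, new_url: str) -> None:
        # Only the location changes; the identifier people cite does not.
        if identifier not in self.locations:
            raise KeyError(identifier)
        self.locations[identifier] = new_url

    def resolve(self, identifier: str) -> str:
        return self.locations[identifier]


# Illustrative identifier and URLs only.
resolver = Resolver()
resolver.register("dataset/1234", "https://repository.example.edu/data/1234")

# The repository loses its separate domain; citations keep working.
resolver.move("dataset/1234", "https://library.example.edu/repository/data/1234")
current_url = resolver.resolve("dataset/1234")
```

The point of the design is only that a domain change becomes one update in one table rather than a broken citation.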

Once that’s sorted out, citation isn’t actually a huge hurdle from where I’m sitting. It’s not a technical problem; it’s kicking the style manuals into acknowledging data and making citation formats for it.

Usage, now, that’s a hurdle. It, too, is utterly necessary for cultural reasons, however. The culture of academia looks kindly on impact measurements, even hopelessly faulty ones. Somehow or other, research impact has to be measured for researchers’ careers to advance. Data are no exception.

(In my professional neck of the woods, systems designers ignored the need for usage documentation for entirely too long, which has made my life as an IR manager extraordinarily difficult. I make this post in hopes of avoiding the same mistake in this new arena.)

What counts as a “use” exactly? How does “use” get harmonized over different kinds of access schemes? How does an API “use” compare with an entire download?

I don’t know. I encourage systems designers not to get too hung up on such questions. Record all accesses and make the best decisions you can right now about how to present them. Yes, you’ll have to rewrite the event-analysis code, probably more than once, so comment it well.
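
A rough sketch of what I mean, under stated assumptions: the event fields ("id", "method", "time"), the method names ("download", "api"), and both function names are hypothetical, and a real system would write to durable storage rather than a Python list. The shape is what matters: recording is dumb and permanent, while the definition of a "use" lives in a separate, replaceable analyzer.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def record_access(log, identifier, method, when=None):
    """Append one raw access event. No judgment here about what a 'use' is."""
    when = when or datetime.now(timezone.utc)
    event = {"id": identifier, "method": method, "time": when.isoformat()}
    log.append(json.dumps(event))


def count_uses(log, counted_methods):
    """Today's best guess at 'use': events whose method is in counted_methods.

    Expect to rewrite this function. The raw log stays valid when you do.
    """
    counts = Counter()
    for line in log:
        event = json.loads(line)
        if event["method"] in counted_methods:
            counts[event["id"]] += 1
    return counts


# Illustrative events only.
log = []
record_access(log, "dataset/1234", "download")
record_access(log, "dataset/1234", "api")
record_access(log, "dataset/1234", "api")
record_access(log, "dataset/5678", "download")

# Two different answers to "what counts?", computed from the same raw log.
downloads_only = count_uses(log, {"download"})
everything = count_uses(log, {"download", "api"})
```

Because every access is kept, changing your mind about whether an API call counts costs one new analyzer run, not lost history.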

Do not, however, wait until you have all the answers to write an analyzer. If you do that, you’re strangling the open-data movement in its crib. BMC Bioinformatics explains why.

Comments

  1. #1 Chris Rusbridge
    November 18, 2009

    You’re very right that persistent identifiers for data will not be simple. A lot of work needs to be done to address some of the issues you identify, and others. I’ve been invited to the launch of something called DataCite at the BL in December (not sure I can go), which I hope will take this seriously.

    On the more detailed questions of URL persistence you mention, there is no real substitute for institutional commitment. There is just no other way. Any reorganisation is likely to break existing URLs; many website redesigns do that. It MUST be part of the spec that this be dealt with, up-front, as part of the design process. The actual solution is dead simple if put in place beforehand, and sometimes really messy if left until later (effectively you have to build a redirect table, or a redirect algorithm). It can be done; the http://www.ukoln.ac.uk/elib URL has been working since 1995, despite being moved around half a dozen times by my UKOLN colleagues! It can be ignored, as OCLC did when taking in the RLG resources, to their shame, which meant that all links to http://www.rlg.org/ broke, including all links to RLG Diginews, the publication addressing digital preservation! Likewise when CURL became RLUK. All these could have been cheaply and simply avoided beforehand, but become expensive and difficult (and embarrassing) later.

    grump!

  2. #2 Dorothea Salo
    November 18, 2009

    Absolutely agreed on all counts, Chris. Thanks for the tip about DataCite; I’ll keep my eye on it!

    I’m not worried at all about my upcoming IR move, because we have had a redirection infrastructure in place all along. All I have to do when the IR’s subdomain goes away is redirect a bunch of handles, which should pose no difficulty whatever.

    As with many issues surrounding data management, the sooner this is addressed, the easier it becomes!
