The basic carrot: usage statistics

By dsalo on November 16, 2009.

BMC Bioinformatics published this article describing a "data publishing framework" for biodiversity data.

Stripped to its essentials, this article is about carrots for data sharing. Acknowledging that cultural inertia (some of it well-founded) militates against spontaneous data sharing, the authors suggest a way forward.

I'm calling this one out because it has implications for storage-system design. The authors want three things for their public data: persistent identifiers, citation mechanisms, and data usage information.

(For once, I feel good about institutional repositories: they swing two out of three at the minimum, and some manage all three!)

Persistent identifiers seem simple but aren't, necessarily. For example, does a constantly changing dataset get a persistent identifier? How does that identifier know what it's identifying, in that case? Should a persistent identifier be just a URL? What if the domain name goes away or changes? (This is not an idle concern; the University of Illinois, for example, just changed its domain name, and the institutional repository I run is eventually going to lose its separate domain entirely.) What, exactly, gets a persistent identifier? The entire dataset? Files within it? Should a query performable on that dataset also be persistently identifiable? How does that work, exactly? And when does something get its persistent identifier? As soon as it hits the system? Or after it's done and blessed, if it ever is?

Anyway. All of this needs to be hashed out (so to speak). It's not optional, system designers.

Once that's sorted out, citation isn't actually a huge hurdle from where I'm sitting. It's not a technical problem; it's kicking the style manuals into acknowledging data and making citation formats for it.

Usage, now, that's a hurdle. It, too, is utterly necessary for cultural reasons, however. The culture of academia looks kindly on impact measurements, even hopelessly faulty ones. Somehow or other, research impact has to be measured for researchers' careers to advance. Data are no exception.

(In my professional neck of the woods, systems designers ignored the need for usage documentation for entirely too long, which has made my life as an IR manager extraordinarily difficult. I make this post in hopes of avoiding the same mistake in this new arena.)

What counts as a "use" exactly? How does "use" get harmonized over different kinds of access schemes? How does an API "use" compare with an entire download?

I don't know. I encourage systems designers not to get too hung up on such questions. Record all accesses and make the best decisions you can right now about how to present them. Yes, you'll have to rewrite the event-analysis code, probably more than once, so comment it well.
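To make "record everything, decide later" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `AccessEvent` fields, the channel names, and the dataset identifiers are hypothetical, and a real system would write to durable storage rather than a list. The point is only the split: the recorder keeps every raw access, and the analyzer holds today's provisional definition of a "use", so it can be rewritten without losing history.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AccessEvent:
    identifier: str  # persistent identifier of the dataset (hypothetical format)
    channel: str     # how it was reached, e.g. "download", "api", "landing_page"
    timestamp: str   # when the access happened, kept as recorded


def record(log: List[str], event: AccessEvent) -> None:
    """Append the raw event as one JSON line. Nothing is filtered at write time."""
    log.append(json.dumps(asdict(event)))


def count_uses(log: Iterable[str],
               counted_channels: Tuple[str, ...] = ("download", "api")) -> Counter:
    """One provisional definition of 'use': any event on a counted channel.

    This is the part expected to be rewritten as the definition changes.
    """
    counts: Counter = Counter()
    for line in log:
        event = json.loads(line)
        if event["channel"] in counted_channels:
            counts[event["identifier"]] += 1
    return counts


# Illustrative data only; the identifiers are made up.
log: List[str] = []
record(log, AccessEvent("dataset-A", "download", "2009-11-16T10:00:00Z"))
record(log, AccessEvent("dataset-A", "api", "2009-11-16T10:05:00Z"))
record(log, AccessEvent("dataset-A", "landing_page", "2009-11-16T10:06:00Z"))
record(log, AccessEvent("dataset-B", "api", "2009-11-16T11:00:00Z"))

uses_today = count_uses(log)
downloads_only = count_uses(log, counted_channels=("download",))
```

Because the log holds every access, changing your mind about whether an API call counts the same as a full download only means calling the analyzer with different rules (or replacing it); nothing has to be re-collected.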

Do not, however, wait until you have all the answers to write an analyzer. If you do that, you're strangling the open-data movement in its crib. BMC Bioinformatics explains why.


You're very right that persistent identifiers for data will not be simple. A lot of work needs to be done to address some of the issues you identify, and others. I've been invited to the launch of something called DataCite at the BL in December (not sure I can go), which I hope will take this seriously.

On the more detailed questions of URL persistence you mention, there is no real substitute for institutional commitment. There is just no other way. Any reorganisation is likely to break existing URLs; many website redesigns do that. It MUST be part of the spec that this be dealt with, up-front, as part of the design process. The actual solution is dead simple if put in place beforehand, and sometimes really messy if left until later (effectively you have to build a redirect table, or a redirect algorithm). It can be done; the http://www.ukoln.ac.uk/elib URL has been working since 1995, despite being moved around half a dozen times by my UKOLN colleagues! It can be ignored, as OCLC did when taking in the RLG resources, to their shame, which meant that all links to http://www.rlg.org/ broke, including all links to RLG Diginews, the publication addressing digital preservation! Likewise when CURL became RLUK. All these could have been cheaply and simply avoided beforehand, but become expensive and difficult (and embarrassing) later.

grump!

Absolutely agreed on all counts, Chris. Thanks for the tip about DataCite; I'll keep my eye on it!

I'm not worried at all about my upcoming IR move, because we have had a redirection infrastructure in place all along. All I have to do when the IR's subdomain goes away is redirect a bunch of handles, which should pose no difficulty whatever.

As with many issues surrounding data management, the sooner this is addressed, the easier it becomes!
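For anyone wondering what the "redirect table" mentioned in the comments above might look like, here is a rough sketch in Python. It is not how any particular repository or handle server does it; the `example.edu` hosts, the `REDIRECTS` table, and the `resolve` function are all hypothetical. It assumes whole URL prefixes move at once, and it follows the table repeatedly so that a URL survives several successive moves.

```python
from typing import Dict

# Hypothetical table: old URL prefix -> the prefix that replaced it.
# Each site move adds one row; old rows are never deleted.
REDIRECTS: Dict[str, str] = {
    "http://ir.example.edu/": "http://repository.example.edu/",
    "http://repository.example.edu/": "https://library.example.edu/ir/",
}


def resolve(url: str, table: Dict[str, str] = REDIRECTS, max_hops: int = 10) -> str:
    """Rewrite a URL through the table until no prefix matches.

    The longest matching prefix wins on each hop. Raises ValueError if
    the URL is still being rewritten after max_hops, which suggests a loop.
    """
    for _ in range(max_hops):
        match = max((p for p in table if url.startswith(p)), key=len, default=None)
        if match is None:
            return url
        url = table[match] + url[len(match):]
    raise ValueError(f"too many redirects, possible loop at {url}")


# A link minted before either move still lands in the right place.
current = resolve("http://ir.example.edu/handle/123")
# A URL that never moved passes through untouched.
untouched = resolve("https://other.example.org/page")
```

The table is cheap to keep if it is started before the first move. Reconstructing it afterwards, from whatever old links can still be found, is the messy version.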
