The dreaded backfile

One of the problems practically every nascent data-curation effort will have to deal with is what serials librarians call the backfile, though the rest of us use the blunter word backlog.

There's a lot of digital data (let's not even think about the analog for now) from old projects hanging around institutions. My institution. Your institution. Any institution. There may be wonderful data in there, but chances are they're in terrible condition: disorganized, poorly described if described at all, on perishable (and very possibly perished) physical media. This pile of mostly-undifferentiated stuff is what all the digital-heat-death-of-the-universe people are on about.

What to do about it? Make no mistake, it takes considerably more human ingenuity and effort to rescue data than to treat them right at the outset. If a small data-curation team just out of the starting gate tries seriously to come to grips with the backlog problem, it will almost certainly swamp itself, to the point that it won't be able to get in on the ground floor of new data-generating projects—which of course only perpetuates the problem.

I hate to say this, but… I believe we'll have to let a lot of those data lie. We can use some of the backlog to learn on; I would be inclined to start with data relating to a revered institutional priority such as theses and dissertations. We can possibly also pick up a few horses in midstream, researcher workflow permitting.

Grant agencies should look seriously at data-rescue projects, in my opinion. Grant funding is lousy for sustainability, but for rescue projects where the main effort is a one-time licking into shape and the sustainability is a given, grant funding makes a lot of sense. There's certainly no lack of data to rescue!

Still, I strongly believe that the principal priority of a new data-curation team should be new data, new workflows, and new research projects. Perpetually playing catch-up is not a good space to be in. Also, faculty aren't nearly as engaged with their old projects as their current ones, so for good word of mouth and campus visibility, working with current projects is the way to go.

Thanks to Chris Rusbridge for making me think about this. The answer I arrived at wasn't the one I expected.

A short reminder: I'm at Access 2009 the rest of this week. Blogging is liable to be nonexistent.


Back in my first IT job, I was working in a shop with a DG MV9600U and Macs connected to it via Pacer.

Thing was, we all had the office suite on our Macs and the DG was getting long in the tooth.

The DG offered a WordPerfect editor and we had hundreds of documents in it.

They had gone to a company for document conversion. The number came back at $60,000.

I wrote a CLI script that tossed the WP files over to a Novell server, then wrote some conversion code in Word to convert them and re-save them in Word format.

Saved us a bundle.
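In today's terms, that two-step might look something like the sketch below - Python driving Word over COM, rather than the original CLI script and in-Word conversion code. The share paths, the file extension, and the assumption that Word's WordPerfect import converter is installed are all mine, not the commenter's.

# Minimal sketch: batch-convert WordPerfect files to Word format by
# scripting Word itself. Assumes Windows, an installed copy of Word
# with its WordPerfect import converter, and the pywin32 package.
import pathlib
import win32com.client

WP_DIR = pathlib.Path(r"\\fileserver\docs\wp")     # hypothetical source share
OUT_DIR = pathlib.Path(r"\\fileserver\docs\word")  # hypothetical target share
OUT_DIR.mkdir(parents=True, exist_ok=True)

WD_FORMAT_DOCUMENT = 0  # Word's wdFormatDocument constant: native .doc

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    for wp_file in WP_DIR.glob("*.wp"):
        # Word's converter does the actual format translation on open.
        doc = word.Documents.Open(str(wp_file))
        doc.SaveAs(str(OUT_DIR / (wp_file.stem + ".doc")), WD_FORMAT_DOCUMENT)
        doc.Close(False)  # already saved; don't prompt
finally:
    word.Quit()

The loop itself is trivial; the leverage, then as now, came from a tool that already understood the source format.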

Sure. It's an easy problem if all your stuff is in one format (assuming that format is reverse-engineerable...) and you don't have to provide any documentation or organization for the files.

Research data don't live in that world.

Now that I think about it, data-curation consultants and a centralized secure data repository (and an administrative team to run it) should be provided by the institution. The overhead percentages are certainly high enough that these sorts of *necessary* services and facilities should be provided.

I'm reading this while procrastinating in the middle of a data-rescue project.

It is being funded as a no-cost extension to a previous grant - which is like being told you didn't do the project properly - you didn't even spend all the money! - but at least you can clean up after yourself.

But even so, doing a good job after the event needs a carrot - in this case, known future value. Hubris ("I always do a good job") is really not enough when faced with someone else's ill-considered data choices.

Alternatively, if there is no carrot, I guess a big stick would do it - you folk saw:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0007…

Savage CJ, Vickers AJ (2009) Empirical Study of Data Sharing by Authors Publishing in PLoS Journals. PLoS ONE 4(9): e7078. doi:10.1371/journal.pone.0007078

in which two researchers tried, and failed, to get hold of data from 9 out of 10 PLoS papers, despite an on-publication data-sharing requirement being in place?

That'd be a lack of a we-have-to-put-the-data-in-a-repository plan, then.

Here's an example I ran across recently of what it takes to make that old info available: www.moonviews.com
