The dreaded backfile

One of the problems practically every nascent data-curation effort will have to deal with is what serials librarians call the backfile, though the rest of us use the blunter word backlog.

There's a lot of digital data (let's not even think about the analog for now) from old projects hanging around institutions. My institution. Your institution. Any institution. There may be wonderful data in there, but chances are they're in terrible condition: disorganized, poorly described if described at all, on perishable (and very possibly perished) physical media. This pile of mostly-undifferentiated stuff is what all the digital-heat-death-of-the-universe people are on about.

What to do about it? Make no mistake, it takes considerably more human ingenuity and effort to rescue data than to treat them right at the outset. If a small data-curation team just out of the starting gate tries seriously to come to grips with the backlog problem, it will almost certainly swamp itself, to the point that it won't be able to get in on the ground floor of new data-generating projects—which of course only perpetuates the problem.

I hate to say this, but… I believe we'll have to let a lot of those data lie. We can use some of the backlog to learn on; I would be inclined to start with data relating to a revered institutional priority such as theses and dissertations. We can possibly also pick up a few horses in midstream, researcher workflow permitting.

Grant agencies should look seriously at data-rescue projects, in my opinion. Grant funding is lousy for sustainability, but for rescue projects where the main effort is a one-time licking into shape and the sustainability is a given, grant funding makes a lot of sense. There's certainly no lack of data to rescue!

Still, I strongly believe that the principal priority of a new data-curation team should be new data, new workflows, and new research projects. Perpetually playing catch-up is not a good space to be in. Also, faculty aren't nearly as engaged with their old projects as their current ones, so for good word of mouth and campus visibility, working with current projects is the way to go.

Thanks to Chris Rusbridge for making me think about this. The answer I arrived at wasn't the one I expected.

A short reminder: I'm at Access 2009 the rest of this week. Blogging is liable to be nonexistent.


Back in my first IT job, I was working in a shop with a DG MV9600U and Macs connected to it via Pacer.

Thing was, we all had the office suite on our Macs and the DG was getting long in the tooth.

The DG offered a WordPerfect editor and we had hundreds of documents in it.

They had gone to a company for document conversion. The number came back at $60,000.

I wrote a CLI script that tossed the WP files over to a Novell server, then wrote some conversion code in Word to convert them and re-save them in Word format.

Saved us a bundle.
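In today's terms, that two-step might look something like the sketch below - Python driving Word over COM, rather than the original CLI script and in-Word conversion code. The share paths, the file extension, and the assumption that Word's WordPerfect import converter is installed are all mine, not the commenter's.

# Minimal sketch: batch-convert WordPerfect files to Word format by
# scripting Word itself. Assumes Windows, an installed copy of Word
# with its WordPerfect import converter, and the pywin32 package.
import pathlib
import win32com.client

WP_DIR = pathlib.Path(r"\\fileserver\docs\wp")     # hypothetical source share
OUT_DIR = pathlib.Path(r"\\fileserver\docs\word")  # hypothetical target share
OUT_DIR.mkdir(parents=True, exist_ok=True)

WD_FORMAT_DOCUMENT = 0  # Word's wdFormatDocument constant: native .doc

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    for wp_file in WP_DIR.glob("*.wp"):
        # Word's converter does the actual format translation on open.
        doc = word.Documents.Open(str(wp_file))
        doc.SaveAs(str(OUT_DIR / (wp_file.stem + ".doc")), WD_FORMAT_DOCUMENT)
        doc.Close(False)  # already saved; don't prompt
finally:
    word.Quit()

The loop itself is trivial; the leverage, then as now, came from a tool that already understood the source format.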

Sure. It's an easy problem if all your stuff is in one format (assuming that format is reverse-engineerable...) and you don't have to provide any documentation or organization for the files.

Research data don't live in that world.

Now that I think about it, data-curation consultants and a centralized secure data repository (and an administrative team to run it) should be provided by the institution. The overhead percentages are certainly high enough that these sorts of *necessary* services and facilities should be provided.

I'm reading this while procrastinating in the middle of a data-rescue project.

It is being funded as a no-cost extension to a previous grant - which is like being told you didn't do the project properly - you didn't even spend all the money! - but at least you can clean up after yourself.

But even so, doing a good job after the event needs a carrot - in this case, known future value. Hubris ("I always do a good job") is really not enough when faced with someone else's ill-considered data choices.

Alternatively, if there is no carrot, I guess a big stick would do it - you folk saw:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0007…

Savage CJ, Vickers AJ (2009) Empirical Study of Data Sharing by Authors Publishing in PLoS Journals. PLoS ONE 4(9): e7078. doi:10.1371/journal.pone.0007078

in which two researchers tried, and failed, to get hold of data from 9 out of 10 PLoS papers, despite an on-publication data-sharing requirement being in place?

That'd be a lack of a we-have-to-put-the-data-in-a-repository plan, then.

Here's an example I ran across recently of what it takes to make that old info available: www.moonviews.com
