Many of my readers will already have seen the Nature special issue on data, data curation, and data sharing. If you haven’t, go now and read; it’s impossible to overestimate the importance of this issue turning up in such a widely-read venue.
I read the opening of “Data sharing: Empty archives” with a certain amount of bemusement, as one who has been running institutional repositories in libraries for four years. I think Bryn Nelson has confusingly conflated different notions of “data” in his discussion of the University of Rochester’s IR.
By the definition Nelson appears to be thinking about, anything digital is automatically data. Thus “dissertations, preprints, working papers, photographs, music scores”? This is not the definition I use for this weblog, nor is it (I believe I am qualified to say by now) what most academic libraries starting IRs had in mind, either.
For purposes of this weblog, the word “data” means the stuff coming out of the research process that isn’t prose aimed at a human audience. That’s loosey-goosey (most definitions of “data” are), but you get the general idea. A dissertation is not data. Neither is a preprint or a working paper. A photograph might be. A music score probably isn’t (how many music scores are research products?). Research data aren’t research documents.
(Historians of science, please close your eyes for a bit; you’re different. I know research documents are data to you. That’s still not what I mean by the term.)
I can’t and don’t speak for the University of Rochester; I don’t know what their IR’s collection-development policy is, nor what was going through their minds when the IR was on the planning table. I do know with fair certainty that for most IRs, the problem of data (in this weblog’s definition) wasn’t so much as a gleam in anybody’s eye at the outset. Indeed, for many IRs it still isn’t. Libraries started IRs hoping for open access to the journal literature and better access to and preservation of digital gray literature (dissertations, working papers, technical reports, et cetera).
Perhaps Rochester was an exception; again, I don’t know. But attributing the emptiness of IRs to a problem with data-sharing makes my head hurt. It doesn’t square at all with my lived experience of IRs.
Now, the emptiness problem meant that most if not all IRs expanded their collection and service scope, simply out of necessity. For an excellent, nuanced discussion of this phenomenon, read a Mellon grant report by Carole Palmer et al. Do datasets fall within IRs’ purview now? Well… maybe. Depends whom you ask.
I don’t want to wander off into the over-technical weeds here, so I’ll limit myself to remarking that the technology underlying most IRs (both hosted and roll-your-own) is extraordinarily poorly suited to much research data, having been optimized for documents. This is a serious stumbling block for IRs wanting to expand into data curation.
That problem aside, however, the important question of incentive remains. Even if we accept my division between research documents and research data, who is to say that institution-based data collection will work any better than institution-based document collection? Will data archives remain as empty as IRs?
I think not. Perhaps over-optimistically, but even so. I think not, and here’s why.
When IR managers went to faculty, hat in hand, asking for preprints and postprints, they were charging quixotically against a gigantic windmill: the existing scholarly-communication system, which as far as most faculty in most disciplines are concerned works just fine. Filling IRs with the peer-reviewed literature they were established to collect meant changing minds, hearts, and (most crucially) workflows. In practice, it was an impossible dream, especially as IR technology bears more than a little resemblance to the spavined nag Rosinante, and IR managers had little to wield by way of spear or shield.
Right. Now that I have run that metaphor into the ground and stomped its gravesite flat: why are data repositories different?
First, for most disciplines, there simply is no analogue to the existing scholarly-communication system where data are concerned. For pity’s sake, we haven’t even worked out how to cite data yet! Where researcher workflows and expectations are not yet formed, opportunity awaits.
Second, data repositories and their managers can offer real, meaningful help to researchers, in ways IRs either didn’t or couldn’t. Publishers are perceived, rightly or wrongly, as providing valuable service to research; IRs have not achieved that perception, so they have made few inroads among researchers. Data repositories, data librarians, and data technicians can solve real-world problems that many researchers are already feeling, and many more are likely to feel soon.
Third, the effects of widely-available data are amassing a fairly impressive track record even at this early date. Genomics, economics, literary text-mining, linguistics, name your disciplinary poison: digital data enable answers to more and different questions, faster. IRs? Not so much with the palpable effects (outside electronic theses and dissertations), I’m sorry to say.
Last, the regulatory framework around data seems likely to solidify a good deal faster than the framework around open access to publications. Part of this, of course, comes from not needing to push against entrenched interests, the way the NIH Public Access Policy had to fend off the large-publisher lobby. I understand funder hesitation surrounding the dearth of data standards and the dearth of sustainable repositories, but I’m willing to hazard that the right hands will eventually shake each other for all that to work itself out.
So on balance, I’m hopeful. Nothing is certain; sometimes the many ways to mess this up keep me awake at night. I see motion on so many levels, though, from individual researchers all the way up to huge government funders, that I think data curation is very nearly a foregone conclusion.
Ways and means? Well, I have to have some uncertainty around to keep this weblog active.