Top-down or bottom-up?

As I watch the environment around me for signs of data curation inside institutions, particularly in libraries, I seem to see two general classes of approach to the problem. One starts institution-wide, generally with a grand planning process. Another starts at the level of the individual researcher, lab, department or (at most) school; it may try to scale up from there, or it may remain happy as its own self-contained fief.

As with anything, there are costs and benefits to both approaches.

Some of the challenges of data-driven research carry costs and require infrastructure that only make sense at an institutional level at this juncture. Grid computing. Gigantic, well-managed disk. (Gigantic disk is fairly cheap. Gigantic well-managed disk will cost you. In my mental model of the universe, I include such things as periodic data audits and geographically-dispersed backups in the cost of disk.) Authorization and authentication, which is a bigger problem than you might think. Carrots and sticks, if the institution is serious about this.
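
To make "well-managed" concrete, here is roughly the sort of periodic fixity audit I have in mind. This is only a minimal sketch, assuming a stored manifest of SHA-256 checksums; the paths are made up.

```python
import hashlib
import json
from pathlib import Path


def sha256(path, chunk_size=1 << 20):
    """Checksum a file without reading it all into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(data_root, manifest_path):
    """Compare every file in the manifest against its recorded checksum.

    The manifest is assumed to be JSON of the form
    {"relative/path/to/file": "sha256 checksum", ...}.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for rel_path, recorded in manifest.items():
        full_path = Path(data_root) / rel_path
        if not full_path.exists():
            problems.append((rel_path, "missing"))
        elif sha256(full_path) != recorded:
            problems.append((rel_path, "checksum mismatch: possible bitrot"))
    return problems


if __name__ == "__main__":
    # Hypothetical locations; a real audit would run on a schedule,
    # cover every copy of the data, and report somewhere people actually look.
    for rel_path, problem in audit("/data/archive", "/data/archive/manifest.json"):
        print(f"{rel_path}: {problem}")
```

The point is not the script itself; it's that somebody has to own running it, forever, over everything. That is part of what "well-managed" costs.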

So it makes a certain amount of sense to try to tackle this problem as an institution. Where the institutional model falls down, I begin to suspect, is service beyond the bare provision of appropriate technology. Training and handholding. Outreach. Help with data-sustainability plans in grant proposals. Whipping data into shape for the long term. Advice on sustainability, process, documentation, standards—the nuts and bolts of managing data in a particular research enterprise.

Because data and their associated problems are as varied as the research that creates them, I just don't think it's possible to open a single-point-of-service "data curator" office and have that be an effective solution (save perhaps for extremely small, targeted problems like grant proposals). I do still believe that almost any reasonably bright, decently adventurous librarian or IT professional can walk into almost any research situation, get a read on it, and do good things for data. I've seen it happen! But the "getting a read" part takes time and a certain level of immersion. How can a single point of service, whose responsibility is to the entire institution, spend that much effort targeting specific research groups?

Simple. It can't. Moral of the story: data curation is not a Taylorist enterprise.

In practice, I suspect, institutions that create the Office of Data Curation without carefully considering what I just outlined will inexorably wind up serving only a small proportion of the institution's researcher population. It's quite likely to be the proportion of said population swimming in grant money and prestige, of course. The arts, humanities, and qualitative social sciences are most liable to be left hanging. I already see this happening one or two places I know of—not because they have bad or thoughtless people, not at all, but because good people have been handed an organizational structure ill-suited to the task at hand.

Can such a structure be made workable? Perhaps. It'd take some work from the grassroots. Were I in that situation, I'd be canvassing my campus for every single person on it—librarian, IT pro, grant administrator, researcher, graduate student, whoever—who "does data" in some way. Then I'd be working like crazy to turn them into a community of practice.

I admit I'm a little hazy on how communities of practice form and how they can be encouraged to form; I'm sure there's research on the subject (and would appreciate pointers to same). I must also admit that I've tried multiple times to form one around institutional repositories and quite resoundingly failed.

I can only say based on those failures that much depends on what the community-former has to offer, as well as how ready putative community members are to consider themselves part of a coherent community. In this case, how well would it work? I don't know. I'd want something fairly compelling to offer, to get the ball rolling—perhaps some of those institution-wide resources.

About data fiefs I don't have much to say. They exist already, notably in the quantitative social sciences. They seem to work quite well from a service perspective. Unfortunately, some of their technology practices, especially around data sustainability, set my teeth a bit on edge. Format migration? Audits against bitrot? Standards? Persistent, citable URLs for public data? Not so much, some places. And let us not even discuss what happens when the grant money runs out. These places usually aren't geared for the long term, though they do quite well in the medium (say, five to twenty-five years) from what I've seen.

If you think I think there's a sweet spot somewhere in the middle here, you know me entirely too well. At least some of the outlines of the ideal state seem clear: where the rubber meets the researcher, local staffing and control; where the problem goes beyond what local can responsibly or effectively manage, the institution steps in. Likewise, the institution has a responsibility to researchers who need data help but can't afford it locally, in their lab or school or department. There should not be coverage gaps.

By the way—there is, in fact, one organization common on research-university campuses that has learned to be (more or less) centralized while still providing discipline-aware, often discipline-specific, services. It does rather remarkable work serving all campus disciplines, as fairly and skillfully as an unjust world permits. A way out of the Taylorist paradox, perhaps!

What is this wonder organization? It's called "the library."

Good eve,
I hope you are well.
Small comments, plz.

I've been involved with a parallel set of situations for a significant bit of time. I'm a techie working for an agency of the state (WV).

Any organization is politics. Organizations within organizations are politics. Any centralized data management unit will be a political entity; it will be politics.
And politics appears to take its cues from whatever does not make us look officially bad.

The library model works reasonably well because its politics are indeed grassroots. That is, library people seem to be able to define themselves as purveyors rather than controllers. (Noting, too, that there appears to be a lot of internal promotion and recognition, and a significant amount of historical ethics.)

It's sticky. I've found that humans are quite fond of having hard control of any data that they perceive is critical to their being able to perform un-criticized.
Giving them this, or at least the appearance of this (e.g. with lots of background 'invisible' stuff) seems critical in initiating an approach which produces an improving environment.

And it's that improving environment idea which is hard to come across. Most people want the solution.

tq

By netjaeger on 28 Dec 2009

Yale seems to be making some moves in this direction. I saw a couple of jobs posted recently from this office: http://odai.research.yale.edu/. It looks like they're trying to bring data and digital content management efforts from various parts of the university together to share expertise and resources.

By Molly Dolan on 28 Dec 2009

So I see. I will definitely be watching with interest. Any Yale datafolk are cordially invited to comment!

Really nice post, Dorothea.

"In practice, I suspect, institutions that create the Office of Data Curation without carefully considering what I just outlined will inexorably wind up serving only a small proportion of the institution's researcher population. It's quite likely to be the proportion of said population swimming in grant money and prestige, of course."

I've definitely been witness to this. I really like how you stress the need to build a "community of practice" around data curation. Do any successful efforts at canvassing to help build this sort of collaboration spring to mind? I'm mainly looking for patterns to follow.

Ed, I wish I did. I think the Yale experiment bears watching; as Molly points out, they do seem to want something of this nature.

At MPOW, we have ComETS, which started as a listserv and some informal get-togethers and is now a major campus force in instructional technology. This is the kind of thing I'd want to see -- I just have no idea how ComETS did it, what ambient factors led it to succeed, or whether that model will apply well to data management communities.

Certainly, the idea of communities of practice is attractive, but looking at the last, say, 20-30 years I'd have to say that real changes actually grew from "killer apps" (as clichéd as that sounds). What has had more impact than the spreadsheet (for data) and the Web (for sharing)? I think we may be missing something obvious: users simply don't have the tools to do data "right." We use Excel, email, our desktop computer, FileMaker, etc. Honestly, it's a mess. I'd opt for using the expertise we do have less for instructing non-technical users and more for developing the tools that allow said users to get it right w/o even trying. We need tools that will make sure data that can be structured is properly structured as close to the point of "birth" as possible, and that it stays that way as it is shared, mashed up, etc.
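
A toy sketch of what I mean: a thin check that could run the moment a tabular file is saved, before it ever gets emailed around. The field names and rules here are made up, purely for illustration.

```python
import csv

# Hypothetical schema: which fields a record must have and how each is checked.
SCHEMA = {
    "specimen_id": lambda v: v.strip() != "",
    "collected_on": lambda v: len(v) == 10 and v[4] == "-" and v[7] == "-",  # crude YYYY-MM-DD check
    "latitude": lambda v: -90.0 <= float(v) <= 90.0,
    "longitude": lambda v: -180.0 <= float(v) <= 180.0,
}


def validate_row(row):
    """Return a list of problems with one record; an empty list means it passes."""
    problems = []
    for field, check in SCHEMA.items():
        value = row.get(field) or ""
        try:
            ok = check(value)
        except (ValueError, TypeError):
            ok = False
        if not ok:
            problems.append(f"{field}: bad value {value!r}")
    return problems


def validate_file(path):
    """Check every row of a CSV at the point of birth, before it is shared or mashed up."""
    with open(path, newline="") as f:
        # Row 1 is the header, so data rows start at line 2.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            for problem in validate_row(row):
                print(f"row {line_no}: {problem}")


if __name__ == "__main__":
    validate_file("field_observations.csv")  # made-up filename
```

The point is less this particular script than that the check happens automatically, where the data are created, instead of in a curation office years later.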