Cost and service models for data curation

In many of the data-curation talks and discussions I've attended, a distinction has been drawn between Big Science and small science, the latter sometimes being lumped with humanities research. I'm not sure this distinction completely holds up in practice—are the quantitative social sciences Big or small? What about medicine?—but there's definitely food for thought there.

Big Science produces big, basically homogeneous data from single research projects, on the order of terabytes in short timeframes. For Big Data, building enough reliable storage is a big deal; it's hard to even look at the rest of the problem until the storage piece is solved. Some in the data-curation space focus unabashedly, and exclusively, on Big Science—Lee Dirks's well-constructed and lucid talk at Harvard yesterday hinted that he is one of these. Standards for data tend to grow fairly quickly in Big Science environments, both de facto (because there's only one source for the data!) and de jure (as in astronomy, which is a fascinating story I'm not quite competent to tell).

Big Science also has big money. It can't be done at all otherwise. The corollary to big money is big teams of researchers and allies.

Small science is what those of us who work at colleges and universities are more accustomed to. Grants are small if they exist at all; research is generally a solo or single-lab endeavor. Research procedures are often ad-hoc, invented by the researcher like Minerva springing from the head of Jove. Data standards do not exist; as often as not, there isn't a critical mass of people doing similar enough work and willing enough to share data to come together to create a data standard.

It has been asserted that small science, taken as a whole, is likely to create more research data than Big Science. When I tracked this assertion toward its source some time ago, the source turned out to be an otherwise-unsupported statement in the Chronicle of Higher Education (can't link; article behind paywall). So I give you this assertion despite not having any proof for it other than intuition. It is intuitive: Big Science accounts for few researchers owing to its expense; small science is a horde, comparatively. Many small datasets add up startlingly fast, partly because storage for each one is less of an immediate issue, partly because the fundedness or Bigness of a science is not necessarily a good measure of its data requirements. (Any research creating high-def digital video in quantity right now is stuck in just as nasty a storage problem as Big Science.)

When I look at business models and processes for data curation, honestly, storage is the least interesting aspect of the problem to me. Partly this is privilege talking: where I work, the intricacies of digital storage are Somebody Else's Problem. All I have to do is find stuff to fill it up! Partly it's consciousness that this problem is absolutely being actively worked on—watch Dirks's presentation for examples. I have faith that the storage problem will be decently managed.

Mostly, though, it's that I'm a librarian, not a sysadmin. The problems that interest me about data are the description, discovery, format, interoperability, and human problems. And I can see a serious, scary human problem lurking under the Big-versus-small science question.

I'm going to hold it as axiomatic that on some level, all of the data arising from the research enterprise are equal in importance, at least potentially. We can't know a priori which researcher studying which phenomenon in which institution will produce data that make possible a startling insight. We triply can't know this a priori because of aftermarket (so to speak) data mashups. The original experiment may have been a bust, or the original observation apparently uninteresting, but just combine those data with other data and watch them fly!

It does not seem, though, that under the data regimes emerging, all data will receive equal care. Even within our own institutions, them that has the gold will make the rules, as "cost recovery" becomes the order of the day. Big Science has the gold. Small science doesn't, and neither do the humanities.

I wonder whether cost-recovery institutional cyberinfrastructure will manage to survive, honestly. (I hasten to say I don't know that it will fail, but I have misgivings.) Big Science has a history of funding and managing its own research-related services, even to running its own libraries. Why would data curation be the exception? Arguably it should be because of the long-term, past-grant-expiration sustainability requirement, but I don't think that argument has ever stopped Big Science before. So where are cost-recovery ops going to recoup their costs? Small science can't pay. And how is cost-recovery a viable business model for data that has to survive lean grant times, anyway?

There's a scale problem involved, too. Because Big Science creates lots of basically homogeneous data, once you're past the storage problem, the other problems are fairly efficient to solve. Once you've sorted out how to describe Big Science data, the procedures can be institutionalized, solved en masse over the entire project. Set it and forget it. Human-resource cost per terabyte of data: minimal, even absurdly small.

Small science, by comparison, creates lots of little pieces of highly heterogeneous data. Without standards, each piece will need individual attention if it is to be adequately described and future-proofed. Human-resource cost per terabyte of data: frightening. Certainly, some of these data will be relatively simple to cope with, and I do expect standards and practices to improve generally; it won't always be necessary to explain the idea of metadata to people. Even so—this is high-touch, high-expense work, even when the actual storage requirements are minimal!
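To make the amortization argument concrete, here is a toy back-of-envelope model. Every number in it is invented purely for illustration—none is drawn from any real project budget—but the shape of the arithmetic is the point: Big Science pays a large one-time setup cost spread over many terabytes, while small science pays a per-dataset handling cost on tiny datasets.

```python
# Toy cost model: human-effort cost per terabyte of curated data.
# All figures are hypothetical, chosen only to illustrate the
# amortization argument, not taken from any real budget.

def cost_per_tb(setup_hours, hours_per_dataset, datasets, tb_total,
                hourly_rate=50):
    """Total curation labor cost divided by total terabytes curated."""
    total_hours = setup_hours + hours_per_dataset * datasets
    return total_hours * hourly_rate / tb_total

# Big Science: one homogeneous pipeline, set up once, producing 500 TB;
# per-dataset handling is nearly free once the pipeline exists.
big = cost_per_tb(setup_hours=2000, hours_per_dataset=0.1,
                  datasets=1000, tb_total=500)

# Small science: no shared standard, so every dataset needs hours of
# individual attention, and the whole horde adds up to only 5 TB.
small = cost_per_tb(setup_hours=0, hours_per_dataset=8,
                    datasets=1000, tb_total=5)

print(f"Big Science:   ${big:,.0f} per TB")
print(f"small science: ${small:,.0f} per TB")
```

Under these made-up assumptions the small-science figure comes out hundreds of times higher per terabyte—which is the "frightening" ratio, even though the absolute storage involved is trivial.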

Where is the money to come from? I don't know. Until we all interrogate some of the assumptions underlying our business models, however, we won't be able to advance equitable solutions to the data-curation problem.


Thanks for sharing thoughts on this very important topic. There are almost certainly differences between so-called big science and small science, but it's really important to extrapolate from experience rather than assumptions. As you point out, it's hard to know definitively how the scale compares between the ends of the scientific spectrum. The reason that I strongly advocate that libraries should directly handle scientific data is that it will allow us to learn and build infrastructure accordingly rather than speculate without "kicking the tires."

I would submit that big science data may not be as homogeneous as you imagine and that addressing the associated curation challenges may not be as efficient as we might hope. The reality is that curation of big science is an unsolved problem, and if we don't embrace this reality, we might assume it's easy or solved and not give it the attention it deserves.

I may be reading too much into your post but it seems that you're proposing that small science doesn't have the funding support of big science. I would submit that most of the medical sciences fall into the small sciences category. The last time I checked, the NIH annual budget was about four times the size of the NSF annual budget. I would love to see NIH come up with its own version of DataNet (maybe even four times larger!). Your point about moving beyond grant funding is well taken and libraries should carefully consider what this might mean for their own budget planning. But let's not forget that both big and small science are supported by a great deal of federal (and other grant) funding. If we can convince scientists that paying for data curation as part of their research is a good investment, we stand to generate a good deal of support (if only for the start-up or early days of infrastructure building). The best way to do this will be to deeply and properly understand the requirements.

Finally, I would strongly urge the library community to focus on storage systems. Yes, it is primarily a technical issue, but if the library community does not engage the technical and vendor communities, especially toward defining requirements and policies that will influence storage development, we might end up with hardware that does not meet our evolving and challenging needs. Among other forums, the Preservation and Archiving Special Interest Group (PASIG) is a good place for such dialogue.

Sayeed Choudhury
Johns Hopkins University

By Sayeed Choudhury on 19 Sep 2009

Hi, Sayeed,

Thanks very much for your excellent comment. I wholeheartedly agree that libraries should not just throw up our hands and walk away from this problem. I have, however, already seen troubling signs myself that small science is getting lost (or worse, being intentionally abandoned) in the rush to serve Big Science.

I can understand the impulse, because money is tight all over, but it troubles me nonetheless. I don't believe that many processes that work well for Big Science will transfer well to small—you seem to disagree, which is fine; we are learning, and I could well be wrong!—and I'd rather see us working at both ends of the scale.

I didn't say this outright in my post, but what I foresee happening (partly because I seem to see it already happening) is that many Big Science projects will adopt the "embedded librarian" approach to data management, leaving everyone who can't afford to embed a whole librarian on their own. Will some grant shops be too small to do this, but big enough to throw some money at a campus data-curation operation? I expect so, sure.

Will that be enough money to fund the operation on a cost-recovery basis? Well, that is the question. I don't know the answer, but as you can imagine I'm keenly interested in it!

WRT embedded librarians: at SLA this year there was a presentation from one such. A geoscientist who specializes in "dirt" (the regolith of assorted planets, dwarf planets, moons, and asteroids) and in creating simulants decided he needed a librarian for his special database. He went to the NASA center library and asked to be issued one. The library went "huh?" So he hired his own with his own money. So there's a very nice dirt database that you can e-mail and request access to, presumably supported by NASA money and run by a small team with a librarian. Small science. I have no idea how this can be sustainable, or whether the librarian was just there for the start-up. What happens if/when the original scientist loses interest? Or wants to spend the money on his own research instead of librarian salary and overhead (paid to a contracting company)?

Sayeed @1, and WRT medical research:

If we can convince scientists that paying for data curation as part of their research is a good investment, we stand to generate a good deal of support (if only for the start-up or early days of infrastructure building). The best way to do this will be to deeply and properly understand the requirements.

In the UK, we seem to be employing the stick as well as the carrot, with both major public funders (the Medical Research Council and the Wellcome Trust) requiring that grant holders have a "data management and sharing plan", but then also funding it.

The implicit business model, I think, is that it is not the urge to preserve data, but the requirement to share it now, and the consequent need to describe and explain it, that leads to wider data use.

One side effect is the development of just-adequate standards. And once we have standards, the theory goes, the work of a repository is more or less trivial, right? Its function is reduced to being a portal for discovery, a bit of access management, and some nifty data-refresh tricks to avoid bit-rust.

Hmmm, perhaps I've stumbled on why neither the MRC nor the WT have anything resembling a repository, currently ...

Christina, that is a great story. Thanks for it!

Neil, yes, there is a certain amount of magic-pixie-dust thinking in this arena. Just as with institutional repositories, I expect a certain amount of experimentation and (bluntly) failure will be necessary before we own up to the real human-resource costs of data management.