Many people, first confronted with the idea of data curation, think it’s a storage problem. A commonly expressed notion is “give them enough disk and they’ll be fine.” Terabyte drives are cheap. Put one on the desk of every researcher, network it, and the problem evaporates, right?
Right?
Let me just ask a few questions about this approach.
- What happens when a drive on somebody’s desk fails?
- What do we do about the astronomers, physicists, and climatologists, who can eat a whole terabyte before breakfast and hardly notice?
- What do we do about the social scientists, medical researchers, and others who (necessarily) collect personally identifiable and/or confidential information and are ethically and often legally forbidden from exposing it to the world?
- How do we manage access to the drive on somebody’s desk in the case of a collaboration across institutions? Who owns the data then? What about collaborations with industry, where trade-secret law may come into play?
- What do we do about certain varieties of lab science where the actual data-generating work is done by graduate students? Do they get terabyte drives too? (With my own two ears I’ve heard IT professionals say in all seriousness “They’re just graduate students; we don’t have to worry about their data.”)
- What happens to somebody’s drive and the stuff on it when she retires or moves between institutions?
- Is this system going to satisfy grant funders and journals who require data-sharing or data-sustainability plans?
- If we assume that the goal for at least some researchers is to make data available to the world at some juncture, who ensures that these drives and the material on them are discoverable (presumably via the Web)? Encoded adequately and sustainably, and in line with disciplinary data standards if any? Data-dictionaried, described, user-interfaced, and in a stable location with a stable identifier? (You can tell me “the researcher will!” I will laugh at you, but you can tell me that, sure.)
- The institution owns the drive. Does the institution own the data on it? If not, what can the institution realistically do to shepherd those data?
- What happens to somebody’s drive when a patent, trade-secret, or copyright lawsuit is in play? (“You can’t copyright data!” Hush, young padawan, and think of Europe, where databases have legal protection of their own.)
- Who’s to say this drive gets used for research data instead of somebody’s mp3 collection (or worse)? (Modify the question appropriately for music researchers, of course.)
- If the data need to go into a disciplinary or governmental repository of some kind, how does that happen?
- Who checks for and deals with bitrot, file-format migration, proprietary file formats, and similar hassles?
Whew. I’m afraid that’s more than a few questions. Sorry about that. I hope my point is clear: data curation is a complicated problem! “Let Them Eat Disk” will not solve it; big disk is unquestionably necessary, but far from sufficient.