Many people, first confronted with the idea of data curation, think it's a storage problem. A commonly expressed notion is "give them enough disk and they'll be fine." Terabyte drives are cheap. Put one on the desk of every researcher, network it, and the problem evaporates, right?
Right?
Let me just ask a few questions about this approach.
- What happens when a drive on somebody's desk fails?
- What do we do about the astronomers, physicists, and climatologists, who can eat a whole terabyte before breakfast and hardly notice?
- What do we do about the social scientists, medical researchers, and others who (necessarily) collect personally identifiable and/or confidential information and are ethically and often legally forbidden from exposing it to the world?
- How do we manage access to the drive on somebody's desk in the case of a collaboration across institutions? Who owns the data then? What about collaborations with industry, where trade-secret law may come into play?
- What do we do about certain varieties of lab science where the actual data-generating work is done by graduate students? Do they get terabyte drives too? (With my own two ears I've heard IT professionals say in all seriousness "They're just graduate students; we don't have to worry about their data.")
- What happens to somebody's drive and the stuff on it when she retires or moves between institutions?
- Is this system going to satisfy grant funders and journals who require data-sharing or data-sustainability plans?
- If we assume that the goal for at least some researchers is to make data available to the world at some juncture, who ensures that these drives and the material on them are discoverable (presumably via the Web)? Encoded adequately and sustainably, and in line with disciplinary data standards if any? Data-dictionaried, described, user-interfaced, and in a stable location with a stable identifier? (You can tell me "the researcher will!" I will laugh at you, but you can tell me that, sure.)
- The institution owns the drive. Does the institution own the data on it? If not, what can the institution realistically do to shepherd those data?
- What happens to somebody's drive when a patent, trade-secret, or copyright lawsuit is in play? ("You can't copyright data!" Hush, young padawan, and think of Europe.)
- Who's to say this drive gets used for research data instead of somebody's mp3 collection (or worse)? (Modify the question appropriately for music researchers, of course.)
- If the data need to go into a disciplinary or governmental repository of some kind, how does that happen?
- Who checks for and deals with bitrot, file-format migration, proprietary file formats, and similar hassles? (A bare-bones version of that checking is sketched just after this list.)
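To make that last question concrete: at its bare minimum, "checking for bitrot" means keeping a manifest of checksums and re-verifying it on a schedule. A minimal sketch in Python, with hypothetical paths; real repositories use dedicated fixity tools rather than hand-rolled scripts, but the underlying operation is just this:

```python
import hashlib
import json
from pathlib import Path

DATA_DIR = Path("research_data")  # hypothetical data directory
MANIFEST = Path("manifest.json")  # hypothetical checksum manifest

def sha256(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest() -> None:
    """Record a checksum for every file under DATA_DIR."""
    manifest = {str(p): sha256(p) for p in DATA_DIR.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def audit() -> list[str]:
    """Return files whose current checksum no longer matches the manifest."""
    manifest = json.loads(MANIFEST.read_text())
    return [name for name, recorded in manifest.items()
            if not Path(name).is_file() or sha256(Path(name)) != recorded]

if __name__ == "__main__":
    if MANIFEST.exists():
        for damaged in audit():
            print(f"FIXITY FAILURE: {damaged}")
    else:
        write_manifest()
```

Notice that the script only detects damage; deciding who runs it, who reads its output, and who repairs what it finds is exactly the kind of question this list is about.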
Whew. I'm afraid that's more than a few questions. Sorry about that. I hope my point is clear: data curation is a complicated problem! "Let Them Eat Disk" will not solve it; big disk is unquestionably necessary, but far from sufficient.
What happens when someone develops a completely new data-storage format, and in five years you can no longer find a device capable of reading all the data you've stored? (Think of all that 7-track tape in storage that NASA has, or even the humble 5.25" floppy. All effectively lost.)
It seems to me that the trick is not to store data at all, but to keep it continuously in use.
Absolutely correct, HP. There's no point to all this work if the data aren't usable and used (aka "we're not doing this for our health").
But 5 1/4" floppies aren't completely unsalvageable yet. Expensive to salvage, sure.
There are two big things missing from a simple one-drive external storage solution: redundancy and backups.
I just installed my first terabyte drive last week, and I've yet to put anything on it because I'm worried about what will happen when the drive fails. With sub-$100 prices on terabyte drives, what one really needs is two drives in a mirrored RAID configuration and, potentially, a third to sit in a drawer in case one of those fails.
That still doesn't get to what happens if lightning hits and fries both drives or if you lose an external drive. For that you need periodic backups to yet another drive, preferably one that's offsite from your current location (what happens if your office burns to the ground?).
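What I'm describing is essentially the "3-2-1" rule of thumb: three copies, on two kinds of media, one of them offsite. Roughly like this sketch, assuming rsync is installed and using entirely hypothetical paths:

```python
import subprocess

# Hypothetical locations: a working copy, a local mirror drive, an offsite host.
SOURCE = "/home/researcher/data/"
LOCAL_MIRROR = "/mnt/mirror_drive/data/"
OFFSITE = "backup@offsite.example.edu:/backups/data/"

def mirror(src: str, dst: str) -> None:
    """Mirror src to dst; --archive preserves permissions and timestamps."""
    subprocess.run(["rsync", "--archive", "--delete", src, dst], check=True)

if __name__ == "__main__":
    mirror(SOURCE, LOCAL_MIRROR)  # second copy, second device
    mirror(SOURCE, OFFSITE)       # third copy, offsite (survives the office fire)
```

Mind you, a scheduled mirror answers the lightning strike and the fire, not the accidental deletion: --delete faithfully propagates your mistakes, so versioned backups are a separate layer on top.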
Your other questions are institutional in nature; all institutions face similar problems, and I'm completely unqualified to address them. But what am I talking about? This is the Internet! Onward!
What I think any sizeable institution needs is a massive central database for storing data that needs to be saved and/or shared: something on the order of a few petabytes if you have quite a few terabyte-before-breakfast people on staff. The data could be made locally available depending on the security requirements and then synced up with the central database periodically. That database could then be professionally managed, with offsite backups; it's far easier to do that with one big system than with hundreds of disparate small ones. This is far from trivial to implement, however, and the biggest hurdle isn't technological so much as organizational: how do you organize the data of hundreds of researchers working on dozens of projects, and, more importantly, how do you get people to stick with that schema and actually use it?
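To make the organizational hurdle concrete: even the smallest version of "a schema people stick with" means the central system refuses any deposit that arrives without a minimum metadata record. A sketch of what such a record might look like, with entirely made-up field names:

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class Access(Enum):
    PUBLIC = "public"
    CAMPUS = "campus"
    RESTRICTED = "restricted"  # e.g. personally identifiable data

@dataclass
class DepositRecord:
    """Minimum metadata the central store demands before accepting files."""
    project: str
    owner: str       # a person, not "the grad student's old laptop"
    steward: str     # who answers for the data after the owner moves on
    access: Access
    checksum_sha256: str
    deposited: date = field(default_factory=date.today)
    description: str = ""

def validate(record: DepositRecord) -> None:
    """Refuse deposits that dodge the schema."""
    if not record.description:
        raise ValueError("Undescribed data is undiscoverable data.")
    if len(record.checksum_sha256) != 64:
        raise ValueError("Deposits must arrive with a SHA-256 checksum.")

# An example deposit that passes validation:
record = DepositRecord(
    project="coral-reef-survey",
    owner="A. Researcher",
    steward="Departmental data manager",
    access=Access.CAMPUS,
    checksum_sha256="0" * 64,
    description="2009 field measurements, CSV; see the data dictionary.",
)
validate(record)
```

The code is the trivial part; getting hundreds of researchers to fill in the steward field honestly, and to keep filling it in, is the part that sinks these systems.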
#1 is about backups, right? The other side of the coin is missing: what happens when the failed drive goes to the garbage heap? The data is probably still recoverable in India or wherever electronic scrap is sent for recycling (see #3).
Un-backed-up big disk is pretty useless, yes, absolutely.
Mike, "institutional"? I see individual, institutional, departmental, disciplinary, funder, governmental, and IT problems. The original Big Ball o' Mud!
Lassi, that's an excellent point and thank you for making it. Anyone with sensitive data needs to be assured that it is both appropriately managed while it is live and properly disposed of once it reaches end-of-life. Records managers are experts at this sort of question.
Digital preservation is in its infancy despite a decade's effort. Large hard drives have all the problems you mentioned and more; they simply cost too much in electricity to keep spinning. A systematic program of migration to the next technology is the basic idea. We attack it as a generational problem: how do we preserve what we have on appropriately retrievable media until the next generation of preservationists can take it over? Sun sponsors an interest group, the Preservation and Archiving Special Interest Group, which includes many bright lights. None have any silver bullets.
Hear, hear.
But we can't let the perfect be the enemy of the good.