Many people, first confronted with the idea of data curation, think it's a storage problem. A commonly expressed notion is "give them enough disk and they'll be fine." Terabyte drives are cheap. Put one on the desk of every researcher, network it, and the problem evaporates, right?
Right?
Let me just ask a few questions about this approach.
- What happens when a drive on somebody's desk fails?
- What do we do about the astronomers, physicists, and climatologists, who can eat a whole terabyte before breakfast and hardly notice?
- What do we do about the social scientists, medical researchers, and others who (necessarily) collect personally identifiable and/or confidential information and are ethically and often legally forbidden from exposing it to the world?
- How do we manage access to the drive on somebody's desk in the case of a collaboration across institutions? Who owns the data then? What about collaborations with industry, where trade-secret law may come into play?
- What do we do about certain varieties of lab science where the actual data-generating work is done by graduate students? Do they get terabyte drives too? (With my own two ears I've heard IT professionals say in all seriousness "They're just graduate students; we don't have to worry about their data.")
- What happens to somebody's drive and the stuff on it when she retires or moves between institutions?
- Is this system going to satisfy grant funders and journals who require data-sharing or data-sustainability plans?
- If we assume that the goal for at least some researchers is to make data available to the world at some juncture, who ensures that these drives and the material on them are discoverable (presumably via the Web)? Encoded adequately and sustainably, and in line with disciplinary data standards if any? Data-dictionaried, described, user-interfaced, and in a stable location with a stable identifier? (You can tell me "the researcher will!" I will laugh at you, but you can tell me that, sure.)
- The institution owns the drive. Does the institution own the data on it? If not, what can the institution realistically do to shepherd those data?
- What happens to somebody's drive when a patent, trade-secret, or copyright lawsuit is in play? ("You can't copyright data!" Hush, young padawan, and think of Europe.)
- Who's to say this drive gets used for research data instead of somebody's mp3 collection (or worse)? (Modify the question appropriately for music researchers, of course.)
- If the data need to go into a disciplinary or governmental repository of some kind, how does that happen?
- Who checks for and deals with bitrot, file-format migration, proprietary file formats, and similar hassles?
Whew. I'm afraid that's more than a few questions. Sorry about that. I hope my point is clear: data curation is a complicated problem! "Let Them Eat Disk" will not solve it; big disk is unquestionably necessary, but far from sufficient.
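To make just one of those questions concrete, take the bitrot one. A common way to notice silent corruption is a fixity check: record a checksum for every file, then periodically recompute and compare. What follows is only a minimal sketch in Python, assuming the data sit in an ordinary directory tree and that SHA-256 is an acceptable checksum; the function names (`sha256_of`, `build_manifest`, `audit`) are my own illustrations, not any particular repository tool's API.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root against a previously stored manifest."""
    current = build_manifest(root)
    return {
        # Recorded earlier, but no longer on disk.
        "missing": sorted(set(manifest) - set(current)),
        # On disk now, but never recorded.
        "unexpected": sorted(set(current) - set(manifest)),
        # Present in both, but the bytes differ (edit or corruption).
        "changed": sorted(
            name
            for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
    }
```

Note how little this settles. Somebody still has to run the audit on a schedule, keep the manifest somewhere other than the drive being audited, tell a deliberate edit from corruption, and hold a good copy to restore from. Those are people-and-process questions, not disk questions.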
What happens when someone develops a completely new data-storage format, and in five years you can no longer find a device capable of reading all the data you've stored? (Think of all that 7-bit tape in storage that NASA has, or even the humble 5.25" floppy. All effectively lost.)
It seems to me that the trick is not to store data at all, but to keep it continuously in use.
Absolutely correct, HP. There's no point to all this work if the data aren't usable and used (aka "we're not doing this for our health").
But 5 1/4" floppies aren't completely unsalvageable yet. Expensive to salvage, sure.
There are two big things missing from a simple one-drive external storage solution: redundancy and backups.
I just installed my first terabyte drive last week and I've yet to put anything on it because I'm worried about what will happen when the drive fails. With the sub-$100 prices of TB drives, what one really needs is 2 drives in a mirrored RAID configuration and, potentially, a third one to sit in a drawer in case one of those fails.
That still doesn't get to what happens if lightning hits and fries both drives or if you lose an external drive. For that you need periodic backups to yet another drive, preferably one that's offsite from your current location (what happens if your office burns to the ground?).
Your other questions are institutional in nature and all institutions face similar problems and I'm completely unqualified to address them. What am I talking about, this is the Internet! Onward!
What I think any sizeable institution needs is a central massive database for storing data that needs to be saved and/or shared. Something on the order of a few petabytes if you have quite a few Terabyte-before-breakfast people on staff. The data could be made locally available depending on the security requirements and then synced up with the central database periodically. Then that database could be professionally managed with offsite backups. Far easier to do that with one big system than hundreds of disparate small ones. However, this is far from trivial to implement. The biggest hurdle isn't technological so much as organizational. How do you organize the data of hundreds of researchers working on dozens of projects and, more importantly, how do you get people to stick with that schema and actually use it?
#1 is about backups, right? The other side of the coin is missing: what happens when the failed drive goes to the garbage heap? The data are probably still recoverable in India or wherever electronic scrap is sent for recycling (see #3).
Un-backed-up big disk is pretty useless, yes, absolutely.
Mike, "institutional"? I see individual, institutional, departmental, disciplinary, funder, governmental, and IT problems. The original Big Ball o' Mud!
Lassi, that's an excellent point and thank you for making it. Anyone with sensitive data needs to be assured that it is both appropriately managed while it is live and properly disposed of once it reaches end-of-life. Records managers are experts at this sort of question.
Digital preservation is in its infancy despite a decade's effort. Large hard drives have all the problems you mentioned and more; they simply cost too much in electricity to keep spinning. A systematic program of migration to the next technology is the basic idea. We attack it as a generational problem: how do we preserve what we have on an appropriately retrievable medium until the next generation of preservationists can take it over? Sun sponsors an interest group, the Preservation and Archiving Special Interest Group, which includes many bright lights. None have any silver bullets.
Hear, hear.
But we can't let the perfect be the enemy of the good.