The Book of Trogool

Let Them Eat Disk

Many people, first confronted with the idea of data curation, think it’s a storage problem. A commonly expressed notion is “give them enough disk and they’ll be fine.” Terabyte drives are cheap. Put one on the desk of every researcher, network it, and the problem evaporates, right?

Right?

Let me just ask a few questions about this approach.

  1. What happens when a drive on somebody’s desk fails?
  2. What do we do about the astronomers, physicists, and climatologists, who can eat a whole terabyte before breakfast and hardly notice?
  3. What do we do about the social scientists, medical researchers, and others who (necessarily) collect personally identifiable and/or confidential information and are ethically and often legally forbidden from exposing it to the world?
  4. How do we manage access to the drive on somebody’s desk in the case of a collaboration across institutions? Who owns the data then? What about collaborations with industry, where trade-secret law may come into play?
  5. What do we do about certain varieties of lab science where the actual data-generating work is done by graduate students? Do they get terabyte drives too? (With my own two ears I’ve heard IT professionals say in all seriousness “They’re just graduate students; we don’t have to worry about their data.”)
  6. What happens to somebody’s drive and the stuff on it when she retires or moves between institutions?
  7. Is this system going to satisfy grant funders and journals who require data-sharing or data-sustainability plans?
  8. If we assume that the goal for at least some researchers is to make data available to the world at some juncture, who ensures that these drives and the material on them are discoverable (presumably via the Web)? Encoded adequately and sustainably, and in line with disciplinary data standards if any? Data-dictionaried, described, user-interfaced, and in a stable location with a stable identifier? (You can tell me “the researcher will!” I will laugh at you, but you can tell me that, sure.)
  9. The institution owns the drive. Does the institution own the data on it? If not, what can the institution realistically do to shepherd those data?
  10. What happens to somebody’s drive when a patent, trade-secret, or copyright lawsuit is in play? (“You can’t copyright data!” Hush, young padawan, and think of Europe.)
  11. Who’s to say this drive gets used for research data instead of somebody’s mp3 collection (or worse)? (Modify the question appropriately for music researchers, of course.)
  12. If the data need to go into a disciplinary or governmental repository of some kind, how does that happen?
  13. Who checks for and deals with bitrot, file-format migration, proprietary file formats, and similar hassles?

Whew. I’m afraid that’s more than a few questions. Sorry about that. I hope my point is clear: data curation is a complicated problem! “Let Them Eat Disk” will not solve it; big disk is unquestionably necessary, but far from sufficient.
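
To make question 13 a little more concrete: one small piece of the bitrot hassle is a fixity check, meaning you record a checksum for every file and recompute it later to see whether anything has silently changed. Here is a minimal Python sketch, assuming SHA-256 is an acceptable checksum and that the manifest itself is kept somewhere safer than the drive being audited. The function names (`sha256_of`, `build_manifest`, `audit`) are made up for illustration; they are not any particular repository's tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to spare memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Compare the files under root against a previously saved manifest."""
    current = build_manifest(root)
    return {
        # Recorded earlier, gone now.
        "missing": sorted(set(manifest) - set(current)),
        # Still present, but the bytes no longer match.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Present now, never recorded.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

Somebody would run `build_manifest` when the data are deposited, store the result, and run `audit` on a schedule. Note that this only detects damage. Repairing it still takes a good copy somewhere else, and somebody whose job it is to notice.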

Comments

  1. #1 HP
    August 20, 2009

    What happens when someone develops a completely new data-storage format, and in five years you can no longer find a device capable of reading all the data you’ve stored? (Think of all that 7-bit tape in storage that NASA has, or even the humble 5.25″ floppy. All effectively lost.)

    It seems to me that the trick is not to store data at all, but to keep it continuously in use.

  2. #2 Dorothea Salo
    August 20, 2009

    Absolutely correct, HP. There’s no point to all this work if the data aren’t usable and used (aka “we’re not doing this for our health”).

    But 5 1/4″ floppies aren’t completely unsalvageable yet. Expensive to salvage, sure.

  3. #3 Mike Webster
    August 20, 2009

    There are two big things missing with a simple one-drive external storage solution: redundancy and backups.

    I just installed my first terabyte drive last week and I’ve yet to put anything on it because I’m worried about what will happen when the drive fails. With the sub-$100 prices of TB drives, what one really needs is 2 drives in a mirrored RAID configuration and, potentially, a third one to sit in a drawer in case one of those fails.

    That still doesn’t get to what happens if lightning hits and fries both drives or if you lose an external drive. For that you need periodic backups to yet another drive, preferably one that’s offsite from your current location (what happens if your office burns to the ground?).

    Your other questions are institutional in nature and all institutions face similar problems and I’m completely unqualified to address them. What am I talking about, this is the Internet! Onward!

    What I think any sizeable institution needs is a central massive database for storing data that needs to be saved and/or shared. Something on the order of a few petabytes if you have quite a few Terabyte-before-breakfast people on staff. The data could be made locally available depending on the security requirements and then synced up with the central database periodically. Then that database could be professionally managed with offsite backups. Far easier to do that with one big system than hundreds of disparate small ones. However, this is far from trivial to implement. The biggest hurdle isn’t technological so much as organizational. How do you organize the data of hundreds of researchers working on dozens of projects and, more importantly, how do you get people to stick with that schema and actually use it?

  4. #4 Lassi Hippeläinen
    August 21, 2009

    #1 is about backups, right? The other side of the coin is missing: what happens when the failed drive goes to the garbage heap? The data is probably still recoverable in India or wherever electronic scrap is sent for recycling (see #3).

  5. #5 Dorothea Salo
    August 21, 2009

    Un-backed-up big disk is pretty useless, yes, absolutely.

    Mike, “institutional”? I see individual, institutional, departmental, disciplinary, funder, governmental, and IT problems. The original Big Ball o’ Mud!

    Lassi, that’s an excellent point and thank you for making it. Anyone with sensitive data needs to be assured that it is both appropriately managed while it is live and properly disposed of once it reaches end-of-life. Records managers are experts at this sort of question.

  6. #6 Sam
    August 24, 2009

    Digital preservation is in its infancy despite a decade’s effort. Large hard drives have all the problems you mentioned and more; they simply cost too much in electricity to keep spinning. A systematic program of migration to the next technology is the basic idea. We attack it as a generational problem – how do we preserve what we have in an appropriately retrievable medium until the next generation of preservationists can take it over? Sun sponsors an interest group, the Preservation and Archiving Special Interest Group, which includes many bright lights. None have any silver bullets.

  7. #7 Dorothea Salo
    August 24, 2009

    Hear, hear.

    But we can’t let the perfect be the enemy of the good.
