Many people, first confronted with the idea of data curation, think it's a storage problem. A commonly expressed notion is "give them enough disk and they'll be fine." Terabyte drives are cheap. Put one on the desk of every researcher, network it, and the problem evaporates, right?
Right?
Let me just ask a few questions about this approach.
- What happens when a drive on somebody's desk fails?
- What do we do about the astronomers, physicists, and climatologists, who can eat a whole terabyte before breakfast and hardly notice?
- What do we do about the social scientists, medical researchers, and others who (necessarily) collect personally identifiable and/or confidential information and are ethically and often legally forbidden from exposing it to the world?
- How do we manage access to the drive on somebody's desk in the case of a collaboration across institutions? Who owns the data then? What about collaborations with industry, where trade-secret law may come into play?
- What do we do about certain varieties of lab science where the actual data-generating work is done by graduate students? Do they get terabyte drives too? (With my own two ears I've heard IT professionals say in all seriousness "They're just graduate students; we don't have to worry about their data.")
- What happens to somebody's drive and the stuff on it when she retires or moves between institutions?
- Is this system going to satisfy grant funders and journals who require data-sharing or data-sustainability plans?
- If we assume that the goal for at least some researchers is to make data available to the world at some juncture, who ensures that these drives and the material on them are discoverable (presumably via the Web)? Encoded adequately and sustainably, and in line with disciplinary data standards if any? Data-dictionaried, described, user-interfaced, and in a stable location with a stable identifier? (You can tell me "the researcher will!" I will laugh at you, but you can tell me that, sure.)
- The institution owns the drive. Does the institution own the data on it? If not, what can the institution realistically do to shepherd those data?
- What happens to somebody's drive when a patent, trade-secret, or copyright lawsuit is in play? ("You can't copyright data!" Hush, young padawan, and think of Europe.)
- Who's to say this drive gets used for research data instead of somebody's mp3 collection (or worse)? (Modify the question appropriately for music researchers, of course.)
- If the data need to go into a disciplinary or governmental repository of some kind, how does that happen?
- Who checks for and deals with bitrot, file-format migration, proprietary file formats, and similar hassles? (A bare-bones version of that checking is sketched just after this list.)
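To make that last question concrete: at its bare minimum, "checking for bitrot" means keeping a manifest of checksums and re-verifying it on a schedule. A minimal sketch in Python, with hypothetical paths; real repositories use dedicated fixity tools rather than hand-rolled scripts, but the underlying operation is just this:

```python
import hashlib
import json
from pathlib import Path

DATA_DIR = Path("research_data")  # hypothetical data directory
MANIFEST = Path("manifest.json")  # hypothetical checksum manifest

def sha256(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest() -> None:
    """Record a checksum for every file under DATA_DIR."""
    manifest = {str(p): sha256(p) for p in DATA_DIR.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def audit() -> list[str]:
    """Return files whose current checksum no longer matches the manifest."""
    manifest = json.loads(MANIFEST.read_text())
    return [name for name, recorded in manifest.items()
            if not Path(name).is_file() or sha256(Path(name)) != recorded]

if __name__ == "__main__":
    if MANIFEST.exists():
        for damaged in audit():
            print(f"FIXITY FAILURE: {damaged}")
    else:
        write_manifest()
```

Notice that the script only detects damage; deciding who runs it, who reads its output, and who repairs what it finds is exactly the kind of question this list is about.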
Whew. I'm afraid that's more than a few questions. Sorry about that. I hope my point is clear: data curation is a complicated problem! "Let Them Eat Disk" will not solve it; big disk is unquestionably necessary, but far from sufficient.
What happens when someone develops a completely new data-storage format, and in five years you can no longer find a device capable of reading all the data you've stored? (Think of all that 7-track tape in storage that NASA has, or even the humble 5.25" floppy. All effectively lost.)
It seems to me that the trick is not to store data at all, but to keep it continuously in use.
Absolutely correct, HP. There's no point to all this work if the data aren't usable and used (aka "we're not doing this for our health").
But 5 1/4" floppies aren't completely unsalvageable yet. Expensive to salvage, sure.
There are two big things missing from a simple one-drive external storage solution: redundancy and backups.
I just installed my first terabyte drive last week, and I've yet to put anything on it because I'm worried about what will happen when the drive fails. With sub-$100 prices on terabyte drives, what one really needs is two drives in a mirrored RAID configuration and, potentially, a third to sit in a drawer in case one of those fails.
That still doesn't get to what happens if lightning hits and fries both drives or if you lose an external drive. For that you need periodic backups to yet another drive, preferably one that's offsite from your current location (what happens if your office burns to the ground?).
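What I'm describing is essentially the "3-2-1" rule of thumb: three copies, on two kinds of media, one of them offsite. Roughly like this sketch, assuming rsync is installed and using entirely hypothetical paths:

```python
import subprocess

# Hypothetical locations: a working copy, a local mirror drive, an offsite host.
SOURCE = "/home/researcher/data/"
LOCAL_MIRROR = "/mnt/mirror_drive/data/"
OFFSITE = "backup@offsite.example.edu:/backups/data/"

def mirror(src: str, dst: str) -> None:
    """Mirror src to dst; --archive preserves permissions and timestamps."""
    subprocess.run(["rsync", "--archive", "--delete", src, dst], check=True)

if __name__ == "__main__":
    mirror(SOURCE, LOCAL_MIRROR)  # second copy, second device
    mirror(SOURCE, OFFSITE)       # third copy, offsite (survives the office fire)
```

Mind you, a scheduled mirror answers the lightning strike and the fire, not the accidental deletion: --delete faithfully propagates your mistakes, so versioned backups are a separate layer on top.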
Your other questions are institutional in nature; all institutions face similar problems, and I'm completely unqualified to address them. But what am I talking about? This is the Internet! Onward!
What I think any sizeable institution needs is a massive central database for storing data that needs to be saved and/or shared: something on the order of a few petabytes if you have quite a few terabyte-before-breakfast people on staff. The data could be made locally available depending on the security requirements and then synced up with the central database periodically. That database could then be professionally managed, with offsite backups; it's far easier to do that with one big system than with hundreds of disparate small ones. This is far from trivial to implement, however, and the biggest hurdle isn't technological so much as organizational: how do you organize the data of hundreds of researchers working on dozens of projects, and, more importantly, how do you get people to stick with that schema and actually use it?
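To make the organizational hurdle concrete: even the smallest version of "a schema people stick with" means the central system refuses any deposit that arrives without a minimum metadata record. A sketch of what such a record might look like, with entirely made-up field names:

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class Access(Enum):
    PUBLIC = "public"
    CAMPUS = "campus"
    RESTRICTED = "restricted"  # e.g. personally identifiable data

@dataclass
class DepositRecord:
    """Minimum metadata the central store demands before accepting files."""
    project: str
    owner: str       # a person, not "the grad student's old laptop"
    steward: str     # who answers for the data after the owner moves on
    access: Access
    checksum_sha256: str
    deposited: date = field(default_factory=date.today)
    description: str = ""

def validate(record: DepositRecord) -> None:
    """Refuse deposits that dodge the schema."""
    if not record.description:
        raise ValueError("Undescribed data is undiscoverable data.")
    if len(record.checksum_sha256) != 64:
        raise ValueError("Deposits must arrive with a SHA-256 checksum.")

# An example deposit that passes validation:
record = DepositRecord(
    project="coral-reef-survey",
    owner="A. Researcher",
    steward="Departmental data manager",
    access=Access.CAMPUS,
    checksum_sha256="0" * 64,
    description="2009 field measurements, CSV; see the data dictionary.",
)
validate(record)
```

The code is the trivial part; getting hundreds of researchers to fill in the steward field honestly, and to keep filling it in, is the part that sinks these systems.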
#1 is about backups, right? The other side of the coin is missing: what happens when the failed drive goes to the garbage heap? The data is probably still recoverable in India or wherever electronic scrap is sent for recycling (see #3).
Un-backed-up big disk is pretty useless, yes, absolutely.
Mike, "institutional"? I see individual, institutional, departmental, disciplinary, funder, governmental, and IT problems. The original Big Ball o' Mud!
Lassi, that's an excellent point and thank you for making it. Anyone with sensitive data needs to be assured that it is both appropriately managed while it is live and properly disposed of once it reaches end-of-life. Records managers are experts at this sort of question.
Digital preservation is in its infancy despite a decade's effort. Large hard drives have all the problems you mentioned and more; they simply cost too much in electricity to keep spinning. A systematic program of migration to the next technology is the basic idea. We attack it as a generational problem: how do we preserve what we have on appropriately retrievable media until the next generation of preservationists can take it over? Sun sponsors an interest group, the Preservation and Archiving Special Interest Group, which includes many bright lights. None have any silver bullets.
Hear, hear.
But we can't let the perfect be the enemy of the good.