The Book of Trogool

Equipment and data curation

Monado of Science Notes commented on my irreplaceable-data post thusly:

It sounds as if the best thing to do in the short term is not throw away the old equipment. And to use the old equipment to copy digital media to newer forms… for which no one ever gets a budget, right?

It’s such a great comment that I want to unpack it a bit. As we work out our data praxis, this kind of question is exactly what we have to confront.

My first question is simple: What equipment are we talking about here? Using what media?

Libraries are wearily familiar with this question in (mostly) analog terms. We have microfilm, microform, and microfiche, and for all their wonderful archival qualities, they’re useless without the accompanying machines. We have sound recordings in everything from wax cylinder to vinyl to eight-track tape to cassette to digital, and in a few cases, we are stuck with analog media we can’t actually use, largely in hope that someday, somehow, money and opportunity will turn up to get the priceless information into a form usable now.

(Thought data-death was purely a digital phenomenon? Goodness, no.)

But there’s another axis to think about our equipment on: data production versus data retention. Our data production equipment may be as great a threat to the viability of the data produced as anything else. Instrument scientists, this means you. What is your instrument putting on your hard drive? Can anything besides your instrument and its bespoke software read it? If not, welcome to dusty data death.

This is just a specific instance of a general rule: for the best odds of survival, prefer open formats to proprietary ones, standardized and documented formats to the reverse, and popular formats to niche ones. Data persistence is a crapshoot. Load the dice.

Monado’s question, though, was about data retention equipment. My answer to this is actually relatively simple. All physical media fail and/or become obsolete. Don’t choose a physical medium based on its purported longevity; gold CDs are not a panacea! Pick a physical medium based on recoverability and ease of migration instead.

Recoverability first. To my mind, this has two parts: noticing problems and fixing them once they’ve been noticed. Gold CDs are horrible for noticing problems; to audit a collection of them, a human being has to sit down at a computer, pop each one in, and test it. Zip drives, Jaz drives, floppy drives, USB sticks: same problem. They’re hard to audit, so nobody audits them, so they fail silently, so the data on them gets lost. And that’s assuming the equipment to read them remains commercially available! (I got so burned by SyQuest: I lost pretty much my entire undergraduate output. This is, I hasten to admit, no great loss to humanity, but it still hurts me personally.)
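
Files that live on always-online storage, by contrast, can be audited without a human popping in discs. Here is a minimal sketch of such an audit, assuming Python and SHA-256 checksums recorded while the files are known to be good. The function names (`sha256_of`, `build_manifest`, `audit`) are made up for illustration and are not any particular preservation tool’s API:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under root (run while files are known good)."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Compare the files under root against a stored manifest.

    Returns a dict mapping each problem file to "missing" or "changed".
    An empty dict means every file listed in the manifest still matches.
    """
    root = Path(root)
    problems = {}
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            problems[name] = "missing"
        elif sha256_of(target) != expected:
            problems[name] = "changed"
    return problems
```

Run on a schedule, something like this turns silent failure into a noisy one. It only notices problems, though; it does not fix them.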

Fixing a problem once it’s been noticed is the province of a good backup system. Got one of those? I hope so.

Ease of migration should be fairly self-explanatory. The easier it is to move your data, the more likely you are to do it when need be. The easier (and less disruptive) it is to swap out a failing or obsolescing bit of your data infrastructure, the better. CDs and DVDs fail on this count, too; copying them is slow and requires a lot of human intervention.
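
When the data sits on disk, a migration can be scripted and checked instead of babysat. A rough sketch, assuming Python, and assuming the old and new storage are both mounted as separate directory trees; `migrate` and `checksum` are hypothetical names for illustration:

```python
import hashlib
import shutil
from pathlib import Path


def checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source, destination):
    """Copy every file under source into destination, then verify each copy.

    Assumes destination is a separate tree, not a folder inside source.
    Returns the relative paths of any files whose copy does not match.
    """
    source, destination = Path(source), Path(destination)
    failures = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        dst = destination / src.relative_to(source)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also carries over timestamps where it can
        if checksum(src) != checksum(dst):
            failures.append(src.relative_to(source).as_posix())
    return failures
```

An empty list back from `migrate` means every copy matched its original at the time of the check; keep the old storage around until that is true.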

The current state of the art is spinning disk (with all appropriate reliability measures) with a backup system. The backup system has, to my understanding, typically been magneto-optical tape (I first wrote “optical” here; thank you, commenter Markk), but “the cloud” is emerging (somewhat to my personal dismay) as an alternative. (Why does the cloud dismay me? Because it’s not making any reliability or sustainability claims yet. This may change, and anyway, what sky has only one cloud? The ideas behind the cloud are good; the implementation just needs work.)

In sum: the equipment you use matters a great deal to the longevity potential of your data. Choose wisely!

Comments

  1. #1 Markk
    August 18, 2009

    “Optical Tape”? What is that? Do you mean magnetic tape? LTO type tapes are what is common in large (100+ TB) size backup systems. And have been for a while. Barring social disaster you will be physically able to read them for at least a couple of generations – there will be equipment around to do it. It is what I recommended to keep medical equipment record archive when I had that job. I agree with everything else you said. Long term it is always the means to read and the understanding of the format that poses the problems.

  2. #2 Dorothea Salo
    August 18, 2009

    Obvious I’m not a sysadmin, isn’t it? Thanks for the correction. I was thinking “magneto-optical” and fumbled.