Migration versus emulation

By dsalo on September 2, 2009.

Just a quickie post today—

In answer to my post about intertwingularity, commenter Andy Arenson suggested that the way to rescue an Excel spreadsheet whose functions or other behaviors depended on a particular version of Excel was to keep that specific version of Excel runnable indefinitely.

This is called "emulation," and it assuredly has its place in the digital-preservation pantheon. Some digital cultural artifacts are practically all behavior—games, for instance—and just hanging onto the source code honestly doesn't do very much good. The artifact is what happens when that code is run, which means preserving it means keeping that code runnable, which in turn means preserving its runtime environment as best we can.

No mere bagatelle, this. If you turn up your nose at games (which you really, really shouldn't), consider the humble Hypercard stack from the 1990s. A good many enterprising artists and designers built rather remarkable things on it, as well as over other bits of the early Macintosh systems environment—and all those things are right this minute in danger of disappearing forever because we can't emulate that environment sufficiently well to rescue them.

For most data, though, I honestly prefer a "migration" strategy, in which format obsolescence is fought by modifying files to keep them usable in modern hardware and software environments. Hardcore emulationists disagree with me; I've seen articles boasting that any environment in the history of computing is trivial to emulate, so why even bother with migration? Frankly, I don't believe a word of it. If it were that trivial, it would have been done already. It hasn't.

I prefer migration because emulation feels like putting the data in a museum: look all you want, but don't touch. Data should be touchable, rearrangeable, mashup-able; a good migration will keep them so. Also, in general migration is much less of a reach for memory organizations than emulation. Take me, for instance. I'm a tolerably talented data migrator. I can't do anything with emulation.

Migration itself is not always trivial and can be lossy. My friend Tim Donohue developed (and won a conference prize with) a DSpace hack that sends Microsoft Office files through a copy of OpenOffice.org running on the DSpace server, saving ODF versions of the files to DSpace along with the Office versions. Worked like a charm, as far as it went. What was the problem? FONTS. Because the server had a minimal font complement at best, the ODF files came out looking unusably horrible.
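
For readers who want to see what a batch migration of this kind might look like today, here is a minimal sketch. It is not Tim's DSpace code. It assumes a modern LibreOffice install whose `soffice` binary is on the PATH and uses its headless `--convert-to` mode; the function names and the extension table are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Illustrative mapping from Office extensions to ODF targets (an assumption,
# not an exhaustive list of what LibreOffice can read).
ODF_TARGETS = {
    ".doc": "odt", ".docx": "odt",
    ".xls": "ods", ".xlsx": "ods",
    ".ppt": "odp", ".pptx": "odp",
}


def build_convert_command(source: Path, out_dir: Path, soffice: str = "soffice") -> list[str]:
    """Build the headless LibreOffice command that converts `source` to ODF."""
    target = ODF_TARGETS.get(source.suffix.lower())
    if target is None:
        raise ValueError(f"no ODF target known for {source.suffix!r}")
    return [
        soffice, "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(source),
    ]


def migrate(source: Path, out_dir: Path) -> Path:
    """Run the conversion and return the expected path of the ODF copy.

    The original file is left untouched, so both versions can be kept.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH (LibreOffice is assumed)")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(source, out_dir), check=True, timeout=120)
    return out_dir / (source.stem + "." + ODF_TARGETS[source.suffix.lower()])
```

Calling `migrate(Path("budget.xls"), Path("odf"))` would be expected to leave an `odf/budget.ods` next to the untouched original. Note that a sketch like this inherits exactly the font problem described above: the converted file can only use the fonts installed on the machine doing the conversion.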

Migration is sometimes impossible, if the origin format is proprietary, opaque, or otherwise not reverse-engineerable. Unfortunately, emulation has limited if any success in this situation as well; if the file format is obfuscated, so is the software environment, generally!

Of course, the gold standard is a research workflow that respects data enough to put thought and care into describing it and using future-friendly formats right from the beginning. We don't live in that world, and we may never live in that world… so the migration-versus-emulation wars are only beginning.

Yeah, emulation is a manky piece of business. Even if you really and truly nail down any strange hardware that needs to be emulated, heaven help you if there isn't any dependency documentation, or if those dependencies are currently having a chat with /dev/null.

Pouring the actual content into a new container seems like a definite winner to me.

Hi,

Firstly, great blog and great posts!
Secondly, I think there is a lot of potential for emulation for things such as databases and other complicated digital objects. Furthermore, I think that ought not be that difficult. Virtual machine software products like VirtualBox let users install any version of Windows, Mac OS, Linux, DOS, etc. and run other software on top of that from within most current desktop environments.

Admittedly, maintaining an understanding of the products involved and the various dependencies amongst the particular software tools will be a challenge, but not an insurmountable one. After all, that's what user manuals and software libraries are for!

Thanks again,

Euan Cochrane

Isn't the environment, not just the functionality of whatever worked in that environment, something that you would want to preserve as well?

Migration may be OK for standalone objects like text documents, images, video, etc., but it presents all sorts of conceptual problems for more complex types of data. I am involved with a large national web-archiving project and it's hard to see how one could possibly migrate such an archive en masse. Individual embedded objects could be migrated, but what about the structure, layout, and functionality of whole pages and sites? What you're talking about is taking a page written when, say, IE6 was the most common browser and migrating it so that users of IE8 or Firefox 3.5 will somehow have the same "user experience" as the original audience. That's going to be very tricky.

Kyle, "sometimes" is the best answer I have. For a game that won't work otherwise? Absolutely. For a spreadsheet? I'd rather migrate the data to something more durable if possible.

Csrster, you're right; it's a difficult problem, and the example you give is perfectly apropos. When I'm faced with a website-archiving problem, I tend to go old-school: standardize the markup, staticize the site, and save it that way. Obviously, there are sites this technique won't work for! For many, however, it's an acceptable compromise, one that makes it unlikely the code will ever have to be touched again.

That's why it's smart to think about durability from the very beginning of a project. There's nothing preventing people from writing standards-compliant code and avoiding Flash and suchlike perishable techniques from the outset -- except not realizing the consequences.

Two comments: first, migration is needed less than you think (see David Rosenthal's blog posts based on his CNI talk). Second, emulation in many contexts needs the original software to continue working. In particular, it needs to be licensed indefinitely. Not easy for much proprietary software, whose business model can depend on getting you to re-license what you had, and sometimes on making earlier versions not work....

I have my reservations about Rosenthal, but in the main he is right (as you are). I don't fuss too terribly much about mainstream formats. Proprietary formats in instrument science hurt my heart, though.

Very good point about the licensing. One more danger of proprietary software and data formats!
