The dangers of intertwingularity

By dsalo on August 31, 2009.

When I was but a young digital preservationist, I was presented with an archival problem I couldn't solve.

This should not sound unusual. It happens a lot, for all sorts of reasons. If I can keep a few people from falling into traps that make digital preservationists throw up their hands in despair, I'm happy.

Anyway, the problem was a website with some interactions coded in JavaScript. If those interactions didn't work, the site made significantly less sense. (It could have been worse; even without the JavaScript, the materials on the site were still reachable.)

The JavaScript had been coded pre-ECMA standardization. Some of it was obsolete, so obsolete it just didn't work any more in modern browsers. Neither did the site.

I am not a JavaScript programmer, so I had to turn down archiving the site. I wasn't happy about it, but sometimes life is like that.

It's always dangerous to intertwingle content and presentation. (That doesn't mean it's not sometimes necessary, of course… but necessity doesn't obviate the danger.) It's an order of magnitude more dangerous to intertwingle content, presentation, and behavior. Data outlasts code!

This has some implications for the data deluge. Consider, for example, the humble Excel spreadsheet, that common workhorse of data management. (Stop sneering, you statistics types with your fancy tools, and you database admins can hush too.) There's no behavior in an Excel spreadsheet, you may say; where's the problem?

Used a function anywhere in your spreadsheet? That's behavior, embedded right there inside your data where you least want it. Function definitions change among versions of Excel, and heaven help you if you move from Excel to Apple Numbers or OpenOffice Calc. Will your results still look the way they did when you first wrote the function? Who knows?
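
To make that concrete, here is a small Python sketch of how one innocent-looking function, rounding, can give different answers depending on whose definition runs it. It is an illustration only, not taken from any particular spreadsheet: as far as I know, spreadsheet ROUND functions typically round halves away from zero, while Python 3's built-in round() rounds halves to the nearest even number. The helper name round_with and the sample values are made up for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_with(value: str, mode: str) -> Decimal:
    """Round a decimal string to a whole number using the given rule."""
    return Decimal(value).quantize(Decimal("1"), rounding=mode)


# Hypothetical sample data: three values that sit exactly on a half.
samples = ["0.5", "1.5", "2.5"]

# Rule 1: halves round away from zero (the usual spreadsheet behavior).
half_up = [round_with(s, ROUND_HALF_UP) for s in samples]

# Rule 2: halves round to the nearest even number (Python 3's round()).
half_even = [round_with(s, ROUND_HALF_EVEN) for s in samples]

if __name__ == "__main__":
    for s, up, even in zip(samples, half_up, half_even):
        print(f"{s}: half-up gives {up}, half-even gives {even}")
```

Same inputs, same "function," two sets of results. Neither rule is wrong, but a dataset that silently depends on one of them is only reproducible as long as the software applying it stays put.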

Built a chart or graph anywhere in your spreadsheet? Same problem, only more so.

On a slightly more abstract level, what's happening is that you're allowing your data analysis to rely on code that you didn't write, don't control, and can't document. This is obviously not ideal for long-term use of the data.

Disentangling behavior from data is very, very far from simple. Looking at this from the point of view of a would-be institutional-data librarian, I am flatly terrified by the variety of data that may come to my doorstep, and the concomitant explosion of behaviors that I may be expected to code and support.

I don't have an answer… but all of us who love data need to be asking these questions.


I absolutely agree. For a good example of this issue, see http://www.ideals.uiuc.edu/handle/2142/13337, which is a zipped file (a bad place to start, but we unzip and archive the contents as separate files). The data itself is highly interdependent, and we need to work with the depositor to better separate out the macros and other things that are embedded within the file. Not pretty.

There have been several moderately recent attempts to provide frameworks that hope to avoid this problem. The basic idea is the Model-View-Controller idiom. A separation is maintained between the Model (the data, the relationships and similar assertions, etc.), the View (the presentation, such as the visible fields, the functions used to produce their contents, etc.), and the Controller (the UI, or how the user interacts with the Model via the View). XML and XHTML notations, and in-browser code written in Java, JavaScript, Flash, or a couple of other plug-in technologies, were developed with this separation in mind.
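
A minimal sketch of that separation, in Python rather than any of the browser technologies named above. All the names here (Model, text_view, csv_view, Controller, add_reading) are invented for illustration and do not come from any particular framework. The point is only that the data can be archived, or shown a second way, without touching the code that displays or edits it.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """The data only: no formatting, no input handling."""
    readings: list[float] = field(default_factory=list)


def text_view(model: Model) -> str:
    """One presentation of the model: a human-readable summary."""
    return f"{len(model.readings)} readings, total {sum(model.readings):.1f}"


def csv_view(model: Model) -> str:
    """A second presentation of the same, untouched model."""
    return "\n".join(f"{r:.1f}" for r in model.readings)


class Controller:
    """Turns user input into changes on the model; knows nothing about views."""

    def __init__(self, model: Model) -> None:
        self.model = model

    def add_reading(self, raw: str) -> None:
        # Parse the raw user input and store only the resulting value.
        self.model.readings.append(float(raw))
```

If the views or the controller rot, the Model is still a plain list of numbers that can be written out and preserved on its own.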

But somehow far too many people fail to apply the technique. Perhaps because the authoring tools for doing so are still too primitive, too "hard to use". Perhaps because it is correspondingly too easy to mix the metaphors. But there is the technology out there, and there are some examples of its successful application.

Note: the MVC paradigm has been around at least since Alan Kay at PARC in the 70's used it for Smalltalk. Trouble is, only in the last decade, maybe even only the past five years, has our common desktop iron been fast enough to run Smalltalk at a usable speed. The various desktop shells (Windows, X11, Quartz) also use the model, but not in a way that prohibits programmers from mixing the metaphors.

I think the best solution is to archive the environment along with the data, so if someone writes something in Excel 7.89.11b, then by golly, Excel 7.89.11b needs to be right there with it. This problem explodes in complexity when analysis, presentation, etc. is the result of complex interactions such as data from multiple sources being dynamically combined and/or data/presentation that result from a variety of software being used together. Still, if we want the Global Data Way Back Machine to work, it's got to run that old software.

Ah, the migration versus emulation debate. :)

Personally, I prefer data that doesn't depend on behaviors. In the case of my Excel spreadsheet, I'd want to wave a magic wand that calculates the functions and puts the results in the spreadsheet once and for all. If the magic wand also documented the exact calculation being performed, so much the better.
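
For the curious, a rough Python sketch of what such a magic wand might do. Everything in it is hypothetical: the SHEET dictionary stands in for a real spreadsheet, the two-entry FUNCTIONS table stands in for a real formula engine, and formulas are assumed to refer only to plain values, never to other formulas. It computes each formula once, keeps the result as ordinary data, and writes down which function and inputs produced it.

```python
import csv
import io
from statistics import mean

# Hypothetical in-memory sheet: a cell holds either a plain value
# or a (function_name, [input cells]) pair standing in for a formula.
SHEET = {
    "A1": 2.0,
    "A2": 3.0,
    "A3": 7.0,
    "B1": ("SUM", ["A1", "A2", "A3"]),
    "B2": ("AVERAGE", ["A1", "A2", "A3"]),
}

# Stand-in for a formula engine; a real one would need far more than this.
FUNCTIONS = {"SUM": sum, "AVERAGE": mean}


def freeze(sheet):
    """Replace every formula with its computed value and record how it was made."""
    values = {}
    provenance = []
    for cell, content in sheet.items():
        if isinstance(content, tuple):
            name, inputs = content
            # Assumption: inputs are plain values, not formulas themselves.
            args = [sheet[ref] for ref in inputs]
            values[cell] = FUNCTIONS[name](args)
            provenance.append(
                {"cell": cell, "function": name, "inputs": " ".join(inputs)}
            )
        else:
            values[cell] = content
    return values, provenance


def to_csv(rows, fieldnames):
    """Serialize the provenance records as CSV text to store beside the data."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    values, provenance = freeze(SHEET)
    print(values)
    print(to_csv(provenance, ["cell", "function", "inputs"]))
```

The frozen values no longer need any particular software to be read, and the provenance file says what was calculated even if the function that did it changes or disappears.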

Data outlasts code. I'd rather focus on data than code. I have a notable bias, however, because I work pretty well at preserving data, but preserving code intimidates me. (Do NOT ask me what I will do when Python 3 goes gold and all my homegrown libraries break. Just don't ask.)

I think the best solution is to archive the environment along with the data, so if someone writes something in Excel 7.89.11b, then by golly, Excel 7.89.11b needs to be right there with it.

along, presumably, with the operating system and hardware to run Excel 7.89.11b on top of. after all, it won't be too much longer before nobody has a 32-bit computer to run a 32-bit-only version of Windoze 6.7.89q on, which was the last one known to run Excel 7.89.11b problem-free. (hey, a BSD Unix box with a game of Zork on it is tickling my memory for some reason...)

Do NOT ask me what I will do when Python 3 goes gold and all my homegrown libraries break. Just don't ask.

in the Python developers' defense, they do seem to be going to remarkable lengths to make porting old code as easy as possible --- even writing and publishing code-translator programs to help the task. that's far more than i've seen out of similar efforts to update programming languages.

and i've also seen what happens to programming languages that never get updated, or that attempt to keep backwards compatibility inviolate forever. not a pretty picture, that. Python's far better designed than most --- i know, i make most of my living writing Perl and PHP, as atonement for sins in some past life i'm sure --- but it can't stay that way forever unless it breaks with the old at some point.

