When I was but a young digital preservationist, I was presented with an archival problem I couldn't solve.
This should not sound unusual. It happens a lot, for all sorts of reasons. If I can keep a few people from falling into traps that make digital preservationists throw up their hands in despair, I'm happy.
Anyway, the problem was a website with some interactions coded in Javascript. If those interactions didn't work, the site made significantly less sense. (It could have been worse; even without the Javascript, the materials on the site were still reachable.)
The Javascript had been coded pre-ECMA standardization. Some of it was obsolete, so obsolete it just didn't work any more in modern browsers. Neither did the site.
I am not a Javascript programmer, so I had to turn down archiving the site. I wasn't happy about it, but sometimes life is like that.
It's always dangerous to intertwingle content and presentation. (That doesn't mean it's not sometimes necessary, of course… but necessity doesn't obviate the danger.) It's an order of magnitude more dangerous to intertwingle content, presentation, and behavior. Data outlasts code!
This has some implications for the data deluge. Consider, for example, the humble Excel spreadsheet, that common workhorse of data management. (Stop sneering, you statistics types with your fancy tools, and you database admins can hush too.) There's no behavior in an Excel spreadsheet, you may say; where's the problem?
Used a function anywhere in your spreadsheet? That's behavior, embedded right there inside your data where you least want it. Function definitions change among versions of Excel, and heaven help you if you move from Excel to Apple Numbers or OpenOffice Calc. Will your results still look the way they did when you first wrote the function? Who knows?
Built a chart or graph anywhere in your spreadsheet? Same problem, only more so.
On a slightly more abstract level, what's happening is that you're allowing your data analysis to rely on code that you didn't write, don't control, and can't document. This is obviously not ideal for long-term use of the data.
Disentangling behavior from data is very, very far from simple. Looking at this from the point of view of a would-be institutional-data librarian, I am flatly terrified by the variety of data that may come to my doorstep, and the concomitant explosion of behaviors that I may be expected to code and support.
I don't have an answer… but all of us who love data need to be asking these questions.
I absolutely agree. For a good case of this issue, see http://www.ideals.uiuc.edu/handle/2142/13337, which is a zipped file (a bad place to start, but we unzip and archive these as separate files). But the data itself is highly interdependent, and we need to work with the depositor to try to better separate out the macros and other things that are embedded within the file. Not pretty.
There have been several moderately recent attempts to provide frameworks that hope to avoid this problem. The basic idea is based on the Model - View - Controller idiom. A separation is maintained between the Model (the data, the relationships and similar assertions, etc.), the View (the presentation, such as the visible fields and the functions used to produce their contents), and the Controller (the UI, or how the user interacts with the Model via the View). XML and XHTML notations, and in-browser code written in Java, Javascript, Flash, or a couple of other plug-in technologies, were developed with this separation in mind.
But somehow far too many people fail to apply the technique. Perhaps because the authoring tools for doing so are still too primitive, too "hard to use". Perhaps because it is correspondingly too easy to mix the metaphors. But the technology is out there, and there are some examples of its successful application.
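For what it's worth, the separation described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the class names `Model`, `TextView`, and `Controller`, and the "readings" data, are hypothetical and do not come from any particular framework. The point is that the model can be serialized and archived without dragging the view or the controller along with it.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Model:
    """The data alone: no formatting, no interaction logic."""
    readings: list[float] = field(default_factory=list)

    def to_json(self) -> str:
        # The model can be archived on its own, independent of any view.
        return json.dumps({"readings": self.readings})


class TextView:
    """Presentation only: turns model data into something to look at."""

    def render(self, model: Model) -> str:
        if not model.readings:
            return "no readings"
        mean = sum(model.readings) / len(model.readings)
        return f"{len(model.readings)} readings, mean {mean:.2f}"


class Controller:
    """Behavior only: applies user actions to the model, then re-renders."""

    def __init__(self, model: Model, view: TextView) -> None:
        self.model = model
        self.view = view

    def add_reading(self, value: float) -> str:
        self.model.readings.append(value)
        return self.view.render(self.model)
```

If the view or controller code stops running some day, whatever `to_json()` wrote out is still plain data that a future reader can open.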
Note: the MVC paradigm has been around at least since Alan Kay's group at PARC used it for Smalltalk in the 70's. Trouble is, it is only in the last decade, maybe even only the past five years, that our common desktop iron has been fast enough to run Smalltalk well. The various desktop shells, Windows, X11, Quartz, also use the model, but not in a way that prohibits programmers from mixing the metaphors.
I think the best solution is to archive the environment along with the data, so if someone writes something in Excel 7.89.11b, then by golly, Excel 7.89.11b needs to be right there with it. This problem explodes in complexity when analysis, presentation, etc. are the result of complex interactions, such as data from multiple sources being dynamically combined, or data and presentation that result from a variety of software being used together. Still, if we want the Global Data Way Back Machine to work, it's got to run that old software.
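Archiving the software itself is a big job, but a much smaller first step is at least recording which environment a data file came from. Here is a minimal sketch in Python, assuming a hypothetical helper `build_manifest` and manifest field names that I made up for illustration. It uses only the standard library, and it only describes the environment; it does not preserve it.

```python
import hashlib
import platform
import sys


def build_manifest(data: bytes, filename: str, software: dict[str, str]) -> dict:
    """Describe a data file and the environment it was deposited from."""
    return {
        "file": filename,
        # A checksum lets a future reader confirm the bytes are unchanged.
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        # Names and versions of the software that produced the file,
        # as reported by the depositor (not detected automatically).
        "software": software,
        # Details of the machine running this script, for what they are worth.
        "platform": platform.platform(),
        "python": sys.version.split()[0],
    }
```

Calling `build_manifest(contents, "results.xls", {"Excel": "7.89.11b"})` and saving the result as JSON next to the file would tell whoever opens it later which software to go looking for.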
Ah, the migration versus emulation debate. :)
Personally, I prefer data that doesn't depend on behaviors. In the case of my Excel spreadsheet, I'd want to wave a magic wand that calculates the functions and puts the results in the spreadsheet once and for all. If the magic wand also documented the exact calculation being performed, so much the better.
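The magic wand might look something like the Python sketch below. It rests on a loud assumption: the spreadsheet program has already computed each result, and something else (not shown) has pulled the addresses, formulas, and last-computed values out of the file. The `Cell` class and the `freeze` function are hypothetical names of my own. Nothing here evaluates a formula; it only separates the results from the behavior and writes down what the behavior was.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    address: str                   # e.g. "B3"
    value: object                  # result as last computed by the spreadsheet
    formula: Optional[str] = None  # e.g. "=SUM(B1:B2)", or None for plain data


def freeze(cells: list[Cell]) -> tuple[str, str]:
    """Return (values_csv, provenance_csv) for a list of cells."""
    values = io.StringIO()
    provenance = io.StringIO()
    values_writer = csv.writer(values, lineterminator="\n")
    prov_writer = csv.writer(provenance, lineterminator="\n")
    values_writer.writerow(["address", "value"])
    prov_writer.writerow(["address", "formula", "frozen_value"])
    for cell in cells:
        # Every cell keeps its value; the formula is dropped from the data.
        values_writer.writerow([cell.address, cell.value])
        if cell.formula is not None:
            # Formula cells are also documented in a separate sidecar file.
            prov_writer.writerow([cell.address, cell.formula, cell.value])
    return values.getvalue(), provenance.getvalue()
```

The first CSV is behavior-free data; the second records which cells were once calculated and how, so the calculation can be checked or redone later in whatever tool is around.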
Data outlasts code. I'd rather focus on data than code. I have a notable bias, however, because I work pretty well at preserving data, but preserving code intimidates me. (Do NOT ask me what I will do when Python 3 goes gold and all my homegrown libraries break. Just don't ask.)
along, presumably, with the operating system and hardware to run Excel 7.89.11b on top of. after all, it won't be too much longer before nobody has a 32-bit computer to run a 32-bit-only version of Windoze 6.7.89q on, which was the last one known to run Excel 7.89.11b problem-free. (hey, a BSD Unix box with a game of Zork on it is tickling my memory for some reason...)
in the Python developers' defense, they do seem to be going to remarkable lengths to make porting old code as easy as possible --- even writing and publishing code-translator programs to help the task. that's far more than i've seen out of similar efforts to update programming languages.
and i've also seen what happens to programming languages that never get updated, or that attempt to keep backwards compatibility inviolate forever. not a pretty picture, that. Python's far better designed than most --- i know, i make most of my living writing Perl and PHP, as atonement for sins in some past life i'm sure --- but it can't stay that way forever unless it breaks with the old at some point.