When I was but a young digital preservationist, I was presented with an archival problem I couldn’t solve.
This should not sound unusual. It happens a lot, for all sorts of reasons. If I can keep a few people from falling into traps that make digital preservationists throw up their hands in despair, I’m happy.
It’s always dangerous to intertwingle content and presentation. (That doesn’t mean it’s not sometimes necessary, of course, but necessity doesn’t obviate the danger.) It’s an order of magnitude more dangerous to intertwingle content, presentation, and behavior. Data outlasts code!
This has some implications for the data deluge. Consider, for example, the humble Excel spreadsheet, that common workhorse of data management. (Stop sneering, you statistics types with your fancy tools, and you database admins can hush too.) There’s no behavior in an Excel spreadsheet, you may say; where’s the problem?
Used a function anywhere in your spreadsheet? That’s behavior, embedded right there inside your data where you least want it. Function definitions change among versions of Excel, and heaven help you if you move from Excel to Apple Numbers or OpenOffice Calc. Will your results still look the way they did when you first wrote the function? Who knows?
Built a chart or graph anywhere in your spreadsheet? Same problem, only more so.
On a slightly more abstract level, what’s happening is that you’re allowing your data analysis to rely on code that you didn’t write, don’t control, and can’t document. This is obviously not ideal for long-term use of the data.
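To make that concrete, here is a minimal sketch of the alternative, offered as an illustration rather than a recommendation: the raw numbers live in plain CSV with no formulas, and the calculation lives in a short script you wrote and can read. The column names and values are invented for the example, and the choice of Python is an assumption; any language you can document would serve.

```python
import csv
import io

# Hypothetical raw measurements, kept as plain CSV text with no formulas.
RAW_CSV = "sample,mass_g\na,10.0\nb,12.5\nc,11.0\n"


def load_masses(csv_text):
    """Read the mass_g column from CSV text as a list of floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row["mass_g"]) for row in reader]


def mean(values):
    """Arithmetic mean, spelled out so the definition travels with the data."""
    return sum(values) / len(values)


masses = load_masses(RAW_CSV)
average_mass = mean(masses)
print(f"average mass: {average_mass:.2f} g")
```

The script still depends on Python itself, so the problem shrinks rather than disappears. What changes is that the definition of the average now sits in plain text beside the data, where a future reader can inspect it.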
Disentangling behavior from data is very, very far from simple. Looking at this from the point of view of a would-be institutional-data librarian, I am flatly terrified by the variety of data that may come to my doorstep, and the concomitant explosion of behaviors that I may be expected to code and support.
I don’t have an answer, but all of us who love data need to be asking these questions.