I'm still buried in translating a presentation into Spanish for Monday and finishing another in English for Wednesday, but here's a small thought to tide folks over, a thought that came to me shortly before my presentation at Access.
At the data-curation workshops I've been to, it has been axiomatic that "we can't afford to keep it all." Some fairly sophisticated judgment rubrics have been worked up, often based on the same kinds of judgment calls that special-collections librarians and archivists make when presented with collection opportunities. Is this dataset unique, or could it be recreated? Is it well-described? Is it in good shape? What is its importance to its field? Et cetera.
There's a problem with this mode of decision-making. It's a human problem. It's a problem that is endemic in the institutional-repository context, which is where I became acquainted with it.
The problem is perhaps best illustrated with a parable; I'll borrow Achaea University from Caveat Lector. Dr. Helen Troia comes to data archivist Ulysses Acqua with a pile of helter-skelter basketology data. Ulysses scrutinizes the dataset (with the help of basketology liaison Menelaus Fox), assesses its value honestly, and decides it just doesn't make the cut. He tells Dr. Troia so, stating his reasons in a professionally courteous fashion.
Will Dr. Troia come back to Ulysses five years later, when she's created the dataset that will revolutionize basketology forever? Not terribly likely, I'd say.
There are people behind every dataset, people who care deeply about their work. Rejecting their data is tantamount to rejecting their work, rejecting them as researchers. While such rejection may still be necessary, it should not be done lightly—it is an act with far-reaching political repercussions.
What, for example, will Dr. Troia tell her departmental colleague Dr. Andromache Memnon about Ulysses and the data service? What happens to the Basketology department's data should Dr. Troia become department chair?
Uncomfortable questions, but ones to take into account when designing and publicizing criteria for what data-curation services accept.
I often wonder about this. When you are appraising stuff at the end of its life cycle, it is easier to make judgments about value (especially if the stuff's creator has expired). But when you move the act of appraisal closer toward the creation of a thing (especially a digital thing), things become more complicated. Ulysses has many years of wandering ahead of him.
Although I believe it is true that we won't be able to save everything, I wonder if there is a way for us to accept everything (or most everything) provisionally, acknowledging that at some point we may decide not to preserve something, or find ourselves unable to.
A standard deed of gift form includes a provision about the donee's right to deaccession, but I think we need to be forthright about our commitments and limitations.
Do you think there is a difference in the attitudes of people bringing data to archivists compared to people sending papers off to journals? I know many of us have been rejected from glamour journals, sometimes even before peer review, but we still keep trying.
Michael, you are absolutely right. Particularly if we adopt a RepoMMan-style service that handles data throughout its lifecycle, we can't make good judgments up-front! One way to approach the issue may be periodic weedings/re-evaluations at set time boundaries (five years/ten years/twenty years).
Isis, I do think there's a difference -- it's in who's doing whom favors. Faculty approach journals hat in hand. Faculty approach librarians and IT professionals as nobility expecting fealty and unquestioning service.
If it's 'axiomatic' that "we can't afford to keep it all", is anyone computing the costs for this, apart perhaps from David Rosenthal? We should not be seeking to make up-front selection judgements in the absence of costs or clear policy. Nor should we allow subjective evaluation to become the major cost. Digital content volumes will be large and growing, so we have to make use of digital indicators to create metrics that can aid selection, but only once we know the criteria and scope. This may involve science and economy more than librarianship.
Librarianship is arguably all about economy, Steve. Of course we've been computing costs, to the best of our ability. We have to; we're the ones who'll be expected to pay those costs!
I find myself viscerally upset by the opposition of "science" and "librarianship" in your comment, and your apparent disdain for the latter. Rather than lose my temper, however, I'll just ask where you think the disconnect is.
It would be nice to bypass "subjective evaluation," but how much of "objectivity" is mythical? Some things we can indeed know without doubt: whether data describe a non-recurrent phenomenon, for instance, such that they are unique and un-reconstructable. But once we start talking about "potential or actual impact" we're chasing our tails again.
We may have to chase our tails. However, I do think the keep-and-weed approach has some merit. If a dataset hasn't been published about (or worse still, downloaded or examined) in twenty years, chances are it's pretty useless. These are phenomena we can keep an eye on.
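The keep-and-weed idea could be sketched as a periodic check. This is purely illustrative: the record fields and the twenty-year threshold are assumptions, not any real repository's schema, and the output is a flag for human re-evaluation, never an automatic deletion.

```python
from datetime import date, timedelta

# Hypothetical record shape; the field names are illustrative only.
dataset = {
    "id": "basketology-001",
    "last_cited": date(1995, 6, 1),       # last known publication using the data
    "last_downloaded": date(2001, 3, 14),  # last download or examination
    "unique": True,                        # describes a non-recurrent phenomenon?
}

WEED_AFTER = timedelta(days=365 * 20)      # the twenty-year boundary suggested above

def flag_for_review(record, today=None):
    """Flag a dataset for human re-evaluation, not automatic deletion.

    A record is flagged only when every usage signal has been silent for
    the whole weeding interval; unique, un-reconstructable data is exempt.
    """
    today = today or date.today()
    if record["unique"]:
        return False  # un-reconstructable data stays regardless of usage
    last_activity = max(record["last_cited"], record["last_downloaded"])
    return (today - last_activity) > WEED_AFTER

print(flag_for_review(dataset, today=date(2025, 1, 1)))  # → False: unique data is kept
```

The point of the sketch is that the machine only nominates candidates at each time boundary; the contentious judgment calls stay with a human reviewer.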
Dorothea, no disdain was intended. My comment was referring to this part of your blog: "Ulysses (a 'data archivist') scrutinizes the dataset, assesses its value honestly, and decides it just doesn't make the cut." On the basis of what? That's the part I would like to see expanded. The expansion, I suggested, involves science and economy, rather than the subjective judgement implied. I am happy to be corrected if the role distinctions I think are implied are wrong, and I may have created an unhelpful distraction, but the purpose of my enquiry is to see where and how economy (it's "all about economy") is being applied in this case.

I am mindful that we have only a relatively few years of experience with digital data, so we have little formal framework on which to base this, but everything suggests we will have to compute value, and cost, to decide how to manage digital data. This will use the new signals (e.g. usage, links, etc.) at our disposal, as well as accounting for some of the intrinsic features of the data you highlight, rather than seeking one expert's view on one object, as we may have done up to now.
Aha! Okay, I'm with you now. Thanks for the clarification.
Yes, I took a bit of a shortcut through the decision-trees I've seen advanced on this subject. I will lay them out in a future post; that should (I hope) get us on the same page.
Even so, though, the human problem I point out in this post will persist. How do we reject one dataset while not rejecting its creator? How do we ensure that that creator will return to us with other, possibly better, datasets?
In this discussion, I can't help thinking about the multiple out-takes from Hollywood films that were recycled for their silver content. Out-takes from films such as The Wizard of Oz and Gone with the Wind would be worth their weight in gold today, but they were worth nothing other than their silver content when they were recycled.
It is my understanding (but I am virtually completely ignorant of such things) that the largest "cost" of digital storage is the labor associated with making it usable.
There is a series of articles in the NYT about historic photos and how truly "historic" they are.
Keeping only the "pretty" pictures may have been good business sense in the day, but as a historical record the photos are of limited value. Having all the negatives from a photo shoot would be better.
Quite right; for almost anything you name, the labor will be the biggest cost.
We're going to have to make some tough decisions as time passes, and without question we will make some wrong ones. I don't think we should shy away from that responsibility, because if we do, we will save nothing.