Quick thought: rejecting data or rejecting people?

By dsalo on October 16, 2009.

I'm still buried in translating a presentation into Spanish for Monday and finishing another in English for Wednesday, but here's a small thought to tide folks over, a thought that came to me shortly before my presentation at Access.

At the data-curation workshops I've been to, it has been axiomatic that "we can't afford to keep it all." Some fairly sophisticated judgment rubrics have been worked up, often based on the same kinds of judgment calls that special-collections librarians and archivists make when presented with collection opportunities. Is this dataset unique, or could it be recreated? Is it well-described? Is it in good shape? What is its importance to its field? Et cetera.

There's a problem with this mode of decision-making. It's a human problem. It's a problem that is endemic in the institutional-repository context, which is where I became acquainted with it.

The problem is perhaps best illustrated with a parable; I'll borrow Achaea University from Caveat Lector. Dr. Helen Troia comes to data archivist Ulysses Acqua with a pile of helter-skelter basketology data. Ulysses scrutinizes the dataset (with the help of basketology liaison Menelaus Fox), assesses its value honestly, and decides it just doesn't make the cut. He tells Dr. Troia so, stating his reasons in a professionally courteous fashion.

Will Dr. Troia come back to Ulysses five years later, when she's created the dataset that will revolutionize basketology forever? Not terribly likely, I'd say.

There are people behind every dataset, people who care deeply about their work. Rejecting their data is tantamount to rejecting their work, rejecting them as researchers. While such rejection may still be necessary, it should not be done lightly—it is an act with far-reaching political repercussions.

What, for example, will Dr. Troia tell her departmental colleague Dr. Andromache Memnon about Ulysses and the data service? What happens to the Basketology department's data should Dr. Troia become department chair?

Uncomfortable questions, but ones to take into account when designing and publicizing criteria for what data-curation services accept.


I often wonder about this. When you are appraising stuff at the end of its life cycle, it is easier to make judgments about value (especially if the stuff's creator has expired). But when you move the act of appraisal closer toward the creation of a thing, especially a digital thing, things become more complicated. Ulysses has many years of wandering ahead of him.

Although I believe it is true that we won't be able to save everything, I wonder if there is a way for us to accept everything (or most everything) provisionally, acknowledging that at some point we may decide not to, or find ourselves unable to, preserve something.

A standard deed of gift form includes a provision about the donee's right to deaccession, but I think we need to be forthright about our commitments and limitations.

Do you think there is a difference in the attitudes of people bringing data to archivists compared to people sending papers off to journals? I know many of us have been rejected from glamour journals, sometimes even before peer review, but we still keep trying.

Michael, you are absolutely right. Particularly if we adopt a RepoMMan-style service that handles data throughout its lifecycle, we can't make good judgments up-front! One way to approach the issue may be periodic weedings/re-evaluations at set time boundaries (five years/ten years/twenty years).

Isis, I do think there's a difference -- it's in who's doing whom favors. Faculty approach journals hat in hand. Faculty approach librarians and IT professionals as nobility expecting fealty and unquestioning service.

If it's 'axiomatic' that "we can't afford to keep it all", is anyone computing the costs for this, apart perhaps from David Rosenthal? We should not be seeking to make up-front selection judgements in the absence of costs or clear policy. Nor should we allow subjective evaluation to become the major cost. Digital content volumes will be large and growing, so we have to make use of digital indicators to create metrics that can aid selection, but only once we know the criteria and scope. This may involve science and economy more than librarianship.

Librarianship is arguably all about economy, Steve. Of course we've been computing costs, to the best of our ability. We have to; we're the ones who'll be expected to pay those costs!

I find myself viscerally upset by the opposition of "science" and "librarianship" in your comment, and your apparent disdain for the latter. Rather than lose my temper, however, I'll just ask where you think the disconnect is.

It would be nice to bypass "subjective evaluation," but how much of "objectivity" is mythical? Some things we can indeed know without doubt: whether data describe a non-recurrent phenomenon, for instance, such that they are unique and un-reconstructable. But once we start talking about "potential or actual impact" we're chasing our tails again.

We may have to chase our tails. However, I do think the keep-and-weed approach has some merit. If a dataset hasn't been published about (or worse still, downloaded or examined) in twenty years, chances are it's pretty useless. These are phenomena we can keep an eye on.

Dorothea, no disdain was intended. My comment was referring to this part of your blog: "Ulysses (a 'data archivist') scrutinizes the dataset, assesses its value honestly, and decides it just doesn't make the cut." On the basis of what? That's the part I would like to see expanded. The expansion, I suggested, involves science and economy, rather than the subjective judgement implied. I am happy to be corrected if the role distinctions you think are implied are wrong, and I may have created a distraction that is unhelpful, but the purpose of my enquiry is to see where and how economy (it's "all about economy") is being applied in this case. I am mindful that we have only relatively few years of experience with digital data, so we have little formal framework on which to base this, but everything suggests we will have to compute value, and cost, to decide how to manage digital data. This will use the new signals (e.g. usage, links, etc.) at our disposal as well as accounting for some of the intrinsic features of the data you highlight, rather than seeking one expert's view on one object, as we may have done up to now.

Aha! Okay, I'm with you now. Thanks for the clarification.

Yes, I took a bit of a shortcut through the decision-trees I've seen advanced on this subject. I will lay them out in a future post; that should (I hope) get us on the same page.

Even so, though, the human problem I point out in this post will persist. How do we reject one dataset while not rejecting its creator? How do we ensure that that creator will return to us with other, possibly better, datasets?

In this discussion, I can't help thinking about the many out-takes from Hollywood films that were recycled for their silver content. Out-takes from films such as The Wizard of Oz and Gone with the Wind would be worth their weight in gold today, but they were worth nothing beyond their silver content when they were recycled.

It is my understanding (but I am virtually completely ignorant of such things) that the largest "cost" of digital storage is the labor associated with making it usable.

There is a series of articles in the NYT about historic photos and how truly "historic" they are.

http://morris.blogs.nytimes.com/2009/10/18/the-case-of-the-inappropriat…

Keeping only the "pretty" pictures may have been good business sense in the day, but as a historical record the photos are of limited value. Having all the negatives from a photo shoot would be better.

Quite right; for almost anything you name, the labor will be the biggest cost.

We're going to have to make some tough decisions as time passes, and without question we will make some wrong ones. I don't think we should shy away from that responsibility, because if we do, we will save nothing.
