Magical thinking in data curation

Peter Keane has a lengthy and worthwhile piece about the need for a "killer app" in data management. It's too meaty to relegate to a tidbits post; go read it and see what you think, then come back.

My reaction to the piece is complex, and I'm still rereading it to work through my own thoughts. Here's a beginning, however.

In at least some fields, data are their own killer app. I expect the number of such fields to grow over time, especially as the socio-structural carrots and sticks for data-sharing multiply, as I expect they will. We don't have to talk about the uses for data in the subjunctive mood; there are examples, real live present-day examples, of data and data-sharing advancing research, as well as examples of the lack of data retarding it. So I'm not at all sure we need to prove ab initio that keeping data is a good thing. What we do need to prove is a little more subtle: that putting effort and resources into keeping data is a good thing.

Believe me, those two propositions are absolutely not equivalent. Especially these days, there's any number of good things that nobody wants to put effort and resources into. The newer the good thing, the harder it is to win investment in it; old things have established, often powerful incumbencies to fight for them, and they have the innate advantages of tradition and custom as well.

For all the talk about speculative investment, risk-taking, and innovation—I generally don't bet on real investment in novelty in academia, and I'm even less likely to bet on it in a resource-constrained environment. Face-saving investment of nominal resources, yes; usually not enough resources to matter, because setting up a novelty to fail means that novelty can conveniently be done away with a little later—after all, it didn't work, did it?

I apologize for the cynicism inherent in this argument; I wouldn't be so cynical if I hadn't witnessed this very syndrome quite a few times in quite a few different contexts myself. (You'll forgive me for not offering concrete examples in a public context, I'm sure.) But the fact remains: those of us who advocate novelty in academia have to be terribly careful about how we do it. As the opening of this post may hint, I prefer evidence and example to speculation and futurology as advocacy tools.

Two kinds of nominally-attractive argument, both of which can be found in Keane's post, tend to actively scotch investment in new things. One is the very title of his post: "we need to find a killer app." The other is "it'll be effortless!" The latter especially strikes me as magical thinking, and I'm afraid I consider both counterproductive in the current organizational environment.

Let's pretend for a moment that we're administrators. Someone comes to us saying "I need to build a killer app for data curation." What's data curation? is the natural first question, and why do I care about it right here and now? is the natural sequel. You see the dilemma already: if what data curation needs is a "killer app," but nobody will invest in the building of said app until data curation itself is viewed as a strategic necessity, well…

In short, we need to justify data curation on its own merits, not because it's going to be great someday, really, promise! I think that's quite feasible, mind you. There's plenty of jam today; we don't need to rely on hypothetical jam tomorrow—and doing so may actively harm our cause.

On to the question of effortlessness, where the magical thinking comes thick and fast and from every direction. My cards on the table: data curation costs effort. Can we build tools to make it less effortful? Sure. Should we? Absolutely. Will that ever reduce the effort to zero? Absolutely not. TANSTAAFL, and when we try to imply that there is, we cut our own throats. If data curation is free, who needs data curators?

Right now, I see epic tons of magical thinking about data curation in academia generally and in the researcher community particularly. The idea that it can just be left to graduate students. The idea that information management can be taught in a week's intensive seminar. Metadata, who needs metadata in an age of search engines? Et cetera, and if you'd like concrete examples of some of this magical thinking on the part of researchers, try this JISC report or this Australian report, which are crawling with it.

Keane's "killer app," which will apparently serve every kind of research data in every discipline equally, bothers me a lot. Many a time, I've had poor hapless graduate students call me who have had a passel of research data dumped on them to manage with not the least idea what they should do with it. They assume, because the researchers they work for assume, that there is some kind of killer-app magic bullet that will take an unholy mess of undescribed, undifferentiated digital stuff and miraculously organize it.

There isn't. There is not. Not DSpace, not Fedora, not Drupal, not Vignette, not anything you name. Data curation costs effort. Data curation requires skill, time, process change (a tall order all by itself), and resources. TANSTAAFL.

If I still haven't convinced you, consider this. Around about 2003, libraries were promised that a cheap, easy software tool was going to provide universal open access with minimal ("five minutes per paper!") investment of time and effort. Sounds good, they thought, and many signed on.

The result was the institutional repository.

That's why I'm desperately leery of telling anyone that data curation is going to save effort in the short term, much less that it'll be cheap or easy. We went that route once, and it blew up in our faces.

When it comes to data and the conservation of data, I think the library preservationists have the most challenging job in the entire profession. So many unknowns. Not only do they need to deal with intangible stuff (digital bits), but they also have to deal with rapidly changing hardware and software to manipulate those bits. Relatively speaking, analog materials were much easier.

I tend to agree. It is wishful thinking to believe there will be a killer app for data preservation.

--
ELM

Hi Dorothea-

Thanks for this response! Great stuff and lots to think about.

I'll suggest an alternative interpretation of my piece, though, and an answer to the question I pose (What is Data's Killer App?): "there is none!" or perhaps "it's the web." (I kind of slyly gave the shortened bitly link of "it's rest" -- which is all about just using the web as it was meant to be used.)

Also, I certainly did not want to hint in any way that it would be "effortless." Quite the contrary -- it's a huge undertaking. But it's an undertaking that requires the right tools -- if we are going to use the Web (of course!) -- let's use it and not try to put layers and layers of muck on top of it. That's exactly where the apps born in the late 90s got it wrong. I'd much prefer we look at "systems" and "approaches" (based on standard Open Web specs) and not focus on apps. It's the *interface* that matters (how do I interoperate w/ this app...), not the app itself.

As for the charge that "Keane's 'killer app' ... will apparently serve every kind of research data in every discipline equally," I guess I'll plead guilty (insofar as I am talking about just using the web). It's the Web/HTTP -- if you are not using the Web (and thus HTTP) for interop (reuse and sharing), I suspect you are doing it wrong.

One point I want to be very explicit about (and I'm not sure whether this was your takeaway or not): the system we have built and used at UT to great effect is NOT the killer app of which I am speaking. It's the approach we have used that has proven so effective: getting back to HTTP the way it was originally designed to be used. I'd compare our app to the Mosaic web browser in that regard -- it does amazing things, enabled by the specifications on which it is built. Other apps (say, Apache Jackrabbit/Sling) will do the same, and more slickly and solidly than we could pull off.

One kind of funny note: when you say "I generally don't bet on real investment in novelty in academia, and I'm even less likely to bet on it in a resource-constrained environment," I misread it the first time -- the REST/Web/HTTP style of architecture relies on this idea of "resources" on the web that can have various "representations" (i.e., a data set "resource" might be "represented" as XML, CSV, HTML, etc.). The idea is that you constrain interactions to a uniform set of operations (GET/PUT/POST/DELETE) on resources, as opposed to the RPC style, where you perform remote procedure calls. It's the RPC style that has limited the flexibility/scalability/interoperability of our current suite of tools. So in that sense the idea of a "resource-constrained environment" makes me quite pleased! ;-)
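To make that concrete, here's a minimal sketch of the resource/representation style using nothing but a stock HTTP client. The dataset URL is hypothetical -- the point is just that one resource has many representations, selected by the Accept header, and the same few verbs apply to every resource:

```python
# A minimal sketch of the resource/representation style, using only the
# Python standard library. The dataset URL is hypothetical.
import urllib.request

DATASET = "https://data.example.edu/datasets/42"  # one resource

def fetch(url, media_type):
    """GET one representation of a resource via content negotiation."""
    req = urllib.request.Request(url, headers={"Accept": media_type})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Many representations of the same resource, chosen by the Accept header:
as_csv = fetch(DATASET, "text/csv")
as_xml = fetch(DATASET, "application/xml")

# The uniform interface: the same small set of verbs works on every resource.
update = urllib.request.Request(DATASET, data=as_csv, method="PUT",
                                headers={"Content-Type": "text/csv"})
delete = urllib.request.Request(DATASET, method="DELETE")
```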

Of course you mean constrained resources, like "tight budgets." Along those lines, I have regularly recommended innovation not in the "spend money" sense, but rather pointed to nicely RESTful systems like YouTube, Flickr, etc. Look at what the Library of Congress has been doing w/ Flickr -- that's the kind of innovation (and *especially* in a time of tight budgets) that excites me.

I hope that helps!

Just to head off a possible point of confusion in my comment above... When I say "it's the *interface* that matters," I am not referring to the end-user interface (i.e., what is on the web page), but rather the API (i.e., the services interface). If that is standardized & open (e.g., an Atom feed on a blog), all kinds of clients, both human and machine-based, can interoperate with it. Another example would be, say, a SWORD interface to an IR (which, as it happens, is based on AtomPub).
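As a tiny illustration of what I mean by a machine-facing interface, any generic Atom-aware client can consume such a feed without knowing anything about the application behind it. The feed URL here is hypothetical:

```python
# Sketch: consuming a standard Atom feed with a generic client. The feed
# URL is hypothetical; nothing here is specific to any one application.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # the Atom XML namespace
FEED = "https://repository.example.edu/collection/recent.atom"

with urllib.request.urlopen(FEED) as resp:
    feed = ET.fromstring(resp.read())

# Any Atom-aware client, human-facing or machine-based, can do this much:
for entry in feed.findall(ATOM + "entry"):
    print(entry.findtext(ATOM + "updated"), entry.findtext(ATOM + "title"))
```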

Eric: Be careful saying that analog preservation is "easy," please. I have already commented at some length that it's not easy, it's just invisible. I decline to traduce the work of my analog-preservation colleagues, and I believe we should all so decline.

Peter: We may have to agree to disagree. You seem to believe the major challenges are technical. I don't. To me, the technical stuff is the STONE COLD SIMPLE part. The hard part is the social, cultural, and process changes that will have to happen before your techniques have anything at all to work with.

By way of example, this: how do all your protocols handle data that needs to be preserved but must not be shared?

Dorothea-

It may be stone cold simple (it's not to me!), but here's how the web deals with authentication: http://www.ietf.org/rfc/rfc2617.txt. It's not the complete answer to your question, but for stuff on the web, it's critical that we (or the developers of our tools) understand the web's authentication model.
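For instance, here's a minimal sketch of the Basic scheme from that RFC; the URL and credentials are made up, and since Basic only base64-encodes credentials (it does not encrypt them), in practice you'd run this over TLS:

```python
# A minimal sketch of RFC 2617's Basic scheme: credentials travel in a
# standard Authorization header that any HTTP client or server understands.
# The URL and credentials are hypothetical; use TLS in real deployments.
import base64
import urllib.request

url = "https://data.example.edu/restricted/dataset.csv"
token = base64.b64encode(b"librarian:s3cret").decode("ascii")

req = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
with urllib.request.urlopen(req) as resp:
    data = resp.read()
```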

My contention is that we think we understand the technical stuff but we largely don't. And poorly designed tools don't help.

Yes, I think "agree to disagree" is very much where we'll have to leave this. I know about RFC 2617. It doesn't help, because as you say, it's an incomplete answer. How do you know who I am? How do you know where I work? Can you prove it? If not, how do you use RFC 2617 to grant access to a given dataset to "only the librarians at MPOW"?

But that sort of thing is only partly a technical question, you see. It's partly the crucial social question of what researchers want and need to do with data. All the RFCs in the world won't answer that one -- and if we try to answer it with RFCs, we're (bluntly) hosed.

No argument on poorly-designed tools; I've been a DSpace sysadmin. But DSpace's poor design is only partly internal. Much of it has to do with an abysmal understanding of the sociocultural problem space.

Ok, fair enough. I'll add, though, a point that I made in my last ("Take Two") blog post -- our problems are in no way unique/special. I guarantee that there are countless examples outside the library/academia of data that needs to be preserved yet not shared. That we believe it's solely our role to solve that problem is an unwise/dangerous assumption. The issue of "who am I," "where do I work," and "upon what authority can you prove it" is part of an ongoing (and in some places nicely addressed) problem. It's why the Shibboleth folks are asking for input and actively looking at what OpenID and OAuth have done as lower-barrier-to-entry technologies.

Our role, as I see it, is to understand the technical side (at least with a decent level of sophistication) and to combine that with our unique and in-depth knowledge of the real-world problems our users are facing (you'll hear no argument from me that we are uniquely situated to observe/understand those problems), so we can offer our tool makers better feedback. Otherwise we'll continue to be handed tools utterly unsuited to our needs.

Of course the problems are mostly social/cultural (not technical); I hope I don't seem to be suggesting otherwise. We have screws that need tightening, and there is no avoiding that people are going to have to engage in tightening those screws. But we are all holding hammers. It's really awkward to use a hammer's claw to drive a screw, but we are getting better and better at it.... I'm just saying there might be a better way :-).

Really nice convo going here. The great thing about RFCs is that they are the result of "rough consensus", which is all about building community amongst Internet technologists.

I'd argue that RFC-like documents, and an RFC-like process, are essential to building a shared, cross-institutional understanding of what digital curation means.

I also think that the extent to which the digital-curation community sees its problem space as not entirely unique will be the extent to which it thrives. The technical and social problems are inseparable, and they are shared quite widely by folks outside academia.

Dorothea, an interesting and challenging piece as ever. Several things I wanted to respond to. You say you are "not at all sure we need to prove ab initio that keeping data is a good thing". Well, yes, I kind of agree... but I'm also quite sure that keeping all data is not a good thing. So keeping some, but not all, data is good. Which data? Ah, that's a question for much, much more debate (one could postulate some classes of data, but specifying a good set of data appraisal criteria is still a really tough challenge).

I also agree that there is no "killer-app magic bullet that will take an unholy mess of undescribed, undifferentiated digital stuff and miraculously organize it", and further that "Data curation requires skill, time, process change (a tall order all by itself), and resources". But two things occur to me here.

To first order, dealing with the mess and providing the skills and changing processes is not the library's job, or any other "central" organisation's job. Dealing with data is the researcher's job. The way forward is to make it increasingly clear that data messes equal bad, irreproducible research. Good data management is essential for good research. Period. The only way out of this that I can see (other than bribery and scandal as motivators, both of which we might be getting) is to include better data management training in the preparation of new researchers, i.e., PhD and post-doc courses. And that's a truly long-haul approach. Once we have some better-managed, better-curated data, then some central or shared group (e.g., library, data centre, whatever) has a reasonable chance of ingesting it and making it available. But rubbish data should be rejected. Always.

However, the second thing is that managing data in a research context is hard, and as far as I can see the tools (and standards) are not very good. There are some, but they tend not to be portable, and to be limited to a subset of disciplines. Even making sure your research group backs up its data is hard when its members use three different operating systems on three continents with three different sets of institutional requirements. Getting some "killer apps" to make that hard-grind technical stuff that bit easier (or even feasible, in some cases) would sure help to make the culture change work.

No-one could have forced academia to adopt the web if we had stuck with Lynx, or whatever the character-mode browser was. It took a smart set of standards AND a good piece of technology (Mosaic etc.) to allow academics (and eventually others) to see how it could make their lives easier.

By Chris Rusbridge (not verified) on 05 Jan 2010 #permalink

Chris, may I promote that comment to a guest post? :)