I'd love to dance with you, but...

Richard Wallis of Talis (a library-systems vendor) posted The Data Publishing Three-Step to the Talis blog recently.

My reaction to this particular brand of reductionism is… shall we say, impolitic. I just want to pat Richard on the head and croon "Who's the clever boy, then? You are! Yes, you are!" This is terrible of me, no question about it, and I apologize unreservedly.

Here's the problem, though. Aside from my friends the open scientists (and not even all of them, to be honest), practically all the data-producing researchers I know are firmly stuck on Step 1. Firmly stuck, not to say "immovably." As for Step 2… trust me, these folks are not data modellers. I sincerely doubt my own capacity to teach RDF to someone who approaches me asking, "Is it okay if I record my data in Excel?"

Noting that I have been a longtime RDF skeptic so that you all can discount my peculiar biases, I will say that this disconnect between Linked Data proponents and Joe Q. Researcher concerns me a great deal, mirroring as it does the prior disconnect between RDF advocates and web programmers and content producers, a disconnect that has thus far prevented RDF from becoming common currency on the web.

The bar is too high, folks. It is too high. For my part, I'm starting somewhere both simpler and more complex: working on convincing people that exposing data in any form, emphatically including Excel, is a worthwhile thing to try.

Excellent post, with which I whole-heartedly agree. This gets back to an exchange we had some months back (in which I was proposing a "killer app for data"), and maybe I can reiterate a bit of my thinking in the context of this post. What Richard Wallis wants is rich data that allows interoperability, reuse, etc. Setting aside whether that is realistic, it is certainly a worthy goal. What you want is simple, straightforward tools that don't distract from the task at hand. Ideally tools that are familiar, that we already know. Honestly, I think Excel *is* the killer app. It's well-known, has a low barrier to entry, etc. But the gap can and should be bridged -- there's no reason a spreadsheet cannot be exported as RDF. And perhaps scholars can share not only spreadsheets but spreadsheet templates (shared metadata schemas FTW). This is how the Linked Data vision will take hold, *not* by lots of people learning RDF and related tech. Give me an Excel template and a "publish to the web" button any day.
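To make the "spreadsheet exported as RDF" bridge concrete, here is a minimal sketch of what such an export could look like, assuming the sheet has been saved out as CSV and using Python with the rdflib library. The vocabulary namespace, the column-to-property mapping, and the file names are all hypothetical, not features of any existing tool.

```python
# Minimal sketch: turn each spreadsheet row into a handful of RDF triples.
# Assumes the Excel sheet was saved as samples.csv with a header row;
# the example.org namespace and file names are placeholders.
import csv
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")  # hypothetical shared vocabulary

g = Graph()
g.bind("ex", EX)

with open("samples.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f), start=1):
        subject = URIRef(f"http://example.org/sample/{i}")
        for column, value in row.items():
            if value:  # skip empty cells
                predicate = EX[column.strip().lower().replace(" ", "_")]
                g.add((subject, predicate, Literal(value)))

g.serialize(destination="samples.ttl", format="turtle")
```

A shared spreadsheet template would effectively pin down the column-to-property mapping, so the researcher never has to look at the Turtle at all.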

Agreed wholeheartedly here. The group of people who are generating useful biological data is almost entirely non-overlapping with the group who are thinking of useful ways to model such data. The only way this beast will be tamed is for the relatively large number of people who'd be willing to upload an Excel file to just go ahead and do that, and then let the people who are looking for data to model go to town on it, eventually demonstrating to the biologist who uploaded it that it was a worthwhile thing to do.

Data and tools for working with such data co-evolve!

Seems to me that this is a good example of how the 'two-node pipeline between researcher and an open and linked dataset online' model just isn't realistic. Just the quickly-mentioned note in Step 1 -- 'The terms, codes, identifiers etc. you use may be meaningless, or worse still ambiguous, to those outside your organisation, or even your department' -- can easily take a researcher/grad student/data manager months or more to work through in some fields where practices and semantics are still largely local (department? Try lab!). I don't think it's reasonable to shift responsibility for keeping up with the latest data trends and tools (much less implementing them) that far upstream.

My worry is that all of this fun information stuff (which to me includes linking) has a steep learning curve full of new terminology, changes quickly, is being sold as simple, and is adding new roles and responsibilities faster than anyone can pick them up. At this point, those responsibilities are being heaped on researchers... I can't blame them for setting a nice, strong boundary at Step 1 (a.k.a. being 'firmly stuck')!

Seems to me that broadening that pipeline model to include local and community data centers, libraries, and others is a more reasonable approach to closing the disconnect. It takes time to find support and to develop expertise and infrastructure; unfortunately, that is much more time than is needed to develop the next new tool/trend/technology.

Mr. Gunn and Lynn echo my thoughts precisely. This just goes to show that BoT commenters are smarter than I am.

I am a fan of Semantic Web technology like RDF and work on publishing library data as linked data. So I am not an RDF sceptic, but I agree with you. And I believe Tim Berners-Lee also agreed with you when he asked for raw data now (http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html) - and Excel is fine. We can't ask researchers to transform their data into RDF. Others, who are interested in that and able to do so, should do it. Computer software could do it; Microsoft could build an RDF-export tool for Excel. What is more important is that researchers make their data available as open data. For me this means two things: rights and documentation. Allow anybody who wants to use or transform the data to do so in any way she or he wants to. And document the data so that others are able to make sense of it. Give us Excel, but please tell us what those rows and columns mean.
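On the documentation point, even a tiny data dictionary shipped next to the spreadsheet would go a long way. Here is a hedged sketch in Python of what that might look like; the column names, descriptions, units, and file name are invented purely for illustration.

```python
# Sketch of a small data dictionary published alongside an Excel/CSV file.
# The columns, descriptions, and units below are made up for illustration.
import json

data_dictionary = {
    "sample_id": {"description": "Lab-assigned sample identifier", "unit": None},
    "collected": {"description": "Date the sample was collected", "unit": "ISO 8601 date"},
    "temp_c": {"description": "Water temperature at collection", "unit": "degrees Celsius"},
}

with open("samples.dictionary.json", "w") as f:
    json.dump(data_dictionary, f, indent=2)
```

Human-readable prose in a README would do just as well; the point is simply that the meaning of the rows and columns travels with the data.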

Just by coincidence, I saw this little demo on using Excel for linked data solutions in my feed reader.

I do think these debates sometimes proceed with unstated goals and requirements. To really boil down the debate, it seems to me linked data advocates (of which I'd consider myself one) put more work upfront into producing and distributing data to make consuming it much easier, while critics often want to get the data out with minimal work, but as a consequence make it more difficult to consume. In the end, it's all about tradeoffs. And along with some of the other comments here, I think there's room for considering a continuum of practices that could be sympathetic, rather than simply seeing diametrically opposed camps.

Bruce, I think you may be mischaracterizing my criticism just a tiny bit, so let me try to rephrase it.

I'm all FOR putting more upfront work in. I absolutely adore clean pretty data. I just believe quite strongly that any data is better than none, and I am also aware from considerable experience and observation that researchers are always unwilling and often unable to put the kind and amount of work in that making a dataset into Linked Data would entail.

Rather than lose the data entirely, I'll take the non-LD form of it.

Moreover, my sense is that data reusers have more intrinsic incentive to put in effort to munge a useful dataset to their needs than data originators have to LD-ify it upfront. I'm content to work with that tendency rather than try to wrench the motivations of data originators into a more useful configuration -- that's a land-war-in-Asia sort of battle.

I do understand that LD advocates aren't that patient, but I genuinely do believe (and this is me getting out my crystal ball, so call me on it if I turn out to be wrong!) that LD-as-RDF will die a very similar lingering death to RDF-on-the-Web, and for basically the same reasons: insufficient incentive for data generators, and a much-too-high learning and implementation curve.

I expect data sharing in many forms to flourish, however.

In my opinion, it's unwise to make a clean distinction between data re-users and data originators. Top-notch, real-world research is done using data that has some provenance to it (perhaps even most research), and work that "synthesizes" generally "produces" as well. That the debate is so often couched as being between diametrically opposed camps, as Bruce stated, is a real shame. Frankly, who cares about RDF (I don't particularly), but right now we need data formats that are easy to share. The "default" we're talking about is 1. proprietary binaries (Excel documents) or 2. text/csv, a quite ambiguously specified "standard." We should *all* be asking for both ease-of-use AND optimal share-ability. Like it or not, you *are* using data formats. As "professionals," is it not worth putting a bit of thought into them? (And, it goes without saying, making it as easy as possible for scientists to "do the right thing"?) As far as I am concerned, criticizing RDF is fine, but I do question the wisdom of criticizing what the Linked Data effort is trying to accomplish (or suggesting that its goals are not in every researcher's interest).

I'm not criticizing what they're trying to accomplish. I'm trying to accomplish it too. I just think their dogmatic insistence on RDF is getting in everyone's way, them emphatically included, especially when they imply that RDF and high-level data modeling are mandatory.

Am I thrilled by Excel or CSV? Hell no. Hate 'em both. Realistically, though, I can get Excel and CSV, and I can't get Linked Data no matter how hard I try. So I can be RDFly pure and turn away anything that isn't, or I can collect data.

I know which I'm picking, that's all. In the long run, sure, I expect life will converge on something that looks a fair bit more Linked Data-ish than things do now. I welcome that day. It's just not going to happen by telling Joe Q. Researcher that he can't even play in this space unless he speaks Linked Data. "Fine," says Joe Q. "I don't play, then."

That's what the web did to RDF. That's what Joe Q. did to institutional repositories. I have yet to see a comment explaining to me clearly and lucidly why the same thing won't happen here.

And to be perfectly clear, if it doesn't happen, it won't be because of me. If I find a researcher interested in Linked Data, I'll gladly explain everything I know about RDF and cool URIs and so forth. You betcha.

Still waiting to hear from that researcher, though... *crickets*

Yes, I think we agree. As far as I am concerned, researchers should absolutely not need to know anything about RDF or the "Linked Data" movement (that's data geek stuff). Maybe a bit about unambiguous identifiers, but other than that, good data hygiene should be built into the tools (and they need to be as easy as or easier than Excel).

--a non-proselytizing data geek ;-)

Still recovering from the surprise pat on the head, I have posted my further thoughts on step one - One Step at a Time.

I certainly wasn't trying to say that data producers need to change their methods and understanding to include RDF. I apologise if that was the impression gained.

(Step-one-only people, stop reading here...)

As to RDF underpinning the Linked Data Web -- it is only as necessary as HTML was to the growth of the Web itself. Documents were being posted on the Internet in all sorts of formats well before Tim Berners-Lee introduced us to the open, shared HTML format that facilitated the exponential growth of the Web. Some of the above comments are very reminiscent of the "why do I need to use HTML?" discussions from the mid-1990s.

It is an open and shared format, such as RDF, that will power the exponential growth of the Linked Data web, but the conversations around it are still at the equivalent of the 1995 stage.