Integrate. Annotate. Federate.

Following on from yesterday's post, where I wrote about the four functions that traditional publishers claim as their space (registration, certification, dissemination, preservation), I want to revisit an argument I made last week at the British Library.

In my slides, I argued that the web brings us at least three additional functions: integration, annotation, and federation. I wanted to get this argument out onto the web and get some feedback...

Let's start with integration. The article no longer sits on a piece of dead tree, inside a journal organized by date and volume and page number. It exists as a digital entity, capable of dense integration into other digital entities. One way to think of this is to consider how the citation is truly weak tea compared to the hyperlink - an individual citation carries more weight than an individual hyperlink, but the hyperlink is so easy to create, and carries so much power in aggregate, that we get Google. Citations are the only way most articles are integrated with other articles, and that simply has to change.

Articles need to be integrated with lots of other digital information. Media is an obvious one, and the Elsevier-Cell "article of the future" seems to start here with an interview with the authors. To me this is absurd, and the height of how a "big company" thinks "the users" use the web. I don't want to hear an author interview with a reporter. I assume the author is going to say his or her work is sweets and sparkles and Nobel prizes. I'd rather see an embedded high-resolution video of all the protocols necessary to replicate the experiment, like the ones you get from JoVE (I'd like them to actually be open access too, but that's a different blog post).

If you want to make the article of the future, start with integration and work backwards. Don't start with the article and work forward, because you'll be trapped in document mentality instead of the network mentality.

We don't just want the data downloadable; we want to be able to run the same algorithms the author ran on the data, and adjust the variables ourselves, to see whether the results are the output of statistical foul play or negligence. We want to be able to hide all the boring language that recapitulates past canon and focus on the new assertions - unless, of course, the author is trying to game the past canon and shade the facts. And we want to be able to effortlessly click out and get data about the assertions in the paper from other databases - when a gene is mentioned, we should be able to one-click and run any number of core queries against the sequence and its ontological classifications, order genetic materials from biobanks, and so forth.
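
To make that concrete, here is a minimal sketch, assuming the article shipped its data as a simple CSV and exposed the analysis as runnable code. The file name, the column name, and the outlier-cutoff parameter are all hypothetical; the point is that the reader, not just the author, gets to turn the knobs and see whether the result survives.

```python
# A minimal sketch, assuming the article shipped its data as a CSV and its
# analysis as runnable code. The file name, the "measurement" column, and the
# outlier-cutoff parameter are all hypothetical.
import csv
import statistics

def rerun_analysis(data_path, outlier_cutoff=3.0):
    """Recompute the headline statistic, letting the reader pick the cutoff
    instead of trusting whatever value the authors chose."""
    with open(data_path, newline="") as fh:
        values = [float(row["measurement"]) for row in csv.DictReader(fh)]
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    kept = [v for v in values if abs(v - mean) <= outlier_cutoff * stdev]
    return statistics.mean(kept), len(values) - len(kept)

# Try the published cutoff, then a stricter one, and see if the story changes.
for cutoff in (3.0, 2.0):
    mean_kept, dropped = rerun_analysis("figure2_data.csv", outlier_cutoff=cutoff)
    print(f"cutoff={cutoff}: mean={mean_kept:.3f}, points dropped={dropped}")
```

The same logic applies to the one-click gene queries: the article carries machine-actionable hooks to the data and the code, and the reader's tools do the rest.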

Annotation is the second new essential function. The old method of annotation is to write a new paper that validates, invalidates, extends, or otherwise affects the assertions made in an old paper - or, if something is really wrong, there might be a letter to the editor or a retraction. In a wiki world, this is fundamentally insane. The paper is a snapshot of years of incremental knowledge progress. We have much better technology to use than dead trees.

Of course, there isn't any incentive to take the wiki that is science and actually use a wiki to create and edit it. Scientists get tenure for papers, and egoboo is cold comfort. Annotation needs to be provided by publishers, and is being provided, but the next step is to create an open platform that actually tracks the kind of annotation-relationships that the web enables. Bloggers use trackback to create a formal hyperlink between blog posts, and the protocol can and should be extended to let us connect all sorts of things: articles, wiki pages, database entries, catalog pages for biological materials, data sets, and on and on. By making these link transactions - which exist anyway - explicit and trackable, and most importantly reportable, we'll create a currency that scientists will gladly spend. It won't be about "sharing" but instead about "publishing" more of the intermediate knowledge that currently gets left on the lab floor when the paper gets written.
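
As a rough sketch of what that extension might look like: the standard Trackback ping is just a form-encoded HTTP POST carrying a title, URL, and excerpt, so adding typed, directional link information is a small step. The "target" and "rel" fields below are hypothetical extensions (they are not in the Trackback spec), and all of the URLs are placeholders.

```python
# Sketch of a trackback-style ping that records a typed, directional link
# between two research objects. Standard Trackback fields: url, title,
# excerpt, blog_name. The "target" and "rel" fields are hypothetical
# extensions; the URLs are placeholders.
from urllib import parse, request

def ping_typed_link(trackback_url, source_url, target_url, rel, title, excerpt=""):
    payload = parse.urlencode({
        "url": source_url,    # standard field: the resource doing the linking
        "title": title,       # standard field
        "excerpt": excerpt,   # standard field
        "target": target_url, # hypothetical: the resource being linked to
        "rel": rel,           # hypothetical: the link type, e.g. "reuses-data"
    }).encode("utf-8")
    req = request.Request(trackback_url, data=payload,
                          headers={"Content-Type": "application/x-www-form-urlencoded"})
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")  # Trackback replies with a small XML status

# Record that an article's analysis reuses a deposited data set.
print(ping_typed_link(
    "https://example.org/trackback/12345",
    source_url="https://example.org/articles/12345",
    target_url="https://example.org/datasets/67890",
    rel="reuses-data",
    title="Article 12345 reuses dataset 67890",
))
```

Because each ping is an explicit transaction against a known endpoint, the same records that create the links can be aggregated and reported - which is where the currency comes from.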

Federation is the last essential new function I'll deal with here (I have theories about other long-term essential ones, but they're poorly formed in comparison). By federation I mean the ability to take a set of articles and federate them into a corpus with other materials. There are a lot of reasons one might want to do this: text mining, semantic indexing, integration with information that is private, and so forth. It's great to be able to read articles on the web. But if we're going to really explode the way we communicate, the ability to cache local copies (or cloud copies) in new formats for new kinds of analysis, and the right to then distribute the resulting corpus for follow-on innovation and exploration, is going to be central.
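
Here's a toy sketch of the federation step, assuming the articles have already been cached locally as plain-text files; the directory name and file layout are hypothetical, and a real pipeline would have to normalize XML, PDF extractions, and the rest first.

```python
# Build a tiny searchable corpus out of locally cached open-access articles.
# The "oa_article_cache" directory and its one-text-file-per-article layout
# are hypothetical.
import os
import re
from collections import defaultdict

def build_index(cache_dir):
    """Map each term to the set of cached article files that mention it."""
    index = defaultdict(set)
    for name in os.listdir(cache_dir):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(cache_dir, name), encoding="utf-8") as fh:
            for term in re.findall(r"[a-z0-9]+", fh.read().lower()):
                index[term].add(name)
    return index

index = build_index("oa_article_cache")
hits = index.get("autophagy", set())  # every cached article mentioning a term
print(f"{len(hits)} cached articles mention the term")
```

The code is the easy part; the hard part is the right to build and redistribute the corpus, which is exactly why federation belongs on the list of core functions.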

Publishers are so focused on the prevention of copying that they don't see the central business opportunity here: the human-readable, copyrighted version of the article is the least federation-friendly form of the article. Charge a fee to make the article beautifully machine-readable and give away the text - because the service of improving the technical aspects of the article is clearly a value-add that shouldn't be subject to a funder mandate.

Integration, Annotation, Federation. It's what the Web is all about. And if we can get to the point where publishers feel these as core responsibilities, the Open Access debate will have made a major leap. All of these create a world in which the text of the article itself is lower in economic value - and thus easily distributable - than the connectivity of that article into a larger web of information. OA is the beginning, not the end game, of making the web work for science the way it works for culture. Step two is all about the connectivity, and it's time to start arguing - loudly - for the right to start wiring the science together.



Fair enough, although I prefer RDFa to microformats. I do like the trackback protocol better for this particular function - it's easily extensible and lets us get fully directional typed links.

Mr Wilbanks, I read your post yesterday but didn't want to respond right away, because it's one of those things that makes you think deeply.

Thank you for today's post. I do multimedia projects for a science publishing house, and it's a coincidence that I have been asked to find out 'how' (it's no longer about 'what') our readers want to access information on science. Then I found your blogs! Happy days. I'm going back to uni to do digital anthropology, mainly to figure out a better science for the dissemination of information. Somehow I don't feel I can achieve that by doing an MSc in CompSci/Multimedia or whatever. My hunch (not scientific, I know) is telling me that I've got to start with ethnography/ethnology first.

Keep us posted if you're doing a lecture in London again. I'd sure like to attend the next one in the Big Smoke.

Do you want every ranting conspiracy theorist to be able to leave permanent comments on your research papers? One very good thing about peer review is that it excludes lunatics and people who are utterly ignorant about the topic under consideration. And there are a lot of them, and they love to post things on the web. Just look at the "talk" page for a few controversial Wikipedia articles.