I’ve been working on some text for a series of papers lately. I’m writing the core of a book proposal and working through the ideas around the knowledge web and the knowledge economy, and thought I’d post some interim thoughts here.
Knowledge is a funny thing. Philosophers have spent eons debating it. I’m not going to figure it out here – in fact, the conclusion that I wasn’t going to figure it out played a big role in my choosing not to go to graduate school. But on the web, we have these things that are kind-of-knowledge. Databases. Journal articles. Web pages. Ontologies.
Taken together, these things are somewhere in the epistemological chain. But the act of digitizing them does some strange things…they start to form an observable, computable network, a knowledge web of sorts. And in a knowledge web, we have to understand an important conceptual transformation – that knowledge itself needs to be treated as something like software, something upon which computing happens and depends – and the implications of that transformation.
The great revolutions of the internet, the web, and free software were all predicated on access to sources and standards – a mix of technical and legal access. The internet didn’t really have to deal with the law, since TCP/IP doesn’t touch copyright. The web ignored copyright from a legal perspective, but actively encouraged viewing and copying from a technical perspective. Free software embedded legal freedoms inside the technical access concept.
But knowledge is different, as the vast majority of the canon is already embedded in creative works protected by copyright. Thus, we have to unlock some content if we’re going to reformat it into something that can be treated as an interim step along the way to knowledge, and then used as cyberinfrastructure. This is why Open Access is so crucial. Whatever knowledge is, a lot of it is locked behind paywalls, tied up in copyright licenses, or trapped in formats that are lousy from a machine perspective.
But – if we have access – if we can take the individual facts described in papers and turn them into modelable knowledge, or at least precursors to knowledge, we convert those facts into infrastructure: raw material for composition into structures that software can use.
This transformation is already under way in the life sciences. Most of the valuable cyberinfrastructure (CI) data in the life sciences has been hand-curated out of journal articles into more structured sources like the Kyoto Encyclopedia of Genes and Genomes, the Human Protein Reference Database, Information Hyperlinked over Proteins, and on and on.
This needs to be accelerated and industrialized, as the human-readable paper is the least valuable format for knowledge from a CI perspective. But this requires an understanding of access to the knowledge canon as a fundamental lever of CI construction in a knowledge web. Unfortunately, most of these databases carry copyright or contractual restrictions that make it impossible to build on them as infrastructure – particularly non-commercial restrictions, or restrictions on redistribution in federated or integrated knowledgebases. That’s why open access to databases is essential as well.
We are lucky to have vast numbers of public-domain databases that are, from a CI perspective, un-networked. The scientist needs to open a dozen or more tabs in a browser and use her own mind to integrate the results. That’s lousy. But it’s a natural outcome of the web not integrating databases the way it integrates documents, and at least the legal terms let us start to integrate.
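To make the un-networked problem concrete, here’s a toy sketch of the integration a scientist currently does in her head across browser tabs: two public-domain databases describe the same gene under different local identifiers, and nothing joins them until someone supplies the glue. All the database names, identifiers, and the `integrate` function here are hypothetical illustrations, not real APIs.

```python
# Two hypothetical public-domain databases, each keyed by its own
# local identifier scheme. On the document web, nothing joins them.

# A hypothetical pathway database.
pathway_db = {
    "hsa:7157": {"pathways": ["p53 signaling", "apoptosis"]},
}

# A hypothetical protein database, using a different identifier scheme.
protein_db = {
    "P04637": {"protein": "Cellular tumor antigen p53"},
}

# The crucial, usually missing piece: a mapping between the two schemes.
id_map = {"hsa:7157": "P04637"}

def integrate(gene_id):
    """Join the records for one gene across the two identifier schemes."""
    record = dict(pathway_db.get(gene_id, {}))
    record.update(protein_db.get(id_map.get(gene_id), {}))
    return record

print(integrate("hsa:7157"))
```

The point of the sketch is that the hard part is not the merge itself but the identifier mapping and the legal right to redistribute the merged result – which is exactly what restrictive database licenses foreclose.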
There are competing philosophies about how to deal with the integration – I follow the one that holds, again, that software is the right metaphor for knowledge integration, and that data integration is a plausible first place to start working on it. It’s certainly better than getting stuck in an infinite loop arguing about what knowledge is. It’s a funny place to come back to ontological realism after nearly 20 years away from the academy, but this approach does demand a certain amount of it: you’re dealing with database records that need to be reconciled, not ideas like “gene” and so on, and if you’re going to write code about them, realism helps. But I digress. Back to software integration.
The way we integrate software in free software is via the *distribution* – a community using a standard set of kernel interfaces to knit together multiple software packages. This is a model for data integration too, and the Science Commons Neurocommons project – released in October 2008 – is the first such effort that I know of; we’re already seeing some encouraging early returns (I love the version that a user installed on the Amazon cloud). The idea is to let users who like our modeling and ontological work simply expose a version of their database using our standards; then any user or community that wants to add that database to the distribution can do so with minimal effort, just like adding a new software package to a Linux distribution.
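That distribution idea can be sketched in a few lines of Python – this is a toy under my own assumptions, not the project’s actual RDF/SPARQL machinery, and every class and method name here is hypothetical. The kernel is one standard interface; any database exposed through it can be registered, and queries then run across every registered source.

```python
# Toy sketch of the "distribution" model: a small kernel of standard
# interfaces plus pluggable data sources. Any database that exposes the
# standard triples() interface can be registered, much as a package is
# added to a Linux distribution. (All names here are hypothetical.)

class Distribution:
    def __init__(self):
        self.sources = []

    def register(self, source):
        """Add a data source that conforms to the standard interface."""
        self.sources.append(source)

    def query(self, subject):
        """Collect every (predicate, object) fact any source has about subject."""
        facts = []
        for source in self.sources:
            facts.extend(source.triples(subject))
        return facts

# Two hypothetical community databases exposing the same interface.
class GeneDB:
    def triples(self, subject):
        if subject == "TP53":
            return [("located_on", "chromosome 17")]
        return []

class DiseaseDB:
    def triples(self, subject):
        if subject == "TP53":
            return [("associated_with", "Li-Fraumeni syndrome")]
        return []

dist = Distribution()
dist.register(GeneDB())
dist.register(DiseaseDB())  # adding a source is one line, like adding a package
print(dist.query("TP53"))
```

The design choice mirrors the distribution analogy: the kernel (the `triples` interface) stays small and stable, and all the growth happens at the edges, in the sources people contribute.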
Note that we are assuming from the beginning that everyone has a different idea of knowledge – people will disagree with our models, and we’ve pre-emptively guaranteed the right to “fork” knowledge like software so that each community can craft its own solution based on our kernels.
This is all a way of trying to leverage techniques we’ve seen work for building complex systems from distributed inputs. It might work – I hope it does. It might also be an evolutionary step along the path. But clearly we need some evolution away from the human-readable paper and the standalone database as containers for the things we know, or the things we believe. The information space is simply too big for any one brain to process anymore, and Google simply isn’t as efficient for science as it is for culture…