Open Data and Creative Commons: It's About Scale...

As part of the series of posts reflecting on the move of Science Commons to Creative Commons HQ, I'm writing today on Open Data.

I was inspired to start the series with open data by GSK's remarkable contribution to the public domain of more than 13,000 compounds known to be active against malaria. GSK was the first large corporation to use the CC0 tool to make its data open data. CC0 is the culmination of years of work at Creative Commons, and the story's going to require at least two posts to tell...
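To see what a CC0 dedication looks like when it's attached to a dataset in machine-readable form, here's a minimal sketch using Python's rdflib and Dublin Core terms. The dataset URI and title are hypothetical placeholders, not the actual GSK release; the CC0 URI is the real dedication.

```python
# Minimal sketch: marking a dataset as CC0 in machine-readable metadata.
# The dataset URI and title are hypothetical; the CC0 URI is the real dedication.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
CC0 = URIRef("https://creativecommons.org/publicdomain/zero/1.0/")

dataset = URIRef("http://example.org/datasets/malaria-compounds")  # placeholder URI

g = Graph()
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Antimalarial compound screening data")))
g.add((dataset, DCTERMS.license, CC0))

print(g.serialize(format="turtle"))
```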

Opening up data was a founding aspect of the Science Commons project at CC. I came to the Creative Commons family after spending six years mucking about in scientific data, first trying to make public databases more valuable at my startup Incellico, and later at the World Wide Web Consortium (W3C) where I helped launch the interest group on the semantic web for life sciences. When I left the W3C in late 2004, data was my biggest passion - and it remains a driving focus of everything we do at Creative Commons.

Data is tremendously powerful. If you haven't read the Halevy, Norvig & Pereira article on the unreasonable effectiveness of data, go do so, then come back here. It's essential subtext for what we do at Creative Commons on open data. But suffice to say that with enough data, a lot of problems become tractable that were not tractable before.

Perhaps most important in the sciences, data lets us build and test models. Models of disease, models of climate, models of complex interactions. And as we move from a world in which we analyze at a local scale to one where we analyze at global scale, the interoperability of data starts to be an absolutely essential pre-condition to successful movement across scales. Models rest on top of lots of data that wasn't necessarily collected to support any given model, and scalable models are the key to understanding and intervening in complex systems. Sage Bionetworks is a great example of the power of models, and the Sage Commons Congress a great example of leveraging the open world to achieve scale.

Building the right model, and responding to the available data, is the difference between thinking we have 100,000 genes or 20,000. Between thinking carbon is a big deal in our climate or not. And scale is at the heart of using models. Relying on our brains to manage data doesn't scale. Models - the right ones - do scale.

My father (yeah, being data-driven runs in the family!) has done years of important work on the importance of scale that I strongly recommend. His work relates to climate change and climate change adaptation, but it applies equally to most of the complex, massive-scale science out there today. Scale - and integration - are absolutely essential aspects of data, and it is only by reasoning backward from the requirements imposed by scale and integration that we are likely to arrive at the right use cases and tasks for the present day, whether that be technical choices or legal choices about data.

We chose a twin-barreled strategy at Creative Commons for open data.

First, the semantic web was going to be the technical platform. This wasn't out of a belief that somehow the semantic web would create a Star Trek world in which one could declaim "computer, find me a drug!" and get a compound synthesized. We instead arrived at the semantic web by working backward from the goal of having databases interoperate the way the Web interoperates, where a single query into a search engine would yield results from across tens of thousands of databases, whether or not those databases were designed to work together.

We also wanted, from the start, to make it easy to integrate the web of data and the scholarly literature, because it seemed crazy that articles based on data were segregated away from the data itself. The semantic web was the only option that served both of those tasks, so it was an easy choice - it supports models, and it's a technology that can scale alongside its success.
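To make that concrete, here is a rough sketch of what "one query across many databases" looks like in practice: a single SPARQL query sent to an integrated endpoint, with no knowledge of which underlying database each fact came from. The endpoint URL and the vocabulary in the query are purely illustrative, not a real service.

```python
# A sketch of the kind of query the semantic-web approach enables: one SPARQL
# query against an integrated endpoint, regardless of which source database
# each triple came from. The endpoint URL and vocabulary are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
endpoint.setQuery("""
    PREFIX ex: <http://example.org/vocab#>
    SELECT ?gene ?disease
    WHERE {
        ?gene  ex:associatedWith ?disease .
        ?paper ex:mentions       ?gene .
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["gene"]["value"], row["disease"]["value"])
```

The point isn't the syntax; it's that the same query keeps working no matter how many sources sit behind the endpoint.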

The second barrel of the strategy was legal. The law wasn't, and isn't, the most important part of open data - the technical issues are far more problematic, given that I could send out a stack of paper-based data with no legal restraints and it'd still be useless.

But dealing with the law is an essential first step. We researched the issue for more than two years, examining the application of Creative Commons copyright licenses for open data, the potential to use the wildly varied and weird national copyright regimes or sui generis regimes for data, the potential utility of applying a contract regime (like we did in our materials transfer work), and more. Lawyers and law professors, technologists and programmers, scientists and commons advocates all contributed.

In our search for the right legal solution for open data, we held conferences at the US National Academy of Sciences, informal study sessions at three international iSummit conferences, and finally a major international workshop at the Sorbonne. We drew in astronomers, anthropologists, physicists, genomicists, chemists, social scientists, librarians, university tech transfer offices, funding agencies, governments, scientific publishers, and more. We heard a lot of opinions, and we saw a pattern emerge. The successful projects that scaled - like the International Virtual Observatory Alliance or the Human Genome Project - used the public domain, not licenses, for their data, and managed conflicts with norms, not with law.

We also ran a technology-driven experiment. We decided to try to integrate hundreds of life science data resources using the Linux approach as our metaphor, where each database was actually a database package and integration was the payoff. We painstakingly converted resource after resource to RDF/OWL packages. We wrote the software that wires them all together into a single triple store. We exposed the endpoint for SPARQL queries. And we made the whole thing available for free.
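For a sense of what "wiring packages into a single triple store" involves at the smallest possible scale, here's a hedged sketch using rdflib: each converted RDF/OWL package is parsed into one graph, and queries then run across the merged whole. The file names and query are illustrative, not the project's actual packages.

```python
# Minimal sketch of merging converted RDF/OWL packages into one triple store
# and querying across them. File names and vocabulary are illustrative.
from rdflib import Graph

g = Graph()
for package in ["go.owl", "mesh.rdf", "pubmed-subset.rdf"]:  # hypothetical packages
    g.parse(package, format="xml")  # assuming RDF/XML serialization

print(f"Loaded {len(g)} triples")

# One query over the merged store, spanning whatever sources contributed triples.
results = g.query("""
    SELECT ?s ?label
    WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }
    LIMIT 5
""")
for s, label in results:
    print(s, label)
```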

As part of this, we had to get involved in the OWL 2 working group, the W3C's technical architecture group, and more. We had to solve very hairy problems about data formats. We even had to develop a new set of theories about how to promote shared URIs for data objects.
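The shared-URI problem is easier to see with a toy example: if two independent sources use the same URI for the same object, merging their graphs joins the facts automatically; if each source mints its own identifier, the data stays fragmented even after it's loaded into one store. A small sketch, with illustrative URIs and predicates:

```python
# Sketch of why shared URIs matter: when two sources use the same URI for the
# same object, a graph merge joins their facts with no extra work.
from rdflib import Graph, URIRef, Literal, Namespace

EX = Namespace("http://example.org/vocab#")
gene = URIRef("http://example.org/id/gene/TP53")  # the shared identifier

source_a = Graph()
source_a.add((gene, EX.associatedWith, Literal("Li-Fraumeni syndrome")))

source_b = Graph()
source_b.add((gene, EX.encodes, Literal("Cellular tumor antigen p53")))

merged = source_a + source_b  # graph union; both facts now hang off one node
for predicate, obj in merged.predicate_objects(gene):
    print(predicate, obj)
```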

Like I said, the technology was a hell of a lot hairier than the law. But it worked. We get more than 37,000 hits a day on the query endpoint. There are at least 20 full mirrors in the wild that we know of. It's beginning to scale.

But because of the law, we also had to eliminate good databases with funky legal code, including funky legal code meant to foster more sharing. We learned pretty quickly that the first thing you do with those resources is throw them away, even when the licensing was well-intentioned. The technical work is just too hard; adding legal complexity to the system made it intolerable. When you actually try to build on open data, you learn quickly that unintended use tends to rub up against legal constraints of any kind, share-alike constraints as much as commercial ones.

We would never have learned this lesson without actually getting deep into the re-use of open databases ourselves. Theory, in this case, truly needed to be informed by practice.

What we learned, first and foremost, was that the combination of truly open data and the semantic web supports the use of that data at Web scale. It's not about open spreadsheets, or open databases hither and yon. It's not about posting tarballs to one's personal lab page. Those are laudable activities, but they don't scale. Nor does applying licenses that impose constraints on downstream use of the data, because the vast majority of uses of data aren't yet known, and today's legal code might well prevent them. Remember that the fundamental early character of the Web was public domain, not copyleft. Fundamental stuff needs fundamental treatment.

And data is fundamental. We can't treat it, technically or legally, like higher level knowledge products if we want it to serve that fundamental role. The vast majority of it is in fact not "knowledge" - it is the foundation upon which knowledge is built, by analysis, by modeling, by integration into other data and data structures. And we need to begin thinking of data as foundation, as infrastructure, as a truly public good, if we are to make the move towards a web of data, a web that supports models, a world in which data is useful at scale.

I'll return to the topic in my next post to outline exactly how the Creative Commons toolkit - legal, technical, social - serves the Open Data community.
