I was asked in an interview recently about “open source science,” and it got me thinking about how frequently we in the “open” communities of practice over-simplify the realities of how software like GNU/Linux actually came to be. Open source is a worldview about software development, not a universal truth that can be easily exported. It’s well worth unpacking that worldview to understand it, and then looking at the realities of open source software as they map – or, more frequently, do not map – to science.
The foundations of open source software are relatively easy to trace. In the beginning, there was free software and Richard Stallman. RMS didn’t just invent the GPL as a legal tool; he wrote crucial foundational software for writing software, notably the GNU Compiler Collection, the GNU Debugger, and the original Emacs. So from the beginning, there was not only a free legal tool, but tools for coding that were better than the other systems of the time.
Simultaneously, the emergence of microcomputers and ubiquitous internet access expanded the number (and interconnectivity) of potential programmers. Suddenly there were tens of thousands of programmers with computers at home and at work. The explosion of the Web brought infrastructure like code repositories, version control systems, and coding communities. Thanks to object-orientation, software was also very amenable to being broken into defined, modular chunks and tasks. One coder could work on a kernel function, another on a user interface function, a third on an application, and they could be reasonably sure that, as long as they all followed the standards, their work would snap together into the growing distribution (the sketch below illustrates the kind of contract that makes this possible). The phrase “open source” has become a shorthand for this kind of innovation, which we also see in Wikipedia and other community-built projects.
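To make that modularity point concrete, here is a minimal sketch in Python – all the names (`StorageBackend`, `MemoryBackend`, `save_document`) are hypothetical, invented for illustration. The idea is simply that a shared interface lets independently written pieces compose without their authors ever coordinating directly.

```python
from abc import ABC, abstractmethod


# A hypothetical shared standard: any storage backend a contributor
# writes must honor this contract.
class StorageBackend(ABC):
    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...


# One contributor implements the contract in memory...
class MemoryBackend(StorageBackend):
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._store[key]

    def write(self, key: str, data: bytes) -> None:
        self._store[key] = data


# ...while another contributor, working independently, codes against
# the interface alone. The two pieces snap together at integration time.
def save_document(backend: StorageBackend, name: str, text: str) -> None:
    backend.write(name, text.encode("utf-8"))


backend = MemoryBackend()
save_document(backend, "readme", "hello, world")
print(backend.read("readme").decode("utf-8"))  # -> hello, world
```

Neither author needs to see the other’s code; the interface is the whole negotiation. That is the property scientific knowledge, written as narrative prose, conspicuously lacks.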
Open source, viewed through a different lens, is really a distributed methodology for software development. The burden of creation is spread across a massive community with more-or-less equal access to tools and systems. In this context, the role of the legal tool is more akin to that of an enzyme: it was an essential piece of the puzzle, but it was not the only piece. Without the rest of the infrastructure – connectivity, tools, and people – the legal tool on its own would not have led us to GNU/Linux.
Yet far too often, efforts to “port” open source to science concentrate on the legal aspects rather than on analyzing the infrastructure of science. Science, at this point, is actually not very similar to modern software. In science, especially life science, many of these factors don’t exist. There isn’t democratic access to tools: you tend to need a lab, which means you tend to need to work somewhere big enough to afford a lab, which tends to mean you need an advanced degree – which means there is no crowd, and thus the fundamentals for distributed science development aren’t there. And when we try to force open source onto a knowledge space that is fundamentally poorly structured for distributed development, we’ll not only be frustrated by our failures to replicate the GNU/Linux and Wikipedia successes, we’ll risk discrediting the idea of distribution itself.
Another problem: the open source approach is based on the open licensing of a powerful, reasonably well internationally harmonized property right – copyright. It doesn’t apply very well to science, where the intellectual property situation is far more often patents versus trade secrets than copyright versus copyleft. Copyrights are free to acquire, and thus easy to license at no cost as well: no one is forfeiting a $50,000-or-more investment when they license copyrighted code under copyleft. Patents are not so amenable to legal aikido, and they can kill a great idea in the cradle by tying up all the rights in a tangle of patent thickets and expensive licenses.
A third problem is that science is a long, long, long, long, long way from being a modular knowledge-construction discipline. Whereas writing code forces the programmer through a compiler, and the standard distribution forces a certain amount of interoperability, scientists typically write up their knowledge as narrative text. It’s written for human brains, not silicon compilers. Scientists are taught to think in a reductionist fashion, asking smaller and smaller questions to prove or disprove specific hypotheses. This system almost guarantees that the resulting work fails to achieve the modularity of software, and it binds scientists, through tradition, into a culture of writing their knowledge in a word processor rather than a compiler. Until we achieve something akin to object-orientation in scientific discourse, we’re unlikely to see distributed innovation erupt as it does in culture and code.
A fourth problem is collective action congestion: research institutions, tech transfer offices, venture capital, startups, and so forth all participate in science, and each adds friction. Software isn’t subject to these constraints – at least, most software isn’t. But science is like writing code in the 1950s: if you didn’t work at a research institution then, you probably couldn’t write code, and if you did, you were stuck with punch cards. Science is in the punch-card stage, and punch cards aren’t so easy to turn into GNU/Linux.
None of this is meant to discourage open approaches. We need to try. The problems we face, from neglected diseases to climate change to earthquake analysis to sustainability, are so complex that they’ll probably overwhelm any approach that is not inherently distributed – distributed systems scale much better than closed ones. But we should always understand the foundations, and closely examine our efforts to see whether we first need to build those foundations.
In the sciences, the first foundation is access to the narrative texts that form the canon of the sciences. Tens of thousands of papers are published every year. They need object-orientation – semantics – so that we can begin to treat that information as a platform rather than a consumable product. Licensing is part of this, but so are technology and scientific culture. Better ontologies, buy-in to technical standards, publisher participation in integration and federation, and more will be foundational to establishing content-as-platform. As the data deluge intensifies, this foundation grows ever more important, because the literature provides the context for the data. Moving to a linked web or semantic web without a powerful knowledge platform at the base is building a castle made of sand, close to the water line.
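What “content-as-platform” might look like is easiest to see in miniature. Below is a minimal sketch using Python’s rdflib library; the paper URI, the claims vocabulary, and the specific assertion are all hypothetical, invented for illustration. The point is that a claim expressed as triples can be consumed and recombined by software, not merely read by humans.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

# Hypothetical namespace for a scientific-claims vocabulary.
EX = Namespace("http://example.org/science/")

g = Graph()
paper = URIRef("http://example.org/papers/smith2009")  # hypothetical paper

# Bibliographic metadata via the standard Dublin Core vocabulary.
g.add((paper, RDF.type, EX.Paper))
g.add((paper, DC.title, Literal("Kinase X regulates pathway Y")))

# A structured claim – subject, relation, object – the kind of
# "object-orientation" that lets software, not just humans, consume it.
claim = URIRef("http://example.org/claims/smith2009-1")
g.add((claim, RDF.type, EX.Claim))
g.add((claim, EX.assertedBy, paper))
g.add((claim, EX.subject, EX.KinaseX))
g.add((claim, EX.relation, EX.upregulates))
g.add((claim, EX.object, EX.PathwayY))

# Serialize the graph as Turtle so other tools can ingest it.
print(g.serialize(format="turtle"))
```

Once claims live in a graph like this, they can be merged and queried across thousands of papers with SPARQL – the literature starts to behave like a platform rather than a pile of PDFs.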
Another foundation is access to tools and the creation of fundamental open tools. We need the biological equivalents of the C compiler and of Emacs. Stem cells, mice, vectors, plasmids, and more will need to be available outside the old boys’ club that dominates the modern life sciences. We need access to supercomputers that can run massive simulations for the earth and climate sciences. These tools need to be democratized to bring the beginnings of distributed knowledge creation into labs, with the efficiencies we know from eBay and Amazon (though some of these tools should perhaps be restricted to authenticated research scientists, so that we don’t get garage biologists accidentally creating a super-virus).
The legal aspects weave through these foundations. A license has the power to create freedoms, but the improper application of a licensing approach carries significant risks. The “open source” meme can feel a little religious about licenses, but it’s worth remembering that the GPL was born not of a desire to write a license, but of a desire to return programming to a free state. With data and tools, we have the chance to avoid the intellectual property trap completely – if we have the nerve for it.
There is some distributed innovation happening in new fields of science, like DIY biology, and in non-science communities, like patients sharing treatments and outcomes with each other. A quick examination of their foundations reveals why they are ripe for distribution: DIY biology can build on OpenWetWare, the Registry of Standard Biological Parts, and the availability of equipment and tools, while patients can connect using Web 2.0 and talk to each other without intermediaries. But this doesn’t yet scale into traditional science.
I propose that the point of this isn’t to replicate “open source” as we know it in software. The point is to create the essential foundations for distributed science so that it can emerge in a form that is locally relevant and globally impactful. We can do this. But we have to be relentless in questioning our assumptions and in discovering the interventions necessary to make this happen. We don’t want to wake up in ten years and realize we missed an opportunity by focusing on the software model instead of designing an open system out of which open science might emerge on its own.