I was asked in an interview recently about "open source science" and it got me thinking about the ways that, in the "open" communities of practice, we frequently over-simplify the realities of how software like GNU/Linux actually came to be. Open Source refers to a software worldview. It's about software development, not a universal truth that can be easily exported. And it's well worth unpacking the worldview to understand it, and then to look at the realities of open source software as they map - or more frequently do not map - to science.
The foundations of open source software are relatively easy to track. In the beginning, there was free software and Richard Stallman. RMS didn't just invent the GPL as a legal tool; he wrote crucial foundational software for writing software, notably the GNU Compiler Collection, the GNU Debugger, and the original Emacs. So from the beginning, there was not only a free legal tool, but tools for coding that were better than other systems at the time.
Simultaneously, we can see that the emergence of microcomputers and ubiquitous access to the internet expanded the number (and interconnectivity) of potential programmers. Suddenly there were tens of thousands of programmers with computers at home and at work. The explosion of the Web saw the creation of infrastructure like code repositories, version control systems, and coding communities. Thanks to object-orientation, software was also very amenable to being broken into defined, modular chunks and tasks. One coder could work on a kernel function, another on a user interface function, a third on an application, and they could be reasonably sure that as long as they all followed the standards, their work would snap together into the growing distribution. The phrase "open source" can serve as a shorthand for this kind of innovation, which we also see in Wikipedia and other community-built projects.
Open source, if we view it through a different lens, is really more about a distributed methodology for software development. The burden of creation is widely distributed across a massive community with more-or-less equal access to tools and systems. In this context, the role of the legal tool is more akin to an enzyme. It was an essential piece of a puzzle, but it was not the only piece. In fact, without the rest of the infrastructure (connectivity, tools, and people) the legal tool on its own would not have led us to GNU/Linux.
Yet far too often the focus on "porting" open source to science focuses on the legal aspects rather than performing an analysis of the infrastructure for science. Science is actually not very similar to modern software at this point. In science, especially life science, many of these factors don't exist. There isn't democratic access to tools. You tend to need a lab, which means you tend to need to work at a place big enough to afford a lab, which tends to mean you need an advanced degree, which means there is no crowd - thus the fundamentals for distributed science development aren't there. And when we try to force open source on a knowledge space that is fundamentally poorly structured for distributed development, we'll not only be frustrated by our failures to replicate the GNU/Linux and Wikipedia successes, we'll risk discrediting the idea of distribution itself.
Another problem: the open source approach, which is based on the open licensing of a powerful, moderately internationally harmonious property right, doesn't really apply very well to science, in which the IP situation is far more often patents v trade secret instead of copyright v copyleft. Copyrights are free to acquire, and thus easy to license at no cost as well. No one's losing an investment they made of $50,000 or more to acquire their copyright when they license code under copyleft. Patents are not so amenable to legal aikido. And they can kill a great idea in the cradle by tying up all the rights in a tangle of patent thickets and expensive licenses.
A third problem is that science is a long, long, long, long, long way from being a modular knowledge construction discipline. Whereas writing code forces the programmer to compile the code, and the standard distribution forces a certain amount of interoperability, scientists typically write up their knowledge as narrative text. It's written for human brains, not silicon compilers. Scientists are taught to think in a reductionist fashion, asking smaller and smaller questions to prove or disprove specific hypotheses. This system almost guarantees that the tasks fail to achieve modularity like software, and also binds scientists through tradition into a culture of writing their knowledge in a word processor rather than a compiler. Until we can achieve something akin to object-orientation in scientific discourse, we're unlikely to see the distributed innovation erupt as it does in culture and code.
A fourth problem is collective action congestion, created by the heavy institutional involvement of research institutions, tech transfer offices, venture capital, startups, and so forth. Software isn't subject to these constraints, at least, not most software. But science is like writing code in the 1950s - if you didn't work at a research institution then, you probably couldn't write code, and if you did, you were stuck with punch cards. Science is in the punch cards stage, and punch cards aren't so easy to turn into GNU/Linux.
None of this is meant to discourage open approaches. We need to try. The problems we face, from neglected diseases to climate change to earthquake analysis to sustainability, are so complex that they'll probably overwhelm any approach that is not inherently distributed. Distributed systems scale much better than non-distributed, closed systems. But we should always understand the foundations, and closely examine our work to see if we need to work on building those foundations.
In the sciences, the first foundation is access to the narrative texts that form the canon of the sciences. Tens of thousands of papers are published a year. They need object-orientation - semantics - so that we can begin to treat that information as a platform, not a consumable product. Licensing is a part of this, but so is technology and scientific culture. Better ontologies, buy-in to technical standards, publisher participation in integration and federation, and more will be foundational to the establishment of content-as-platform. As the data deluge intensifies, this foundation becomes more and more important, as the literature provides the context for the data. Moving to a linked web or semantic web without a powerful knowledge platform at the base is building a castle made of sand - close to the water line.
Another foundation is access to tools and the creation of fundamental open tools. We need the biological equivalent of the C compiler, of Emacs. Stem cells, mice, vectors, plasmids, and more will need to be available outside the old boys' club that dominates modern life sciences. We need access to supercomputers that can run massive simulations for earth sciences and climate sciences. These tools need to be democratized to bring the beginning of distributed knowledge creation into labs, with the efficiencies we know from eBay and Amazon (of course, these tools should perhaps be restricted to authenticated research scientists, so that we don't get garage biologists accidentally creating a super-virus).
The legal aspects weave through these foundations. The license has power to create freedoms but the improper application of a license approach carries significant risks. The "open source" meme can often feel a little religious about licenses, but it's good to remember that the GPL was invented not in the desire to write a license, but in a desire to return programming to a free state. With data and tools, we have the chance to avoid the intellectual property trap completely - if we have the nerve for it.
There is some distributed innovation happening in new fields of science, like DIY biology, and in non-science communities, like patients sharing treatments and outcomes with each other. A quick examination of the foundations reveals they are ripe for distribution: DIY biology can build on OpenWetWare, the Registry of Standard Biological Parts, and the availability of equipment and tools. Patients can connect using Web 2.0 and talk to each other without intermediaries. But this doesn't scale across into traditional science.
I propose that the point of this isn't to replicate "open source" as we know it in software. The point is to create the essential foundations for distributed science so that it can emerge in a form that is locally relevant and globally impactful. We can do this. But we have to be relentless in questioning our assumptions and in discovering the interventions necessary to make this happen. We don't want to wake up in ten years and realize we missed an opportunity by focusing on the software model instead of designing an open system out of which open science might emerge on its own.





Comments
You cite openwetware and the biobricks registry, but if you look closer, openwetware is a wiki, not a website about open source wetware tech. To my knowledge, other than the people over at diybio, there have been no signs of anyone with an understanding of free and open source software infrastructure (not the legalese, but the toolchains) applying the concepts to the world of open source science. Your strikedown of compilers and kernels for the laboratory is premature, IMHO.
- Bryan
Posted by: Bryan Bishop | October 30, 2009 12:13 PM
Hey John,
Great stuff, would love to hear what you think of our work on the hardware side. The "Open Gel Box" project is an initiative to bring biotech equipment into the 21st century. We need innovation in "established" tools to make them intuitive and accessible for anyone who wants to work with DNA. To that end, a group of users from the DIYbio list got together and designed a better, faster gel system than what exists today.
Pearl Biotech is now manufacturing a complete gel electrophoresis system according to the Open Gel Box design. The Pearl Gel Box is available for $199 at http://www.pearlbiotech.com. We're advocating for better equipment on all fronts, such as an Open Thermal Cycler.
Tito
Posted by: Tito | October 30, 2009 12:31 PM
This is simply never gonna happen, because of the huge irreducible expense of maintaining and manipulating these reagents.
Posted by: Comrade PhysioProf | October 31, 2009 9:22 PM
I believe your historical facts are a little skewed. Open Biology perhaps began on the internet back with BIONET, which functioned well through the late 80's and early 90's, until the network apparently failed to grab sufficient interest for funding. (Spammers, as well, took over the network, but this could have been moderated against, with enough funding.) The motivation for "Open Biology" isn't a "new fields of science", it is as old as the field of Biology itself. However the Internet hasn't caught on that well with most scientists, because they are busy with science, not busy learning the latest Internet tools (like blogging). There have been efforts to create biology software repositories (similar to sourceforge.net except for Biology software) and these have largely failed to attract a majority of Bio-scientists too. The amount of collaboration we are blessed with on the Internet has accelerated the conversation between researchers in all fields, and that is the noticeable effect today. It would be great to accelerate this process even further, for example by expanding PLoS, encouraging all scientists to publish their working software (for example, MATLAB scripts) into open source repositories, and encouraging the people-in-the-middle (hobbyists, engineers) to publish in an intermediate form which isn't as strict as a scientific journal yet maintains some level of technological standard and legitimacy -- similar to the Internet RFCs, which started as simple technical memos.
I am currently designing and writing an open source framework for Robotics and equipment control, for Biology/Chemical Engineering automation, which can be seen on CPAN. http://search.cpan.org/dist/Robotics/ To my surprise, this hasn't really been done yet.
Posted by: Jonathan Cline | November 1, 2009 8:53 AM
I definitely agree that there are some difficulties in the sciences as far as distribution is concerned. However, I think that the biggest issue is trust and safety. As a researcher who is forced to live by the law of "publish or perish", I see that a great deal of the resistance to the foundation spoken of here is the "old boys club" mentality and all that it implies. Researchers need money to fuel their inquiry and money is dependent on publications (first-author pubs at that) and well-established practices in the lab. Researchers see the power of the web and the social tools that could make distribution more available, but they balk at the idea of losing "ownership" of their data for fear of being scooped.
The primary legal protection that must be afforded researchers in order to allow the formation of foundations is attribution. If a universal format for attribution and material transfer could be made available (vis-a-vis Science Commons, etc) and widely known/accepted, more researchers would jump on the band wagon. The issue is not that science is so advanced that the masses cannot grasp it, or that the knowledge is unintuitive - it's that access to knowledge is shielded and so is access to the institutions/individuals that produce it. I think scientists would be more open to "open source" if they were simply afforded job security and attribution if they engage in it.
Rick Smith
Twitter: @h2oindio
Posted by: Rick Smith | November 5, 2009 9:14 AM
Dear Mr Wilbanks,
When are you next in London? Would you like to come and meet and have a chat with the digital anthropology students, and our lecturers, at University College London? Perhaps discuss your Science Commons Project over dinner with us?
Salina Christmas
MA Digital Anthropology
University College London
Posted by: Salina Christmas | November 21, 2009 9:59 AM