Open Source Science? Or Distributed Science?

I was asked in an interview recently about "open source science," and it got me thinking about the ways that, in the "open" communities of practice, we frequently over-simplify the realities of how software like GNU/Linux actually came to be. Open source refers to a software worldview. It's about software development, not a universal truth that can be easily exported. It's well worth unpacking that worldview to understand it, and then looking at the realities of open source software as they map - or, more frequently, do not map - to science.

The foundations of open source software are relatively easy to trace. In the beginning, there was free software and Richard Stallman. RMS didn't just invent the GPL as a legal tool; he wrote crucial foundational software for writing software, notably the GNU Compiler Collection, the GNU Debugger, and the original Emacs. So from the beginning, there was not only a free legal tool, but tools for coding that were better than the alternatives of the time.

Simultaneously, we can see that the emergence of microcomputers and ubiquitous access to the internet expanded the number (and interconnectivity) of potential programmers. Suddenly there were tens of thousands of programmers with computers at home and at work. The explosion of the Web saw the creation of infrastructure like code repositories, version control systems, and coding communities. Thanks to object-orientation, software was also very amenable to being broken into defined, modular chunks and tasks. One coder could work on a kernel function, another on a user interface function, a third on an application, and they could be reasonably sure that as long as they all followed the standards, their work would snap together into the growing distribution. The phrase "open source" can serve as a sort of shorthand for this kind of innovation, which we also see in Wikipedia and other community-built projects.
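
To make "snap together" concrete, here's a minimal sketch - my own toy illustration, with hypothetical names throughout, not code from any real distribution - of how a shared interface lets two strangers' contributions compose:

```python
# A toy illustration (hypothetical names throughout): two contributors who
# never coordinate directly can still have their work "snap together,"
# because both honor the same agreed-upon interface.

from typing import Protocol


class Storage(Protocol):
    """The shared standard: any storage module must offer these two calls."""

    def read(self, key: str) -> str: ...

    def write(self, key: str, value: str) -> None: ...


# Contributor A ships an in-memory implementation of the standard...
class MemoryStorage:
    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def read(self, key: str) -> str:
        return self._data[key]

    def write(self, key: str, value: str) -> None:
        self._data[key] = value


# ...while contributor B writes an application against the interface alone.
# Any conforming module, from any author, plugs in without renegotiation.
def save_greeting(store: Storage) -> str:
    store.write("greeting", "hello, world")
    return store.read("greeting")


print(save_greeting(MemoryStorage()))  # -> hello, world
```

The two contributors never have to meet; the agreed interface does the coordinating for them.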

Open source, if we view it through a different lens, is really more about a distributed methodology for software development. The burden of creation is widely distributed across a massive community with more-or-less equal access to tools and systems. In this context, the role of the legal tool is more akin to an enzyme. It was an essential piece of a puzzle, but it was not the only piece. In fact, without the rest of the infrastructure (connectivity, tools, and people) the legal tool on its own would not have led us to GNU/Linux.

Yet far too often, efforts to "port" open source to science focus on the legal aspects rather than on analyzing the infrastructure science would need. Science is actually not very similar to modern software at this point. In science, especially life science, many of these factors don't exist. There isn't democratic access to tools. You tend to need a lab, which means you tend to need to work at a place big enough to afford a lab, which tends to mean you need an advanced degree, which means there is no crowd - thus the fundamentals for distributed science development aren't there. And when we try to force open source onto a knowledge space that is fundamentally poorly structured for distributed development, we'll not only be frustrated by our failures to replicate the GNU/Linux and Wikipedia successes, we'll risk discrediting the idea of distribution itself.

Another problem: the open source approach is based on the open licensing of a powerful, reasonably well-harmonized international property right - copyright - and it doesn't apply very well to science, where the IP situation is far more often patents versus trade secrets than copyright versus copyleft. Copyrights are free to acquire, and thus easy to license at no cost as well. No one loses a $50,000-or-more investment made to acquire a copyright when they license code under copyleft. Patents are not so amenable to legal aikido, and they can kill a great idea in the cradle by tying up all the rights in a tangle of patent thickets and expensive licenses.

A third problem is that science is a long, long, long, long, long way from being a modular knowledge-construction discipline. Whereas writing code forces the programmer to compile the code, and the standard distribution forces a certain amount of interoperability, scientists typically write up their knowledge as narrative text. It's written for human brains, not silicon compilers. Scientists are taught to think in a reductionist fashion, asking smaller and smaller questions to prove or disprove specific hypotheses. This system almost guarantees that research tasks never achieve software-like modularity, and it binds scientists, through tradition, into a culture of writing their knowledge in a word processor rather than a compiler. Until we can achieve something akin to object-orientation in scientific discourse, we're unlikely to see distributed innovation erupt as it does in culture and code.

A fourth problem is the collective-action congestion created by the institutions that mediate research: universities, tech transfer offices, venture capital, startups, and so forth. Software isn't subject to these constraints - at least, most software isn't. But science is like writing code in the 1950s: if you didn't work at a research institution then, you probably couldn't write code, and if you did, you were stuck with punch cards. Science is in the punch-card stage, and punch cards aren't so easy to turn into GNU/Linux.

None of this is meant to discourage open approaches. We need to try. The problems we face, from neglected diseases to climate change to earthquake analysis to sustainability, are so complex that they'll probably overwhelm any approach that is not inherently distributed. Distributed systems scale much better than closed, non-distributed systems. But we should always understand the foundations, and closely examine our efforts to see whether we first need to build those foundations.

In the sciences, the first foundation is access to the narrative texts that form the canon of the sciences. Tens of thousands of papers are published each year. They need object-orientation - semantics - so that we can begin to treat that information as a platform, not a consumable product. Licensing is a part of this, but so are technology and scientific culture. Better ontologies, buy-in to technical standards, publisher participation in integration and federation, and more will be foundational to establishing content-as-platform. As the data deluge intensifies, this foundation becomes more and more important, because the literature provides the context for the data. Moving to a linked web or semantic web without a powerful knowledge platform at the base is building a castle made of sand - close to the water line.
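
To gesture at what semantics for the literature might look like, here's a toy sketch using the Python rdflib library. Everything in the example.org namespace - the assertsClaim property, GeneX, ProteinY - is an invented placeholder for the real ontologies and identifiers the community would have to standardize:

```python
# A toy sketch of a paper's central claim expressed as machine-readable
# triples rather than narrative prose. All example.org terms are invented
# placeholders; real work would use community ontologies and identifiers.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/vocab/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("ex", EX)

paper = URIRef("http://example.org/papers/123")
claim = URIRef("http://example.org/claims/123-1")

# Bibliographic context for the paper itself...
g.add((paper, RDF.type, DCTERMS.BibliographicResource))
g.add((paper, DCTERMS.title, Literal("GeneX upregulates ProteinY in mice")))

# ...and the claim it asserts, as a structured statement other software
# (and other labs) can query, link to, and build on.
g.add((paper, EX.assertsClaim, claim))
g.add((claim, EX.subject, EX.GeneX))
g.add((claim, EX.predicate, EX.upregulates))
g.add((claim, EX.object, EX.ProteinY))

print(g.serialize(format="turtle"))
```

Once a claim lives in that form, the paper stops being a purely consumable narrative: the literature becomes something you can query and federate, not just read.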

Another foundation is access to tools and the creation of fundamental open tools. We need the biological equivalent of the C compiler, of Emacs. Stem cells, mice, vectors, plasmids, and more will need to be available outside the old boys' club that dominates modern life sciences. We need access to supercomputers that can run massive simulations for the earth and climate sciences. These tools need to be democratized to bring the beginnings of distributed knowledge creation into labs, with the efficiencies we know from eBay and Amazon (of course, these tools should perhaps be restricted to authenticated research scientists, so that we don't get garage biologists accidentally creating a super-virus).

The legal aspects weave through these foundations. The license has power to create freedoms, but the improper application of a licensing approach carries significant risks. The "open source" meme can often feel a little religious about licenses, but it's good to remember that the GPL was invented not out of a desire to write a license, but out of a desire to return programming to a free state. With data and tools, we have the chance to avoid the intellectual property trap completely - if we have the nerve for it.

There is some distributed innovation happening in new fields of science, like DIY biology, and in non-science communities, like patients sharing treatments and outcomes with each other. A quick examination of the foundations reveals they are ripe for distribution: DIY biology can build on OpenWetWare, the Registry of Standard Biological Parts, and the availability of equipment and tools. Patients can connect using Web 2.0 and talk to each other without intermediaries. But this doesn't scale across into traditional science.

I propose that the point of this isn't to replicate "open source" as we know it in software. The point is to create the essential foundations for distributed science so that it can emerge in a form that is locally relevant and globally impactful. We can do this. But we have to be relentless in questioning our assumptions and in discovering the interventions necessary to make this happen. We don't want to wake up in ten years and realize we missed an opportunity by focusing on the software model instead of designing an open system out of which open science might emerge on its own.

You cite openwetware and the biobricks registry, but if you look closer, openwetware is a wiki, not a website about open source wetware tech. To my knowledge, other than the people over at diybio, there have been no signs of anyone with an understanding of free and open source software infrastructure (not the legalese - the toolchains) applying the concepts to the world of open source science. Your strikedown of compilers and kernels for the laboratory is premature, IMHO.

- Bryan

Hey John,
Great stuff, would love to hear what you think of our work on the hardware side. The "Open Gel Box" project is an initiative to bring biotech equipment into the 21st century. We need innovation in "established" tools to make them intuitive and accessible for anyone who wants to work with DNA. To that end, a group of users from the DIYbio list got together and designed a better, faster gel system than what exists today.

Pearl Biotech is now manufacturing a complete gel electrophoresis system according to the Open Gel Box design. The Pearl Gel Box is available for $199 at http://www.pearlbiotech.com. We're advocating for better equipment on all fronts, such as an Open Thermal Cycler.

Tito

Stem cells, mice, vectors, plasmids, and more will need to be available outside the old boys' club that dominates modern life sciences.

This is simply never gonna happen, because of the huge irreducible expense of maintaining and manipulating these reagents.

I believe your historical facts are a little skewed. Open Biology perhaps began on the internet back with BIONET, which functioned well through the late '80s and early '90s, until the network apparently failed to grab sufficient interest for funding. (Spammers, as well, took over the network, but this could have been moderated against, with enough funding.) The motivation for "Open Biology" isn't a "new field of science"; it is as old as the field of Biology itself. However, the Internet hasn't caught on that well with most scientists, because they are busy with science, not busy learning the latest Internet tools (like blogging). There have been efforts to create biology software repositories (similar to sourceforge.net, except for Biology software), and these have largely failed to attract a majority of bio-scientists too.

The amount of collaboration we are blessed with on the Internet has accelerated the conversation between researchers in all fields, and that is the noticeable effect today. It would be great to accelerate this process even further: for example, by expanding PLoS, encouraging all scientists to publish their working software (for example, MATLAB scripts) into open source repositories, and encouraging the people-in-the-middle (hobbyists, engineers) to publish in an intermediate form which isn't as strict as a scientific journal yet maintains some level of technological standard and legitimacy - similar to the Internet RFCs, which started as simple technical memos.

I am currently designing and writing an open source framework for Robotics and equipment control, for Biology/Chemical Engineering automation, which can be seen on CPAN. http://search.cpan.org/dist/Robotics/ To my surprise, this hasn't really been done yet.

I definitely agree that there are some difficulties in the sciences as far as distribution is concerned. However, I think that the biggest issue is trust and safety. As a researcher who is forced to live by the law of "publish or perish", I see that a great deal of the resistance to the foundation spoken of here is the "old boys club" mentality and all that it implies. Researchers need money to fuel their inquiry and money is dependent on publications (first-author pubs at that) and well-established practices in the lab. Researchers see the power of the web and the social tools that could make distribution more available, but they balk at the idea of losing "ownership" of their data for fear of being scooped.

The primary legal protection that must be afforded researchers in order to allow the formation of foundations is attribution. If a universal format for attribution and material transfer could be made available (vis-a-vis Science Commons, etc.) and widely known/accepted, more researchers would jump on the bandwagon. The issue is not that science is so advanced that the masses cannot grasp it, or that the knowledge is unintuitive - it's that access to knowledge is shielded, and so is access to the institutions/individuals that produce it. I think scientists would be more open to "open source" if they were simply afforded job security and attribution when they engage in it.

Rick Smith
Twitter: @h2oindio

By Rick Smith (not verified) on 05 Nov 2009 #permalink

Dear Mr Wilbanks,

When are you next in London? Would you like to come and meet and have a chat with the digital anthropology students, and our lecturers, at University College London? Perhaps discuss your Science Commons Project over dinner with us?

Salina Christmas
MA Digital Anthropology
University College London

Mr Wilbanks -

Your use of the term "object oriented" is not technically correct, and your argument about modularity in software would stand without it.

Further, while I use Emacs every day for many hours, I don't think that the GNU toolchain was technically superior in the early years of free software.

To get to my main point, as somebody who works in software, not science, and is an advocate of free software, not open source, I think that the concept of software freedom would apply better to science than the open source concept. Free software begins with principles about people's freedom to run software, to study and modify software, and to distribute software, either in its original form or modified.

The open source community, in contrast, is more interested in the distributed, open process of software development. While this is important, I think that it follows from the freedoms mentioned above.

I think the concept of freedom is what would apply better to science and knowledge than the concepts of open source, especially if the concept of freedom were carried through logically.

Obviously these freedoms do not necessarily map directly to science or knowledge in general. This sort of thing is more the area of Creative Commons. But it seems to me that if scientists are interested in what they can learn from free/open source software, it would be better to try to determine what the scientific freedoms are, probably starting from some base of academic freedom. I would guess that these would involve something like the freedom to pursue inquiry, the freedom to access the works of others, the freedom to use, expand on, and redistribute others' work, and the freedom to publish the results of one's work.

In some ways these already exist, and of course the open access movement has had a huge effect, but if they were put into practice the way the software freedom movement put its freedoms into practice, I think they could have a dramatic effect.

By Erik Hetzner (not verified) on 25 Nov 2009 #permalink

I disagree; science has been very open, at least dating back to the formation of the Royal Society. If anything, some people in modern science tend to be more selfish and believe that they have something to hide unless people give them $$$. I have told many colleagues that I'd rather publish than patent something because, in my own personal opinion, patenting is a waste of time and money. I want to let people know what I've done and how I've done it, and move on to something else.

Data, methodology, and tools are of great importance to all scientists. I see an awful lot of duplication of work because several groups go hack together their own bit of instrumentation rather than working with a knowledgeable instrumentation group to produce something which works reliably and can be readily duplicated (and easily modified) by others. This sort of thing results in instruments all being one-off gizmos - a rather expensive and anarchic way to go. For example, compare the FluxNet group and their numerous disparate pieces of gear - the whole effort could be far better coordinated - with NASA's solar observation network of the mid-1960s, which was set up in support of human space flight and ultimately the moon landings. FluxNet members have their data in whatever format they please, etc., etc. - that makes it very difficult to work with the data from different sites. We can even compare FluxNet to the contemporary astronomical community - astronomers have highly specialized instrumentation and yet they share data without much difficulty, thanks to good organization and open sharing of the tools. The same goes for people working with satellite imagers.

Now if scientists were better coordinated, especially where the tools really are fairly mundane, then tools could be made much cheaper and far better, especially where all the information needed to reproduce a specific tool is made public and free. The same goes for the software to drive those tools. As for the data, it should all be made public if possible; that is not a trivial task, but it is certainly possible. NASA has done it, the World Meteorological Organization and its members have done it, and various scientific interest groups are doing it. The more information is shared, the better scientists can work and the sooner private industries and consumers can benefit from useful results. Miring everything in patents is very much anti-science.

By MadScientist (not verified) on 11 Dec 2009 #permalink

I think you are using "object oriented programming" to mean something akin to "modularity". It is worth pointing out that the linux kernel, and much of the original GNU user space utilities were written in C which is not object-oriented. Arguably the original unix implementation was open source (in that tapes of the source code were freely shared) and was also written in C. Also, GCC which is cited here and elsewhere as a key enabler to the widespread success of much open source software started with a compiler for the C language with other languages being added later. UNIX, GNU, linux, and a number of user space utilities were very highly modular without being object oriented.

I also cringe a little at the "there is no crowd" line. In the early days of UNIX, and then later GNU and linux there weren't very many people with the programming skills, hardware, and network connectivity necessary to participate in open source development. I'd imagine there are quite a few more scientists in universities with the hardware, network, and knowledge resources necessary for participation in open source science today than there were programmers with the resources to participate in open source software at many times throughout its development. Open source development works very well for small groups, but also scales well as a project accrues participants.

I take your point that the current scientific community is smaller than the current open source development community and that there are particular challenges to doing science in the open. But if the tools, incentives, and culture were available then we would be seeing significant benefits for even small scale collaborations. I happen to be of the view that the incentives and culture are the hard parts.

By Chas Becht (not verified) on 13 Apr 2010 #permalink

Thank you for writing such an interesting piece -- as someone with backgrounds in molecular biology and population genetics, I think it's not too much of an exaggeration to say that the pace of scientific progress will slow dramatically if some way to better standardize and distribute research is not found. You also nailed two of the main problems with this effort: the lack of modularity and the (related) problem that life sciences are still really in their infancy compared with the physical sciences. Gary Pisano, a professor at Harvard Business School, cites both of these issues as major impediments to the development of a self-sustaining biotechnology industry. I wonder if part of the problem isn't that everyone is attacking different problems at the same time -- maybe we need to collectively all go back to hard-core biochemistry to figure out protein-protein interactions before tackling mouse development. But regardless of the solution, I think we can all agree that the present chaos is not helping anyone. A serious problem, and one well worth devoting one's career to tackling.

A couple of things:

1. Much of this discussion is particularly relevant to biotechnology, which really stretches the concept of science towards the technology business. Here, heavy upfront venture capital structuring may be more relevant than academic traditions.

2. Within science, the question of openness seems to be one of universal benefit versus vested interest - the latter being mainly the interests of elite universities in Anglo-Saxon countries, as against the global crowd of cash-strapped and structurally-adjusted students and academics.

By Javier Ruiz (not verified) on 25 Oct 2010 #permalink

I may be a bit late to join this conversation, but I'd love to get your feedback regarding a project I've been developing over the past three years, and through which I hope to create (in the long run) an infrastructure for distributed 'open-source' scientific research.

In brief, what this project (called "The Open Source Science Project") does is allow researchers to maintain public profiles, upload and manage their publication data, and propose research projects seeking investment (in the form of microgrants and/or microloans) from individuals throughout the broader online community. What makes The OSSP unique, IMHO, is that research proposals are literally peer-reviewed (by other registered researchers), and funding decisions are made by a large number of small investors rather than a single grant agency - which should, in theory, skew the IP incentives underlying investment in research (especially basic research) in the direction of increased openness.

Furthermore, researchers whose work receives funding from the OSSP community of investors will maintain research logs that are accessible to their investors (while the study is being conducted) and to the community of OSSP researchers (once a study has been completed). This should help build relationships between investors and researchers, and also give research students a "living" document that could offer future faculty advisors or employers a more comprehensive picture of how a student approaches a research question and confronts the challenges that arise when conducting a research project.

Thus far, in developing The OSSP I have focused on creating a more user-friendly interface through which research information and data can be organized and accessed, and a more open platform through which to encourage members of the non-research community to play a more active role in addressing the questions and challenges that affect us all.

If you have a chance, I'd welcome your feedback regarding this platform:

http://www.theopensourcescienceproject.com