Panton Principles: Principles for Open Data in Science

Here's what they're about:

The first draft of Panton Principles was written in July 2009 by Peter Murray-Rust, Cameron Neylon, Rufus Pollock and John Wilbanks at the Panton Arms on Panton Street in Cambridge, UK, just down from the Chemistry Faculty where Peter works.

They were then refined with the help of the members of the Open Knowledge Foundation Working Group on Open Data in Science and were officially launched in February 2010.

Here they are:

Science is based on building on, reusing and openly criticising the published body of scientific knowledge.

For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.

By open data in science we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.

Formally, we recommend adopting and acting on the following principles:

  1. Where data or collections of data are published it is critical that they be published with a clear and explicit statement of the wishes and expectations of the publishers with respect to re-use and re-purposing of individual data elements, the whole data collection, and subsets of the collection. This statement should be precise, irrevocable, and based on an appropriate and recognized legal statement in the form of a waiver or license.

    When publishing data make an explicit and robust statement of your wishes.

  2. Many widely recognized licenses are not intended for, and are not appropriate for, data or collections of data. A variety of waivers and licenses that are designed for and appropriate for the treatment of data are described here. Creative Commons licenses (apart from CCZero), GFDL, GPL, BSD, etc are NOT appropriate for data and their use is STRONGLY discouraged.

    Use a recognized waiver or license that is appropriate for data.

  3. The use of licenses which limit commercial re-use or limit the production of derivative works by excluding use for particular purposes or by specific persons or organizations is STRONGLY discouraged. These licenses make it impossible to effectively integrate and re-purpose datasets and prevent commercial activities that could be used to support data preservation.

    If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition - in particular non-commercial and other restrictive clauses should not be used.

  4. Furthermore, in science it is STRONGLY recommended that data, especially where publicly funded, be explicitly placed in the public domain via the use of the Public Domain Dedication and Licence or Creative Commons Zero Waiver. This is in keeping with the public funding of much scientific research and the general ethos of sharing and re-use within the scientific community.

    Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

Authored by:

Peter Murray-Rust, University of Cambridge (UK)
Cameron Neylon, STFC (UK)
Rufus Pollock, Open Knowledge Foundation and University of Cambridge (UK)
John Wilbanks, Science Commons (USA)

With the help of the members of the Open Knowledge Foundation Working Group on Open Data in Science

And you can endorse them here.

(via Chris Leonard on Friendfeed.)

Actually, I don't endorse it. I directly oppose this text as written, and I STRONGLY (sic) discourage its adoption. Creative Commons, GPL or other licenses that restrict commercial use are perfectly appropriate. After all, the purpose of releasing research data is for others to be able to check the data, and to compare it with their own. Accuracy and integrity.

But that has nothing to do with being able to use it for commercial purposes. You can do so just fine with a noncommercial license.

ShareAlike or GPL (to the extent GPL is appropriate for data at all, of course) is a better choice than an unrestricted license; it means everybody dipping into the commons of data has to play along and share their own data as well. After all, why should researchers be the only ones to share their data freely, with nobody else doing so?

Janne, as I understand it (and my understanding is limited to say the least -- please any open data experts feel free to chime in!), the idea is to make data usable and reusable by the entire community, to be able to combine data from different experiments or sets of measurements on top of making sure that results are repeatable.

For example, if someone has to combine a large amount of climate data to perform a meta-analysis, they would be restricted by the least open license of all the data they were using. By making all data public domain, that problem disappears.
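The "least open license" point can be sketched in a few lines of Python. This is only an illustration: the license names and the restrictions attached to them are simplified assumptions, not legal descriptions of any real license, and `combined_restrictions` is a hypothetical helper.

```python
# Illustrative only: these restriction sets are simplified assumptions,
# not legal descriptions of any real license.
LICENSE_RESTRICTIONS = {
    "public-domain": frozenset(),
    "attribution": frozenset({"attribution"}),
    "share-alike": frozenset({"attribution", "share-alike"}),
    "non-commercial": frozenset({"attribution", "non-commercial"}),
}


def combined_restrictions(licenses):
    """Return the union of the restrictions of every dataset in a combined work."""
    restrictions = set()
    for name in licenses:
        restrictions |= LICENSE_RESTRICTIONS[name]
    return restrictions


# 98 public-domain datasets plus two restricted ones: the combined
# analysis inherits every restriction found anywhere in the inputs.
mixed = ["public-domain"] * 98 + ["share-alike", "non-commercial"]
all_pd = ["public-domain"] * 100

print(sorted(combined_restrictions(mixed)))
print(combined_restrictions(all_pd))
```

Under these assumptions, two restricted datasets out of a hundred are enough to restrict the whole meta-analysis, while the all-public-domain combination carries no restrictions at all.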

Janne, I understand where you are coming from and to some extent this is exactly the same place the Open Knowledge Foundation is coming from as well (with the exception of the non-commercial bit). The key point is that this is not at all merely about accuracy and integrity checking; this is about re-use and making sure it is possible. Without recapitulating screeds of existing material, there are a few key points.

GPL/CC licences do not work for data across jurisdictions. They rely on copyright. Data in most places cannot be copyrighted. Where it can be, the rules are inconsistent. Whatever else you do, don't use copyright licences on data, because they will scare off the good guys and the bad guys will simply ignore them because they are unenforceable. You can in principle use contract law to create similar restrictions (and the ODbL does this) but you need to ask yourself whether you want to bring contract law into this space. The consequences might not be what you want.

On share-alike we agreed to disagree, but consider the following situation. You want to combine ethnographic data that is share-alike with health data that can't be released. If you do this, depending on the SA terms, you either aren't allowed to combine them at all, or cannot release any results or derivative data because the privacy issues of the health data trump the SA clause. This gets potentially really, really messy. The best legal minds have been thinking this through and don't agree on the details; they just agree it's complicated.
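A rough sketch of that clash, in Python. The `Dataset` fields and the `release_conflicts` helper are hypothetical, and reducing share-alike and privacy terms to two booleans is a deliberate oversimplification; real terms vary, which is exactly the "depending on the SA terms" caveat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    must_share_derivatives: bool = False  # assumed stand-in for a share-alike clause
    must_not_release: bool = False        # assumed stand-in for a privacy restriction


def release_conflicts(datasets):
    """Return (share-alike, restricted) name pairs whose terms pull in opposite directions."""
    demanding = [d.name for d in datasets if d.must_share_derivatives]
    forbidding = [d.name for d in datasets if d.must_not_release]
    return [(a, b) for a in demanding for b in forbidding]


ethnographic = Dataset("ethnographic", must_share_derivatives=True)
health = Dataset("health", must_not_release=True)
open_survey = Dataset("open_survey")  # no conditions, e.g. public domain

print(release_conflicts([ethnographic, health]))
print(release_conflicts([open_survey, health]))
```

In this toy model the share-alike dataset cannot be combined with the restricted one without a conflict, while the unconditioned dataset can.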

The problem with noncommercial terms is that they split the commons that you want people to dip into. This is not so good for content but hasn't caused major problems because so far content sharing is free. For data it is a disaster because we need to find ways to make it pay for archiving, re-use, care, and indexing. We are losing databases left, right and centre because people won't support them directly. Non-commercial terms for data will lead directly to a situation where there is proprietary (click-wrapped) data with high quality support (but not very much of it and only in particular areas) and a mess of badly supported non-commercial data.

The only way to make this work IMO is to grow the pie by reconstructing a data commons which enables and encourages people to make money because they are confident of their rights to build on data. We need to bring money in by making commercial exploitation viable. This works for weather data in the US and legal data and supports a vibrant ecosystem of tools and data support.

As John says, and John Wilbanks is most articulate on this, exporting restrictions in the name of freedom can lead to serious unintended effects. HapMap had to change its data rules to make the system work. Some genome databases have been taken private despite the best intentions of most of those involved. The public domain protects the legal status of data, protects the ability of people to re-use it, and most of all makes sure that we can build confidently on a solid foundation of useable data. Which is what science is about. If you can't re-use it, it isn't science IMO.

Janne, you couldn't be more wrong. The primary purpose of sharing data is re-use, not fact checking, and copyleft simply does not work for re-use of data because it relies on the wrong law (copyright) and because attribution stacking doesn't scale.

As for "noncommercial", go ahead and define that term in a useful manner -- one that makes clear, say, 90% of test cases ("is this a commercial use or not?"). Creative Commons hasn't been able to do it so I'll be impressed if you can. Unclear licensing requirements do not scale.

Cameron said all that and more, and better, but I thought a short version might be useful.

Releasing data with non-commercial terms does not preclude _also_ selling it for commercial use to those who want it. The researcher owns the data, and while the openly released data can't be used for commercial purpose, the data owner can certainly go beyond that and strike separate deals as well. Nothing precludes dual-licensing.

Here's my beef with requiring commercial use:

A researcher has a data set. A commercial developer has another one. They're not immediately useful by themselves, but if they were combined, a very promising commercial product would be possible.

If the researcher's data was open but non-commercial, the two can meet and come to an agreement of sorts - perhaps they agree to cross-license their data, perhaps they go into business together, perhaps something else.

If the researcher's data is open for commercial use, the developer can take the research data and create the product. The researcher, meanwhile, can not. It introduces a fundamental asymmetry between actors.

The argument against double licensing (or special case licensing) is one of scale. This works fine if you're talking about one dataset but if you want to combine 5000 datasets then writing to each rights holder and creating a special legal agreement with each one just isn't feasible. Also dual licensing causes problems downstream - can this commercial provider then distribute the results of their work? If so how, and under what licence? And can that be mixed with another dataset?

Parenthetically, I personally disagree that the researcher owns the data. The funder has the moral rights over data use in my view, not the researcher. The funder may choose to give those rights to the researcher but I don't see that they are obligated to.

I don't really understand your example though - the idea is to create symmetry. If both allow commercial use then both can use either data set or combine both. I think your argument is that the researcher should be able to use licensing of the data as leverage to force the company to make data available - this might work but if neither can get at the data in the first place how are they going to discover each other? It also assumes that commercial data will be secret. There is increasing evidence that this is not the best way to get a return on your investment in data.

"Parenthetically I personally disagree the researcher owns the data."

That the researcher owns their results was not my opinion; in at least some places (like Sweden) it's a legal fact. The research institution may have the rights to a cut of any profits from it, but the owners are the people who did the work.

Cameron, I am saying that forcing one party to give up their assets without forcing other parties to do the same is asymmetrical. What would stop a commercial party from using the researcher's data without giving them a fair part of the resulting revenue?

"[...]this might work but if neither can get at the data in the first place how are they going to discover each other?"

But the researcher _is_ releasing their data, for others to discover it. The commercial user simply cannot make money off it without entering into negotiation with the researcher who owns it.

To add to discussion and correct some misconceptions:

@Janne: as Cameron points out the Open Knowledge Foundation's general position is one of supporting open data where "open" data includes data made available under licenses with attribution and share-alike clauses, though non-commercial restrictions are definitely not permitted (see http://www.opendefinition.org/ for precise details). The reason for excluding non-commercial is simple: share-alike is compatible with a commons open to everyone but non-commercial is not.

Panton Principles 1-3 are, in essence, saying make data "open" in the sense of http://www.opendefinition.org/. Principle 4 goes beyond this to specifically recommend public-domain only for data related to published science, especially where the work is publicly funded.

The rationale for this "stronger" position, at least for me, was that a) science has existing (very) strong norms for attribution (and, to a lesser extent, share-alike) b) science has strong up-front funding support from society which reduces some of the risks that share-alike addresses.

That said, I should emphasize that, in my view at least, the key feature is that the data be made open -- public domain dedication/licensing is "strongly recommended" but if you end up with an attribution or even share-alike type license that is still far, far better than not making the data available at all, or licensing it under non-commercial or other conditions.

@Bill: I remain completely unconvinced by the attribution stacking argument and I find its logic in this area rather incoherent (we expect attribution to happen even with PD since it's part of the community norms in science; as such, attribution stacking happens with or without a license -- unless attribution actually won't be happening, which is a serious issue ...). I'm also unclear why copyleft does not work for DBs (I agree using CC licenses for it isn't a good idea, but there are others such as the Open Database License (ODbL)). For more detail see earlier posts such as: http://blog.okfn.org/2009/02/02/open-data-openness-and-licensing/ and http://blog.okfn.org/2009/02/09/comments-on-the-science-commons-protoco…

@Cameron: the contract point about the ODbL is, in my view, very minor and is turning into a bit of a misconception so I should correct it. The main "enforcement" mechanism of the SA conditions in the ODbL remains existing IP rights, whether copyright or sui-generis DB rights. Even in the US, where copyright in data(bases) is "weak", some copyright likely exists in most situations -- though of course not phone directories! I'd also point out that CC licenses also operate as contracts, at least in common-law jurisdictions such as the US and the UK, so it's not as if the ODbL is being particularly unusual (though the ODbL is more explicit about this than CC licenses ...)

Lastly, I think it important to emphasize that I don't see Share-Alike as non-commercial or anti-commercial. In the free/open-source software world there is lots of commercial activity around codebases that are GPL'd. Of course, it definitely makes it harder for some commercial users to use the information if they want to use it proprietarily or directly combine it with proprietary data, and it can also cause problems when intermixing with other sets of data with openness restrictions (such as those caused by privacy restrictions). However, at the same time, I would point out that it can also encourage commercial use since commercial participants know their contributions won't be "free-ridden" upon.