Making data available to others

Yesterday we posted on our strong support for open access publishing of taxpayer-supported research. We are taxpayer-supported scientists (at least our NIH grants are) and we consider our work to be the property of the public, who paid for it. Whenever possible (which is most of the time) we do publish in freely accessible journals. Making data freely accessible is more controversial, but we also support this, perhaps with a reasonable grace period to allow scientists to have priority for data they expended effort to collect and with reasonable safeguards for confidentiality and privacy when human subjects are involved (we are, after all, epidemiologists). Adhering to this principle can be uncomfortable and inconvenient and sometimes aids and abets those with whom we have substantial differences over matters of importance, but a principle is a principle and violating it has to be well justified. We were put in mind of this recently when Queens University Belfast was ordered by British authorities to turn over tree ring data bearing on climate change to a climate change skeptic:

The climate data wars have taken a new turn. A leading British university has been told it must release data on tree rings dating back more than 7000 years to an amateur climate analyst and climate sceptic.

The ruling, which could have important repercussions for environmental research in the UK, comes from the government's deputy information commissioner Graham Smith. In January he caused consternation at the height of the "climategate" affair by criticising the way that the University of East Anglia in Norwich, UK, handled sceptics' requests for data from its Climatic Research Unit.

Now, following a three-year dispute between banker and climate sceptic Doug Keenan and Queens University Belfast, Smith has told the university to hand over to Keenan the results of its 40-year investigation of Irish oak-tree growth rings.

The ruling sends a strong signal that scientists at public institutions such as universities cannot claim their data is their or their university's private property. (Fred Pearce, New Scientist)

I am not in the least skeptical about anthropogenic global warming and have no particular sympathy for those who are. That's only the backdrop to this problem, however. I can see both sides of this dispute, but it does seem that scientific data used to justify important conclusions, either for policy or science or both, should be open to public inspection and verification. That's a reasonable principle. I can imagine there would be instances when it shouldn't hold, but this doesn't seem to me to be one of them. Since this is an administrative ruling, it isn't clear that it creates any kind of precedent in Common Law (although I am not a lawyer in either the US or the UK). If it were precedent setting, this is the kind of case where we might rightly speak of "bad facts making bad law." I don't know what grounds scientists have for not sharing their data, even if they believe the person they are giving it to is up to mischief.

Instead of giving it to this person as an individual, why not just put it up on the internet for all to see? I think that should be the default, whether it is a controversial topic or not.

I'm not as happy to share data as you are, Revere, especially if the time frame for exclusivity is, as has been suggested, about 6 months. Suppose I have spent many years collecting data and am just getting into the analysis of it, and it is a complex data set which might yield 5-6 papers. I get one or two out (taking into account getting my multiple authors on multiple continents to make their contributions, sign off, etc.) and meanwhile I have to release the data before I have gotten to those other 2-3 papers. And someone comes along and scoops up the data, analyzes it (kind of) and gets it accepted into a journal, perhaps while I am busy getting approval from those same multiple sites. I know of someone who stole (yes I think that is the right word) a CDC-produced text on epidemiology which couldn't be copyrighted and then published it as if it were his own, with "acknowledgment" of the original CDC authors. I can imagine a whole mini-industry of secondary data grabbers. What restrictions would apply to them? Would these "authors" get priority in publication of data? What if a journal received two papers - one from the original authors and another from the secondary analysts? And what if they came to different conclusions? This is a minefield. I'm totally on board with open access for publications, but data, especially if the window is 6 months or anything less than, say, 2 years? That's another matter.

I see your point suzyf, but still think I'd come down on revere's side on this one. There would have to be some consideration towards allowing you first crack at your own data, and maybe 6 months is too short, but on the whole I think the benefits of publicly available data outweigh the disadvantages.

I don't think secondary use of data is necessarily a bad thing. If anything, further scrutiny and re-analysis of data may catch errors or provide alternate ways of interpreting a data set. It would also make meta-analysis of data from multiple studies easier.

I agree with suzyf on this one. I'm all for public access papers, but not public access data. At least not on a time frame of 6 months.

I'd say it's better to let the researcher hold onto the data for 6 years.

That said, researchers should reveal data as soon as they can.

As a British taxpayer, the data above belongs to me, and my fellow taxpayers, period.

Unless there is a clear reason otherwise, eg. national security, data should be rapidly made publicly available as Revere suggests.

I am currently in a position where I would like to do some research of my own, but not in a position to find funding. So I work for a living instead, and pay taxes, some of which goes on scientific research.

If researchers want to keep control of 'their' data, there is a simple solution, they fund their own research, and don't expect me to pay for their food, mortgages, etc.

The arrogance of Mike Baillie and Queens University is truly breathtaking.

I am glad to see that everyone agrees that data should be available after a reasonable period.

Whether that is 6 months, 2 years or even 5 I think it would be a great step forward and a necessary step to continue the opening up of the scientific endeavour.

Perhaps the data exclusivity period could be related to the length of the data collection period, ie 50% with a minimum of 18 months and a maximum of 5 years?

By felix.oxley (not verified) on 21 Apr 2010 #permalink

kagiso, who owns data is determined by copyright law, and the law says that the creator of the data owns it. It has never been the law that the taxpayer owns government property; the Government owns government property, and the Government is a different entity from the taxpayer (just as shareholders in a company do not own the property of the company; the company owns its property).

As an engineer (almost a scientist, I know ...) I think that if you are writing a paper and you want people to take it seriously, or even more, to act on the conclusions of that paper, you should provide your data with it.
I don't really care that you MIGHT get 4-5 other research papers from the same data and you want to keep it a secret. You can do that and release all the papers at once with the associated data. If you don't release data, don't expect me (or others like me) to take you seriously.
I may be biased by my profession, but dealing with people who want to sell me the next generation of equipment that will solve all my problems, and who expect me to believe them without any supporting data, got me to this attitude.

Found an international organization to estimate research, software, music, and art contributions and reward them accordingly. Money comes from member countries according to their estimated benefit from the contributions, who in turn collect the money from citizens and companies according to their estimated benefit from the current and future research.

You are aware, aren't you (aren't you??) that providing access to research data, even publicly funded data, is not necessarily up to the academics here in the UK -- it is up to the funding body of any given research, including when this is the government itself. And, in fact, here in the UK access to climate data is only available for payment of a fee to the government, it is not available for free to anyone but the actual research scientists who worked on it, and they are not allowed to pass it on for free to anyone else, even in response to an FOI request. Furthermore, this somewhat shortsighted policy has resulted in other governments (especially European ones) setting similar restrictions for access to public data.

This was NOTED in the inquiry into the CRU emails.

There are additional issues of whether or not access to research data is "legitimate", but I wanted to make sure that this is more widely known. The fact that not all data are made freely available to the public is not down to the academics being vilified for it.

This is true for more fields than climate; it is a general policy. My husband is a sociologist, and has had issues in the past with not being able to release statistical data collected on access to social and medical services in a local-government funded study, because the local government was uninterested in allowing that data to be freely shared. This is a problem of the UK in general.

For all the people squawking about how evil and obstructive the researchers are, and how you have a "right" to this data, it would really help if you did a little fact-checking before launching the righteous outrage.

By Luna_the_cat (not verified) on 22 Apr 2010 #permalink

Luna: The legal standing/requirements either in the UK or the US were not at all the point. While this is not settled law in the US, most researchers consider the data they/we generate to be private property, and it may well be according to some court. I was speaking as a scientist and was expressing the view that if you want people to take your conclusions and interpretations seriously (and all scientists want that) then you have to be prepared, even willing, to share your data. If the grounds for not sharing are, "I am not permitted to share by my funder," that's an adequate response for the scientist (I didn't accuse any scientist of acting illegally or unethically) and I left it open that there might be an adequate response. But it is beside the point of what the principle should be, which was our focus.

Let me clarify - I'm not against sharing data, but I think that if I and my collaborators have spent a long time generating the data, it's only fair that we (and our funding agency/agencies) have the first crack at analyzing and interpreting it. And a first crack will take more than 6 months. I am talking about complex datasets, where one paper cannot contain all the relevant findings and conclusions. I work more towards the social and economic side of things, so often the social and economic context is quite important. Was there a drought that year that caused people to sell off their assets and be poorer in general? Was there an election? Did a new hospital open up so people changed their health seeking behavior? Did treatment policy or drug prices change during the time of the study? And so on. A casual "data miner" would not have any of that background.

One further wrinkle. I usually work outside of the US where the data ownership is shared with the country in which it is collected. They have rights too.

And a word for Kagiso. So you'd like to do research but are too busy or whatever to find yourself some funding.... So maybe you'd like to have a crack at some of the data I and my colleagues have collected? Is that it? The data that we sweated over, spent weekends and late nights developing into a project proposal, then spent years in the field carrying out the project, and now that we have the data, it's somehow "yours" for the taking? I don't think so.... You can have a look when we've finished analysing and writing it up, thanks. Happy to share.

The public health agency that I work for in the US does put all or virtually all the data on the internet for public use as a matter of policy. The results are complex. The data get much wider use and are exploited more successfully, which is good. There is a lot of "semi-duplicate" publication that happens, when one research group uses the data for a very similar but not identical purpose as a different research group. Outside users themselves sometimes develop curious "ownership" attitudes towards the data, feeling that they are experts in the use of a particular data set with almost more rights and knowledge than the original data collectors. Some people make almost their whole careers out of using public use data. Another thing that happens is that outside users misunderstand the data sometimes and make errors that are hard to correct. On the principle of "bad research drives out good" let's say that two groups have a similar project in mind. The one that does a sloppy quick job and gets it published rapidly may preclude publication of a better and more thoughtful analysis on the same topic. Preparation and documentation of data for public use is difficult and time-consuming. This is definitely the wave of the future but will come with the same problems as we already see on the internet, with multiple sources of information, some more and some less reliable. Some large epidemiologic research groups have developed approaches to "sharing" which enables them to still keep control over what gets done with their data, so that alternative analyses are not going to appear unless the group permits it. And with enough variables, probably any one analysis could be undermined by a diligent and well-funded researcher finding some way to exaggerate or diminish the effects found. Not a straightforward set of issues.

Revere: That's fair enough, and I suspect I'm on something of a hair trigger about it because the media here are full of kagiso's ilk baying for the blood of "obstructionist" academics.

But even so, I have to agree with suzyf and Carmela on certain issues: 6 months only to get first crack at complex data is WAY(!) too short; and I have observed "the one that does a sloppy quick job and gets it published rapidly may preclude publication of a better and more thoughtful analysis on the same topic" on multiple occasions. There needs to be a fair window for researchers to use their own data for original work, and when that data makes its way into the public arena (notice that I do say "when", not "if") it may be necessary that it still be managed and regulated to some degree -- especially on contentious and complex issues. I don't think I favour only allowing alternative analyses with the approval of the original group, but without some channeling and oversight to ensure that the data are primarily available to researchers acting in good faith and are dealt with in a legitimate fashion, the potential for misuse is huge -- even as things stand, we've seen "analysts" NOT acting in good faith attempting to discredit valid research by hard-to-spot distortion of available data.

By Luna_the_cat (not verified) on 22 Apr 2010 #permalink

Luna: The 6 months came from our friend suzyf (yes, we are indeed longtime friends), not from me. I don't have a fixed time period and I think what we are talking about is not all data but "published" data, i.e., data underlying published works and if there is a clock it shouldn't start to run until publication. But these are details, although I concede details can be crucial. The principle, however, is what I was discussing. I'll say it again. If you want people to take your science seriously, you have to be prepared to show your data. And I don't consider scientific data financed by the public to be private property. I don't consider scientific data in general to be private property but when financed by private means it is what it is. But public financing is another argument against privatizing science, which by its nature is intersubjective and public. At least that's our view.

Luna seems to assume that it's the original researchers who are acting in good faith and the secondary users who may be trying to discredit valid analyses. But remember that the opposite may also be the case - the original researchers have some kind of agenda and are trying to cover something up, and if the data were available to secondary users, they might be able to show some problem with the original analyses.

@Carmela -- sorry, in my statements, I was also responding to yours, to wit: "Some large epidemiologic research groups have developed approaches to "sharing" which enables them to still keep control over what gets done with their data, so that alternative analyses are not going to appear unless the group permits it." I had taken this to imply approval of this as a model. Let me make this clear -- I don't, at least at first glance. I do not think that the owner should have the right to approve or deny publication of other analyses. THAT would definitely lend itself to abuses as well, including where, as you say, the original researchers may not be acting in good faith.

However, having said that, in the climate arena specifically I see a lot of non-climatologist "analysts" who have a specific goal of the subtle distortion of data in order to cast doubt on work which I believe is done in good faith (that is, with the purpose of getting a better idea of what the reality is). This isn't so much my assumption as simply a description of what I've seen, famously starting with the "wars" between Mann and McIntyre.

@revere -- in principle, I do actually agree with you. All data should be available. I somehow missed that the 6 mos. was suzyf's suggestion. (I still think that is too short a time period.) In practice, however, I can see wrinkles with this. One of the wrinkles is still the fact that the people and countries who do the physical work to gather the data often place long-term restrictions on access to that data in order to recoup investment, and I honestly don't know what will change that worldwide, since recouping investment is a legitimate argument.

By Luna_the_cat (not verified) on 22 Apr 2010 #permalink

One practical difficulty with implementing a data-sharing policy is that different fields (astronomy, genomics, climate research, etc.) may well have different "natural" timescales, depending on how long it takes to do experiments and so forth.
