Should researchers share data?

A colleague of mine (who has time to read actual printed-on-paper newspapers in the morning) pointed me toward an essay by Andrew Vickers in the New York Times (22 January 2008) wondering why cancer researchers are so unwilling to share their data. Here's Vickers' point of entry to the issue:

[A]s a statistician who designs and analyzes cancer studies, I regularly ask other researchers to provide additional information or raw data. Sometimes I want to use the data to test out a new idea or method of statistical analysis. And knowing exactly what happened in past studies can help me design better research for the future. Occasionally, however, there are statistical analyses I could run that might make an immediate and important impact on the lives of cancer patients.

You'd think cancer researchers would welcome collaborations with statisticians, since statisticians are the ones trained to work out what the data show, and with what confidence. Moreover, once data have been collected, you'd think cancer researchers would want to make sure that the maximum amount of knowledge be squeezed out of them -- bringing us all closer to understanding the phenomena they're studying.

As Vickers tells it, cancer researchers seem to have other concerns they find more pressing, since his requests for data and other sorts of information are often refused. Among the reasons the researchers give to keep their data to themselves:

  • Data sharing would violate patient privacy.
  • It would be too difficult to put together a data set.
  • The analysis of the data might "cast doubt" on the results of the researchers who collected the data.
  • The person asking for the data might use "invalid methods" to analyze it.
  • The researchers being asked for the data "might consider a similar analysis at some point in the future".

Vickers responds to these:

  • It's possible (and not even very hard) to replace patient names with codes to protect patient privacy (a minimal sketch of one approach appears after this list).
  • One usually has to put together a data set in order to publish one's results -- so why would sharing data after you've published require putting together another data set?
  • It's a statistician's job to recognize valid and invalid methods for data analysis, and the scientific community would certainly weigh in with its judgment in case the statistician made a bad call.
  • The researchers who said they might want to run a further analysis of their data similar to the one Vickers proposed haven't done so, years later.
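
Here is that minimal sketch of one common approach to the first point -- replacing names with codes produced by keyed hashing. The record fields and salt handling are illustrative assumptions, not anything Vickers describes, and (as commenters point out below) pseudonymization alone does not rule out re-identification from the remaining fields.

```python
# Minimal sketch: deterministic pseudonymization of patient identifiers.
# The record fields and salt-handling scheme are illustrative assumptions.
import hashlib
import hmac

SECRET_SALT = b"held-only-by-the-data-owner"  # never shipped with the shared data

def pseudonymize(name: str) -> str:
    """Map a patient name to a stable, non-reversible code."""
    digest = hmac.new(SECRET_SALT, name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

records = [
    {"name": "Jane Doe", "age": 54, "tumor_stage": "II"},
    {"name": "John Roe", "age": 61, "tumor_stage": "III"},
]

# The same name always maps to the same code, so longitudinal records stay
# linked, but the mapping cannot be inverted without the secret salt.
shared = [{**r, "name": pseudonymize(r["name"])} for r in records]
```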

As for whether a further analysis could cast doubt on researchers' results, I would have thought this falls pretty squarely into the "self-correcting nature of science" bin -- which is to say, job one is to get results that are accurate reflections of the phenomena you're studying. If your initial results don't get the reality quite right, shouldn't you welcome a reanalysis that closes that gap?

(Of course, I suppose there are cases in which the worry is that the reanalysis will cast doubt on one's claim actually to have performed the research one has reported. This is another sort of situation where science is touted as being self-correcting -- and where clinging too tightly to one's data might be a clue to a bad conscience.)

The Vickers essay makes the case that, especially in cancer research, real cancer patients may be the ones most hurt by researchers' territoriality with their data. The results that might be drawn from the data are not mere abstractions, but could hold the key to better treatments or even cures.

Are there sound scientific reasons not to share such data with other researchers?

In a post of yore, I wrote:

[W]e want to make sure that the conclusions we draw from the data we get are as unbiased as possible. Looking at data can sometimes feel like looking at clouds ("Look! A ducky!"), but scientists want to figure out what the data tells us about the phenomenon -- not about the ways we're predisposed to see that phenomenon. In order to ensure that the researchers (and patients) are not too influenced by their hunches, you make the clinical trial double-blind: while the study is underway and the data is being collected, neither study participants nor researchers know which participants are in the control group and which are in the treatment group. And, at the end of it all, rather than just giving an impression of what the data means, the researchers turn to statistical analyses to work up the data. These analyses, when properly applied, give some context to the result -- what's the chance that the effect we saw (or the effect we didn't see) can be attributed to random chance or sampling error rather than its being an accurate reflection of the phenomenon under study?

The statistical analyses you intend to use point to the sample size you need to examine to achieve the desired confidence in your result. It's also likely that statistical considerations play a part in deciding the proper duration of the study (which, of course, will have some effect on setting cut-off dates for data collection). For the purposes of clean statistical analyses, you have to specify your hypothesis (and the protocol you will use to explore it) up front, and you can't use the data you've collected to support the post hoc hypotheses that may occur to you as you look at the data -- to examine these hypotheses, you have to set up brand new studies.
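
To make the sample-size point in that passage concrete: a power analysis run before the trial -- given the statistical test you intend to use, the effect size you expect, and the confidence you want -- tells you how many participants to enroll. A minimal sketch follows; the effect size, alpha, and power are arbitrary illustrative values, not from any real trial.

```python
# Minimal sketch of a pre-study power analysis; the inputs below are
# arbitrary illustrative values, not from any real trial.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # expected standardized difference between arms
    alpha=0.05,       # tolerated false-positive rate
    power=0.8,        # desired chance of detecting a real effect
)
print(f"participants needed per arm: {n_per_group:.0f}")  # roughly 64 for these inputs
```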

I've added the bold emphasis to highlight the official objection I've heard to data mining -- namely, that using data to test post hoc hypotheses is to be avoided. But, presumably, since this is the kind of objection raised by statisticians, a statistician ought to be able to make a reasonable determination about what kinds of hypotheses are properly testable and what kinds of hypotheses are not properly testable given a particular data set generated with a particular experimental protocol. Besides, Vickers says that in large part, he'd be using the requested data sets to test-drive new methods and to plan better protocols for future experiments -- so these are cases where the data would be used to test hypotheses about methodology rather than hypotheses about cancer treatments.

And surely, there seem to be other good reasons to lean towards sharing data rather than not. For one thing, there's that whole norm of "communism", the commitment scientists are supposed to have that knowledge is a shared resource of the scientific community. (The norm of organized skepticism might also make sharing of data, rather than secrecy about data -- at least after you've published your own conclusions about them -- the natural default position of the community.) For another thing, the funders of the research -- whether the federal government, a private foundation, or even a pharmaceutical company -- presumably funded it because they have an interest in coming up with better treatments. Arguably, this happens faster and better with freer communication of results and better coordination of the efforts of researchers.

And then there are the human subjects of the research. They have undertaken a risk by participating in the research. That risk is supposed to be offset by the benefits of the knowledge gained from the research. If that data sits unanalyzed, the benefits of the research are decreased and the risks undertaken by the human subjects are harder to justify. Moreover, to the extent that sitting on data instead of sharing it requires other researchers to go out and get more data of their own, this means that more human subjects are exposed to risk than might be necessary to answer the scientific questions posed in the research.

As Vickers notes, though, the researchers' proprietary attitude toward their data is not mysterious given the context in which their careers are judged:

[T]he real issue here has more to do with status and career than with any loftier considerations. Scientists don't want to be scooped by their own data, or have someone else challenge their conclusions with a new analysis.

If sharing your data means you give another scientist an edge over you in the battle for scarce resources (positions, funding, high impact publications, etc.), of course you're going to think it's a bad idea to share your data. To do otherwise is to put yourself in a position where you might not get to do science at all -- and then what good will you be to the cancer patients your research might help?

But the scientific ecosystem hasn't been this way forever. Maybe instances where what's good for a scientist's career runs counter to what's good for the people the research is supposed to help are reason enough to consider how that scientific ecosystem could be different.

Comments

Vickers' answers to some of the complaints are a bit too simplistic. There really are patient privacy issues. Just removing patient names means you still know that patient X, who is age Y, gender Z, and gene profile Q (including information about skin pigmentation), visited a hospital for tests on a list of specific days. That is not anonymous, and some of the relevant data is also what makes it not-anonymous. There are ways around this, but before data is transmitted out of a hospital, it takes a nontrivial amount of work.

Also, while in a fantasy world we all keep our data in perfect cleanliness with all ancillary information in files, this is usually not reality. The author may be able to understand everything, but it does take work to make it understandable to others. Yes, all data should be kept in a form that is easily understood by others, but that is often not done. I've given data and used others' data. Even in the best cases, I'm using hours to days of that person's time figuring out everything that was done in every step of data collection.

As for statisticians' jobs, I assume Vickers is a respected statistician at a top institution and can be trusted, but if you get emails from people you don't know at lesser-known institutions, you might want a better understanding of what will be tested with the data.

The future analysis excuse is inexcusable. Unless it is in one's short term plans, if someone else wants to do something and put your name on a paper, there's little reason to complain.

Despite my negativity here, I think sharing should be the goal, but sometimes the details really are challenging.

Don't most publications and grants have a clause requiring that, if data have been published in a paper, the authors must share them with the public?

in a fantasy world we all keep our data in perfect cleanliness with all ancillary information in files, this is usually not reality

I know what you mean, and am surely guilty of infringements here myself -- but really, if you are not keeping your data in order then you're not doing the job you're being paid for. If there were a strong community norm that supported data sharing, there would also be a much higher regard for good record-keeping.

I'm not claiming we should all keep 100% perfect records, but surely there's a minimum standard that won't require "hours to days" of someone's time just to figure out what they did. Also, that's presumably work they'll have to do anyway if ever they want to publish in connection with that dataset. "Hours" doesn't sound so bad, but "days" -- do you really trust the data at the end of that?

The data giver has hundreds of gigabytes of data on several computer systems in several locations. They have a single file that says where all the data is stored, but to transfer the data, it all needs to be merged into one location. In addition, each person has a list of information about precisely what was run. That information is hard to interpret without the site-specific software that is used to collect the data. Thus, internal sharing might be easy, but to share externally would also require gathering all the software and scripts used to process the data.

Sure, this is all possible and is often done, but the threshold for being willing to do it is higher than some random person coming by and asking for the data. If the person has a full proposal about what will be done and promises to include you in the discussion, and possibly as an author if appropriate, then the motive to help out is much higher.

And yes, I've given data to everyone who has asked for parts of my data, but no one has asked for an entire data set yet.

but really, if you are not keeping your data in order then you're not doing the job you're being paid for.

after I got done ROFLMAO, I realized there is a serious point here which needs to be addressed, because of Janet's comments as well.

Taking the tone you've both taken here is just as corrosive to the scientific enterprise as false accusations of fraud and muttering around the watercooler about "not trusting" someone's data. As Jon Stewart said to bow-tie boy, just stop it.

sorry, but this is the way it is in real science. the obligation is that the whole data set can be reconstructed from the bottom up (for what 7 years past collection or something?) if necessary. the obligation is not that one's data should be available all nicely notated, ready for instant emailing and obvious for any comer to interpret. I just don't see why this should be the case unless you are suggesting that fraud is so bad in science that this step is necessary. I don't see it. The problem, such as it is, is just not this big.

the burden, however, would be tremendous. bsci lays out one scenario. other scientists have other scenarios by which the data that make it into one paper or another would take a long time to assemble and notate into a format that anyone could understand without the original scientist present. multiply this by the number of studies collected over years of work and you are into nightmare territory. scientific output would grind to a halt.

and that's even before we get to the worst part. being a scientist requires making decisions about what is "good" and "bad" data. How to analyze, how to interpret, how to categorize, how to design. the list goes on and on. you all have been around internet discussions where the placement of a comma turns into a long and healthy debate, have you not? do you want this run-of-the-mill attention paid to every research program? every freakin' experiment you do?

saaaay bill, you claim the antibody was bad here -- how do you know that? what other experiments did you do? why did you do it this way? oh, someone else in the lab advised you? based on what data? I want to see it all. what? you threw this out cause it was screwed up and you thought the evil-postdoc sneezed in your experiments? can you document that?

...and I'm just getting warmed up. The point being that it is stupid to take this adversarial approach to the future conduct of science unless we have damn good reason to do so. and if we do not have evidence of wide-spread wrongdoing, these types of insinuations advanced in the OP and some comments are corrosive at best.

I'm not sure where I'm making an insinuation in the post (aside from the parenthetical note that people who are making up their data rather than collecting it would obviously not want to share their data -- which was not meant to be a claim that this is at all a common thing for scientists to be doing!!).

I've said before that the Mertonian norms are ideals that conditions on the ground make it quite hard to live up to. Indeed, in the last two paragraphs of the post, I note that the realities of what a scientist has to look out for (funding, publications, etc.) are such that good and honest scientists have perfectly rational reasons not to share data when asked.

So why do we have to assume those realities are how it's always going to be? Why, if shown good reasons to think these realities have some significant down-sides, can't we discuss different ways to structure scientific activity? Surely just considering changes isn't corrosive, is it?

Surely just considering changes isn't corrosive, is it?

It can be. It is very difficult to assert that you want things to be "better" than they are without implying that there is some problem or deficiency with the status quo. The question of to what degree a particular audience might feel "science is great but look how much better it could be" versus "something is rotten in Denmark, how can we stop this horrible fraud" when faced with such "consideration" is an empirical one, I realize.

To my eye, Bill's approach is a little too much like the fascist approach to state violation of individual liberty. As in, "if you aren't doing anything wrong, why do you care if the gov is recording your emails and phone calls?". The response is similar in that the repressive measures put in place to catch a few "evil doers" may have costs for non-evil-doers that are unreasonably high.

I'm in favor of making public original data so the people funding the project get the most bang for their buck. If you're going to spend big bucks and lots of time and supplies and equipment, the scientific enterprise as a whole (not just a particular research group) should be able to reap maximum benefit.

Most newer astronomical data is public. There's usually a one-year period during which the data can only be accessed by the PI on the original proposal, but after that it's fair game for anyone (not just scientists). If you're interested, check out http://www.sdss.org/science/, https://archive.nrao.edu/archive/e2earchive.jsp, http://archive.stsci.edu/index.html

While I can understand patient privacy concerns, getting large amounts of data out there in an accessible format is a solved problem. The problem only becomes hard when you're getting TBs of data a second.

From the pharma side: bsci's first paragraph nails it. It's never clear, especially in a study involving multiple countries with different privacy laws, exactly what needs to be elided from data to make them clean. It certainly involves more than just replacing names (which are never in the analyzed datasets anyway, in my experience) with codes, and it's telling that Vickers doesn't know that.

From the point of view of the lawyer or privacy specialist who has to approve the transfer, it's a no-brainer. Downside: the possibility of scandal, lawsuits and enraged clinicians and regulators. Upside: helping some random guy's career.

As for "...really, if you are not keeping your data in order then you're not doing the job you're being paid for" -- submittable data do need to be kept so that the FDA or EMEA can walk in and demand immediately interpretable data. It takes an army of people devoted to that task and imposes considerable overhead on everyone else.

But, presumably, since this is the kind of objection raised by statisticians, a statistician ought to be able to make a reasonable determination about what kinds of hypotheses are properly testable and what kinds of hypotheses are not properly testable given a particular data set generated with a particular experimental protocol.

Regrettably, there's often a gap between what people presumably should be trusted to do and what they're liable to do!

I also want to note that, in my subfield of functional brain imaging, there are several serious efforts to design quality database systems that everyone can use to keep track of their own data and study designs and make it easier to share data. There was even a public data repository where a specific journal required researchers who published data in that journal to submit their data. That repository eventually collapsed because it was too time consuming to enter data in an acceptable format.

The NIH has invested a non-trivial amount of money in this for my field. That said, this is really challenging, and I've yet to see a system that makes data sharing easy or that even reaches a quality level where putting my data in such a system would benefit my own work.

I wanted to add this point to make clear that I respect the need for meticulous data organization, but if multiple R01 or facility grants can't produce even a model system that works for clean data storage, it is very time consuming to make sure every scientist always has every data point in an easily shared format.

No idea how things are run specifically in the US cancer research field, but a colleague was doing some studies on behaviour patterns of people with various developmental disorders (autism, Asperger's, ADHD, and so on), and from her work I gathered that the subjects specifically agreed to participate in that study. If anybody else wanted the data -- or even if she got the idea of "reusing" the data for some other purpose -- she'd have to get new permission from each participant and check in with an ethics review board to get the reuse cleared as well.

Posters above are right about the data as well; there's a world of difference between basically having the data accessible if or when you need to check a particular point for yourself and having the data nicely collated, annotated, and converted into a standard format that other people can pick up and use without needing the source to the one-off scripts and Matlab functions you've used to run the data for the project (code that, depending on who wrote it and what license it got, may need permission from the writer to send off to another researcher or lab).

All of this is solvable. But it would take a good deal of time and effort on the part of the researcher. Just getting permission to share the data will likely involve hours of time and paperwork, and if permission is needed from the participants it can stretch into weeks of occasional activity, playing phone tag with people and so on. Collating, converting, de-identifying and checking the data will easily eat up a few workdays.

Is Vickers ready to use part of his research grant to compensate the lab for the not inconsequential number of hours he expects them to work on behalf of his project rather than their own? Or is he offering co-authorship on the resulting papers, making them part of the project?

Astronomers have developed FITS (Flexible Image Transport System) to deal with the image format question: http://heasarc.nasa.gov/docs/heasarc/fits.html . With FITS, I can read in 30 year old data and compare it to something taken yesterday (sometimes with a bit of tweaking).
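
To illustrate why a stable, self-describing format matters so much for sharing, here is a minimal sketch of reading a FITS file with the widely used astropy library (the filename is a placeholder); the same few lines work whether the file was written last week or decades ago.

```python
# Minimal sketch: reading a FITS file with astropy; the filename is a placeholder.
from astropy.io import fits

with fits.open("observation.fits") as hdul:
    hdul.info()               # list the header/data units in the file
    header = hdul[0].header   # self-describing metadata travels with the data
    data = hdul[0].data       # image or table data as a numpy array
    print(header.get("DATE-OBS"), None if data is None else data.shape)
```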

Many papers in ADS (the main astronomy search engine) have on-line data connected to the paper, e.g., http://adsabs.harvard.edu/abs/2000AJ....119.2866B . If you click on the On-Line Data link, you can get the original data tables from this paper from VizieR in a standardized electronic machine readable format.

We may not have the funding of some of the big biomedical projects, but we managed to get it done anyways.

In short, I don't think it's a question of whether it's technically possible or impossible to share data, but whether the data should be shared (i.e., there are privacy concerns).

However did "keeping your data in order" (my actual words) become a brownshirted bogeyman with his shiny black boots on the throats of poor oppressed scientists everywhere?

I take the points above about difficult datasets, and particularly patient confidentiality. But for every such dataset, there are likely many others that are simply no big deal to share. Pace DrugMonkey, I put forward no opinion in my first post about whether or not most scientists do keep their data properly. In fact, if I had to guess I'd say that most scientists could share most of their data accurately and fairly readily.

Janet's actual question was "Should researchers share data?". The answer according to me is, yes. I'm assuming we're talking about gummint-funded research being shared between gummint-funded labs here, so the obligation to maximize return on investment seems like a no-brainer. And if you're going to share data, you need to keep your records in good order; what's so corrosive about that?

All that other stuff about fraud, that's between DrugMonkey and his shrink. Or maybe his lawyer, I dunno.

But just to return a favor, I'll say that DrugMonkey's tone here smacks heavily of the paternalistic, "scientist-knows-best" attitude that says the public should just fork over the money and then shut up and do as they're told. There's also more than a hint in there of "la la la I've got mine la la la I don't want to know about any problems in the system la la la".

...and from her work I gathered that the subjects specifically agreed to participate in that study.

Exactly! Janet's argument that patients deserve to have the value of their contributions maximized may well be the way most of them feel. But the system is designed to err on the side of privacy and consent; the value of reuse of data can't offset that.

Most newer astronomical data is public. There's usually a one-year period during which the data can only be accessed by the PI on the original proposal, but after that it's fair game for anyone (not just scientists).

That (with limitations to address privacy concerns) is how NIH-funded whole-genome analyses are now expected to work, to give one example. That's a situation analogous to astronomy data, with enormous, expensively obtained, difficult datasets. No one disputes that it's possible; the question is when it's a cost-effective use of people's time.

Bill,
I take the points above about difficult datasets, and particularly patient confidentiality. But for every such dataset, there are likely many others that are simply no big deal to share.

I agree with that, but to finish your point, you need to show that there is a non-trivial number of researchers who could easily share their data but don't. The original op-ed gives a couple of half-anecdotes with minimal information. Have you ever seen cases where people refused to share easily shared data?

Also, back to the original Vickers piece, the potential cancer screening example is interesting, but Vickers is the one in the wrong. Years have passed since his initial query. Did he go back to these people after a year and ask again? After 2 years? Or did he make just one short query and then sit on his butt for years as a potential treatment was ignored, until he could bash others over the head with the story in a NYTimes op-ed? In that case, he also had an ethical lapse.

brownshirt bill sez: There's also more than a hint in there of "la la la I've got mine la la la I don't want to know about any problems in the system la la la".

I most assuredly want to know about actual problems in the system. I do not want yahoos with an anti-science agenda to gain any more traction than they already have.

If you read my stuff on the denial of drug abuse science you may understand a little better where I am coming from. The upfront, normal limitations in our step-by-step, limited-scope publication of research progress are used to cast doubt on the whole shebang. Normal limitations to a given paper are represented as showing that all of the research is "flawed", as if this means that it is all "wrong".

Knowing what I do about the types of normal day-to-day stuff that goes on in various types of labs, I can make some pretty good guesses about the use to which anti-science types would put such an understanding of lab reality.

I don't know. Maybe nothing ever goes wrong in your work. Maybe everything each and every day lives up to some mythical LabConduct101 standard. Maybe there are never any decisions you make, choices of experiments to do or not do, that could possibly be misrepresented as "bad science".

In fact, if I had to guess I'd say that most scientists could share most of their data accurately and fairly readily.

I don't agree. Obviously others in this thread don't agree with this premise either.

bsci,

am late for work but off the top of my head: do a quick google for Brian Martinson (lead author of that "scientists behaving badly" article), or skim these links; here's a Nature misconduct focus, and another article about "normal misconduct".

Special disclaimer for DM: I'm not claiming the sky is falling. I'm just saying things aren't as good as they could be, and I think there's a reasonable amount of evidence on my side.

brownshirt bill sez [...]
Maybe nothing ever goes wrong in your work.

OK, now you're just making fun of me.

Of course I screw up; of course someone with a poisonous agenda could mis-represent things I've done as "bad science". The thing is, I'm just not as worried as you are about the anti-science yahoos finding out about what goes wrong in my work. I have only been in the US five years, but I'd have thought that if the anti-science brigade were as dangerous as you...

Heh. Stop there. I seem to be taking the same attitude to your concerns about anti-science yahoos that you take to my concerns about procedural and systematic lapses in labs. I can already hear the clatter of your fingers on the keyboard, reminding me what happened to Edie London. Fair enough.

But I think various things above hit your hot buttons, and then you stomped callously on mine, and that took us off on a tangent; fraud was not really the point of Janet's post.

I bet we agree that effective sharing of research data and materials is central to science. I am simply saying that, to the extent that one is paid to do science, one is paid to share, at least with other gummint thinkmonkeys -- and too many researchers don't take this part of their jobs sufficiently seriously.

For those who are interested in privacy, I suggest reading this:
k-anonymity
abstract:
Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus and k-Similar provide guarantees of privacy protection.
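
As a rough illustration of the idea in that abstract (not of the Datafly, µ-Argus, or k-Similar systems themselves), here is a minimal sketch that generalizes quasi-identifiers in a made-up table and then checks what k it achieves; all columns and values are invented.

```python
# Minimal sketch of k-anonymity via generalization on a made-up table;
# this is not the Datafly / mu-Argus / k-Similar implementation.
import pandas as pd

df = pd.DataFrame({
    "zip": ["02139", "02139", "02138", "02138", "02139"],
    "age": [34, 36, 52, 55, 33],
    "dx":  ["melanoma", "lymphoma", "melanoma", "melanoma", "lymphoma"],
})

# Generalize the quasi-identifiers: blank the last zip digit, bucket ages by decade.
df["zip"] = df["zip"].str[:4] + "*"
df["age"] = (df["age"] // 10) * 10

def achieved_k(frame, quasi_identifiers):
    """Smallest group size over the quasi-identifier combinations, i.e. the k achieved."""
    return int(frame.groupby(quasi_identifiers).size().min())

print(achieved_k(df, ["zip", "age"]))  # release only if this meets your chosen k
```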


Anonymity can be harder than a k-anonymity structure can handle. For example, let's say we're dealing with a brain cancer study and some of the information is MRI images that include tumor sizes, locations, etc. The data includes the entire head. Anyone could take that MRI image, make a 3D rendering of it, and have a fairly accurate picture of the subject's face (without hair). How could a k-1 anonymity system possibly work in this case?

In this case, the solution is to edit the actual MRI images to remove the skin, thus making them unidentifiable. For a dataset of any reasonable size, this is time consuming and not something a researcher would do unless they were going to share their data with someone not covered by the same HIPAA protocols.
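
A minimal sketch of the kind of edit being described -- zeroing everything outside the brain so skin and face cannot be reconstructed -- might look like the following, assuming a brain mask has already been computed for each scan (the filenames and the mask are assumptions, and real defacing pipelines are considerably more careful).

```python
# Minimal sketch: masking an MRI volume with a precomputed brain mask using nibabel.
# Filenames are placeholders and the mask is assumed to already exist.
import nibabel as nib
import numpy as np

scan = nib.load("subject01_T1w.nii.gz")        # anatomical image, includes face and skin
mask = nib.load("subject01_brainmask.nii.gz")  # 1 inside the brain, 0 elsewhere

masked = scan.get_fdata() * (mask.get_fdata() > 0)  # zero out skin, face, and other non-brain voxels

nib.save(nib.Nifti1Image(masked.astype(np.float32), scan.affine),
         "subject01_T1w_defaced.nii.gz")
```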

Even if this is done, brain structure is fairly unique. If you had a single image of a known individual's brain, it would be trivial to match it to the brain in a larger anonymous database and pull all additional medical information connected to that person. This example is a stretch, but well within realistic possibility.

Again, bsci explains it nicely (and again, I'd ask what the perceived upside is for a lawyer who couldn't care less about "Mertonian norms").

But details of implementation aside, my understanding of how k-anonymity works (your link doesn't work; I went to this) is that it provides anonymity by introducing noise. In a real clinical dataset, unlike in her toy example, I'd be amazed if the required blurring didn't wipe out the utility of the data.

Even the most rabid Mertonian should understand why I (let alone my employer) might be reluctant to unquestioningly give out data that are guaranteed to yield deviant results!

I think bsci offers an example of why k-anonymity is not the tool to use when the interesting data itself is identifiable (like a fingerprint). If that is the only case you care about, then it probably will not help you.

I would not characterize it as "introducing noise"; here is the first sentence in the description from the page you referenced:
Consider a data holder ... that has a privately held collection of person-specific, field structured data.
The basic concept is to redact elements in the dataset until sufficient ambiguity is established.


The basic concept is to redact elements in the dataset until sufficient ambiguity is established.

1) The paper has the example of replacing the last digit of zip codes with a *. I suppose maybe that's technically "reducing signal" and not "introducing noise" -- I'm not an information theorist.

2) My point is unchanged regardless of 1).

Things still not considered. I'll stick with MRI data, but this can be true for any expensive medical test that doesn't just send a blood sample to a lab.
Any test needs to list the exact piece of equipment used for data collection. Based on 4 digits of a zip code and the specific equipment brand, it should be possible to narrow the data down to one or two zip codes. With a study that uses multiple pieces of expensive equipment, it would probably be possible to narrow the data collection down to a couple of hospitals without any zip code info (or, in reverse, if you know patient X was in hospital Y, you could search for patients who used the equipment in that hospital).

I guess my point with all this is that full anonymity in large studies is virtually impossible unless you keep the data restricted to trusted people which takes work and also means some random statistician doesn't have automatic and rapid rights to any requested data.
