A colleague of mine (who has time to read actual printed-on-paper newspapers in the morning) pointed me toward an essay by Andrew Vickers in the New York Times (22 January 2008) wondering why cancer researchers are so unwilling to share their data. Here’s Vickers’ point of entry to the issue:
[A]s a statistician who designs and analyzes cancer studies, I regularly ask other researchers to provide additional information or raw data. Sometimes I want to use the data to test out a new idea or method of statistical analysis. And knowing exactly what happened in past studies can help me design better research for the future. Occasionally, however, there are statistical analyses I could run that might make an immediate and important impact on the lives of cancer patients.
You’d think cancer researchers would welcome collaborations with statisticians, since statisticians are the ones trained to work out what the data show, and with what confidence. Moreover, once data have been collected, you’d think cancer researchers would want to make sure that the maximum amount of knowledge be squeezed out of them — bringing us all closer to understanding the phenomena they’re studying.
As Vickers tells it, cancer researchers seem to have other concerns they find more pressing, since his requests for data and other sorts of information are often refused. Among the reasons the researchers give to keep their data to themselves:
- Data sharing would violate patient privacy.
- It would be too difficult to put together a data set.
- The analysis of the data might “cast doubt” on the results of the researchers who collected the data.
- The person asking for the data might use “invalid methods” to analyze it.
- The researchers being asked for the data “might consider a similar analysis at some point in the future”.
Vickers responds to these:
- It is possible (and not even very hard) to replace patient names with codes to protect patient privacy.
- One usually has to put together a data set in order to publish one’s results — so why would sharing data after you’ve published require putting together another data set?
- It’s a statistician’s job to recognize valid and invalid methods for data analysis, and the scientific community would certainly weigh in with its judgment in case the statistician made a bad call.
- The researchers who said they might want to perform an analysis of their data similar to the one Vickers proposed still hadn’t done so, years later.
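Vickers’ first response — swapping names for codes — really is a small amount of work. Here’s a minimal sketch of what that pseudonymization might look like; the field names and the secret key are hypothetical, and a real study would also need to strip or coarsen other identifying fields (dates of birth, zip codes, and the like):

```python
import hmac
import hashlib

# Hypothetical secret key, held only by the original research team.
# It lets them re-derive a patient's code if needed; recipients of the
# shared data set see only the codes, never the names.
SECRET_KEY = b"held-by-the-original-research-team"

def pseudonymize(name: str) -> str:
    """Map a patient name to a stable, non-reversible code."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:10]

# Made-up records for illustration only.
records = [
    {"name": "Alice Jones", "tumor_stage": "II", "response": "partial"},
    {"name": "Bob Smith", "tumor_stage": "III", "response": "none"},
]

# The version of the data set that could be shared with outside statisticians.
shared = [{**r, "name": pseudonymize(r["name"])} for r in records]
```

Because the code is derived deterministically from the name and the key, the same patient gets the same code across shared files, so an outside statistician can still link records without ever seeing an identity.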
As for whether a further analysis could cast doubt on researchers’ results, I would have thought this falls pretty squarely into the “self-correcting nature of science” bin — which is to say, job one is to get results that are accurate reflections of the phenomena you’re studying. If your initial results don’t get the reality quite right, shouldn’t you welcome a reanalysis that closes that gap?
(Of course, I suppose there are cases in which the worry is that the reanalysis will cast doubt on one’s claim actually to have performed the research one has reported. This is another sort of situation where science is touted as being self-correcting — and where clinging too tightly to one’s data might be a clue to a bad conscience.)
The Vickers essay makes the case that, especially in cancer research, real cancer patients may be the ones most hurt by researchers’ territoriality with their data. The results that might be drawn from the data are not mere abstractions, but could hold the key to better treatments or even cures.
Are there sound scientific reasons not to share such data with other researchers?
In a post of yore, I wrote:
[W]e want to make sure that the conclusions we draw from the data we get are as unbiased as possible. Looking at data can sometimes feel like looking at clouds (“Look! A ducky!”), but scientists want to figure out what the data tells us about the phenomenon — not about the ways we’re predisposed to see that phenomenon. In order to ensure that the researchers (and patients) are not too influenced by their hunches, you make the clinical trial double-blind: while the study is underway and the data is being collected, neither study participants nor researchers know which participants are in the control group and which are in the treatment group. And, at the end of it all, rather than just giving an impression of what the data means, the researchers turn to statistical analyses to work up the data. These analyses, when properly applied, give some context to the result — what’s the chance that the effect we saw (or the effect we didn’t see) can be attributed to random chance or sampling error rather than its being an accurate reflection of the phenomenon under study?
The statistical analyses you intend to use point to the sample size you need to examine to achieve the desired confidence in your result. It’s also likely that statistical considerations play a part in deciding the proper duration of the study (which, of course, will have some effect on setting cut-off dates for data collection). For the purposes of clean statistical analyses, you have to specify your hypothesis (and the protocol you will use to explore it) up front, and **you can’t use the data you’ve collected to support the post hoc hypotheses that may occur to you as you look at the data** — to examine these hypotheses, you have to set up brand new studies.
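To make concrete how a planned analysis points to a sample size, here’s a sketch using the standard two-proportion formula; the response rates are made up for illustration, and the z-values are the usual constants for a two-sided test at α = 0.05 with 80% power:

```python
import math

def sample_size_two_proportions(p1: float, p2: float) -> int:
    """Per-group sample size to detect a difference between response
    rates p1 and p2 with a two-sided two-proportion z-test.

    Uses the standard textbook formula with z-values fixed for
    two-sided alpha = 0.05 and power = 0.80.
    """
    z_alpha = 1.959964  # z for two-sided alpha = 0.05
    z_beta = 0.841621   # z for 80% power
    pbar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical trial: 30% response rate on standard treatment,
# hoping to detect an improvement to 45% on the new treatment.
n_per_group = sample_size_two_proportions(0.30, 0.45)
```

With these illustrative numbers the formula calls for 163 patients per arm; shrink the difference you hope to detect and the required enrollment balloons, which is exactly why the analysis plan has to come before recruitment, not after.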
I’ve added the bold emphasis to highlight the official objection I’ve heard to data mining — namely, that using data to test post hoc hypotheses is to be avoided. But, presumably, since this is the kind of objection raised by statisticians, a statistician ought to be able to make a reasonable determination about what kinds of hypotheses are properly testable and what kinds of hypotheses are not properly testable given a particular data set generated with a particular experimental protocol. Besides, Vickers says that in large part, he’d be using the requested data sets to test-drive new methods and to plan better protocols for future experiments — so these are cases where the data would be used to test hypotheses about methodology rather than hypotheses about cancer treatments.
And surely, there seem to be other good reasons to lean towards sharing data rather than not. For one thing, there’s that whole norm of “communism”, the commitment scientists are supposed to have that knowledge is a shared resource of the scientific community. (The norm of organized skepticism might also make sharing of data, rather than secrecy about data — at least after you’ve published your own conclusions about them — the natural default position of the community.) For another thing, the funders of the research — whether the federal government or a private foundation, or even a pharmaceutical company — presumably funded it because they have an interest in coming up with better treatments. Arguably, this happens faster and better with freer communication of results and better coordination of the efforts of researchers.
And then there are the human subjects of the research. They have undertaken a risk by participating in the research. That risk is supposed to be offset by the benefits of the knowledge gained from the research. If that data sits unanalyzed, the benefits of the research are decreased and the risks undertaken by the human subjects are harder to justify. Moreover, to the extent that sitting on data instead of sharing it requires other researchers to go out and get more data of their own, this means that more human subjects are exposed to risk than might be necessary to answer the scientific questions posed in the research.
As Vickers notes, though, the researchers’ proprietary attitude toward their data is not mysterious given the context in which their careers are judged:
[T]he real issue here has more to do with status and career than with any loftier considerations. Scientists don’t want to be scooped by their own data, or have someone else challenge their conclusions with a new analysis.
If sharing your data means you give another scientist an edge over you in the battle for scarce resources (positions, funding, high impact publications, etc.), of course you’re going to think it’s a bad idea to share your data. To do otherwise is to put yourself in a position where you might not get to do science at all — and then what good will you be to the cancer patients your research might help?
But the scientific ecosystem hasn’t been this way forever. Maybe instances where what’s good for a scientist’s career runs counter to what’s good for the people that scientist’s research is supposed to help are reason enough to consider how the scientific ecosystem could be different.