Well, someone at ScienceBlogs had to draw down on Scientopia, and it might as well be the Mad Biologist. I was going to respond to this post by proflikesubstance about genomics and data release in a calm, serious, and respectful manner, and, then, I thought, “Fuck that. I’m the Mad Biologist. I have a reputation to uphold.”
Anyway, on to genomics and data release. Proflikesubstance writes:
I learned something interesting that I didn’t know about the sharing of genomic data: almost all major genomics centers are going to a zero-embargo data release policy. Essentially, once the sequencing is done and the annotation has been run, the data is on the web in a searchable and downloadable format.
Sounds scary. But two points: most NIAID-funded genomes already lack data embargoes, and there has been no final decision about embargoing some of the metadata (these are the data attached to each genome, such as patient status).
Moving along (italics mine):
…obviously no one is going to yank a genome paper right out from under the group working on it, but what about comparative studies? What about searching out specific genes for multi-gene phylogenetics? Where is the line for what is permissible to use before the genome is published? How much of a grace period do people get with data that has gone public, but that they paid for?
That last sentence is incorrect. I’ve discussed the difference between large and small genome centers before, but it’s worth revisiting. At the large centers, which is what I think proflikesubstance is referring to, much of the genome sequencing is not funded by R01s, but by contracts to the centers: the funding does not come from the outside investigator. (Large centers do have investigator-driven R01 collaborations, and smaller centers live on these grants. In these cases, the centers are often willing to hold onto the data*.)
Several of the large centers, including the one I work at, are funded by NIAID to sequence microorganisms related to human health and disease (analogous programs for human biology are supported by NHGRI). There’s a reason why NIH is hard-assed about data release:
Funding agencies learned this the hard way, as too many early sequencing centers resembled ‘genomic roach motels’: DNA checks in, but sequence doesn’t check out.
The funding agencies’ mission is to improve human health (or some other laudable goal), not to improve someone’s tenure package. This might seem harsh unless we remember how many of these center-based genome projects are funded. The investigator’s grant is not paying for the sequencing. In the case of NIAID, there is a white paper process. Before NIAID will approve the project, several goals have to be met in the white paper (Note: while I’m discussing NIAID, other agencies have a similar process, if different scientific objectives).
Obviously, the organism and collection of strains to be sequenced have to be relevant to human health. But the project also must have significant community input. NIAID absolutely does not want this to be an end-run around R01 grants. Consequently, a sequencing project should not belong to a single lab with no involvement by others in the subdiscipline (“this looks like an R01” is a pejorative). It also has to provide a community resource. In other words, data from a successful project should be used rapidly by other groups: that’s the whole point (otherwise, write an R01 proposal). The white paper should also contain a general description of the analysis goals of the project (and, ideally, who in the collaborative group will address them). If you get ‘scooped’, that’s, in part, a project planning issue.
NIAID, along with other agencies and institutes, is pushing hard for rapid public release. Why does NIAID get to call the shots? Because it’s their money.
Which brings me to the issue of ‘whose’ genomes these are. The answer is very simple: NIH’s (and by extension, the American people’s). As I mentioned above, NIH doesn’t care about your tenure package, or your dissertation (given that many dissertations and research programs are funded in part or in their entirety by NIH and other agencies, they’re already being generous†). What they want is high-quality data that are accessible to as many researchers as possible as quickly as possible. To put this (very) bluntly, medically important data should not be held hostage by career notions. That is the ethical position.
This is a data generation mission, not a publication mission:
…for the large centers, this is essentially contract work: the funding agency has determined that a certain amount of genomic data is required to aid other scientists in one or more disciplines, and the center is obligated to deliver these data.
That doesn’t mean that there isn’t a recognition by NIH that researchers involved in the project have made investments; there is. But the ‘donators’ have several advantages: they know what questions they want to answer already, they have had early access to the metadata, and the center they’re collaborating with will often assist them bioinformatically**. And NIAID is willing to embargo metadata (but usually not the sequence) in certain situations.
So what about getting scooped? Well, there is the toothless Ft. Lauderdale Agreement, but it’s not like there’s a team of elite strike force genomics commandos (although if there were, I would so sign up for that). But again, if you know the data are coming, you have to be prepared to seize the moment. Your lab has to move quickly***; if your group can’t, then find a collaborator who can. Ultimately, I think we will have to move to a two-part reward structure, where generation of useful data is rewarded by itself.
Anyway, I guess what made me cranky was the idea that these data belong to an individual researcher. The questions we should be working on are too important to be parochial about ownership, and to NIH’s credit, much of the genomics funding reflects this idea.
Update: Proflikesubstance responds.
*It’s a pain in the ass, as the data release processes, especially for the raw sequence data files, are heavily automated.
†One point that is often overlooked is that the contractual arrangement is not between you and the funding agency, but between your institution and the funding agency. It was never ‘your’ funding.
**The major centers have basic ‘birth announcement’ informatics pipelines that facilitate getting a basic “here’s what we found” type of paper, so the work that went into DNA extraction, choosing strains, and helping write the white paper results in a quick publication.
***To give you an idea of what “quickly” means, in the Human Microbiome Project, a preliminary analysis of data will often occur days after the raw data come off the machines.