Mike the Mad Biologist

RELEASEDATA.aspx

Well, someone at ScienceBlogs had to draw down on Scientopia, and it might as well be the Mad Biologist. I was going to respond to this post by proflikesubstance about genomics and data release in a calm, serious, and respectful manner, and, then, I thought, “Fuck that. I’m the Mad Biologist. I have a reputation to uphold.”

Anyway, onto genomics and data release. Proflikesubstance writes:

I learned something interesting that I didn’t know about the sharing of genomic data: almost all major genomics centers are going to a zero-embargo data release policy. Essentially, once the sequencing is done and the annotation has been run, the data is on the web in a searchable and downloadable format.

Yikes.

Sounds scary. But two points: most NIAID-funded genomes already lack data embargoes, and there has been no final decision about embargoing some of the metadata (these are the data attached to each genome, such as patient status).

Moving along (italics mine):

…obviously no one is going to yank a genome paper right out from under the group working on it, but what about comparative studies? What about searching out specific genes for multi-gene phylogenetics? Where is the line for what is permissible to use before the genome is published? How much of a grace period do people get with data that has gone public, but that they paid for?

That last sentence is incorrect. I’ve discussed the difference between large and small genome centers before, but it’s worth revisiting. At the large centers, which I think are what proflikesubstance is referring to, much of the genome sequencing is not funded by R01s, but by contracts to the centers: the funding does not come from the outside investigator. (Large centers do have investigator-driven R01 collaborations, and smaller centers live on these grants. In these cases, the centers often are willing to hold onto the data*.)

Several of the large centers, including the one I work at, are funded by NIAID to sequence microorganisms related to human health and disease (analogous programs for human biology are supported by NHGRI). There’s a reason why NIH is hard-assed about data release:

Funding agencies learned this the hard way, as too many early sequencing centers resembled ‘genomic roach motels’: DNA checks in, but sequence doesn’t check out.

The funding agencies’ mission is to improve human health (or some other laudable goal), not to improve someone’s tenure package. This might seem harsh unless we remember how many of these center-based genome projects are funded. The investigator’s grant is not paying for the sequencing. In the case of NIAID, there is a white paper process. Before NIAID will approve the project, several goals have to be met in the white paper (Note: while I’m discussing NIAID, other agencies have a similar process, if different scientific objectives).

Obviously, the organism and collection of strains to be sequenced have to be relevant to human health. But the project also must have significant community input. NIAID absolutely does not want this to be an end-run around R01 grants. Consequently, these sequencing projects should not be a project that belongs to a single lab, and which lacks involvement by others in the subdiscipline (“this looks like an R01” is a pejorative). It also has to provide a community resource. In other words, data from a successful project should be used rapidly by other groups: that’s the whole point (otherwise, write an R01 proposal). The white paper should also contain a general description of the analysis goals of the project (and, ideally, who in the collaborative group will address them). If you get ‘scooped’, that’s, in part, a project planning issue.

NIAID, along with other agencies and institutes, is pushing hard for rapid public release. Why does NIAID get to call the shots? Because it’s their money.

Which brings me to the issue of ‘whose’ genomes these are. The answer is very simple: NIH’s (and by extension, the American people’s). As I mentioned above, NIH doesn’t care about your tenure package, or your dissertation (given that many dissertations and research programs are funded in part or in their entirety by NIH and other agencies, they’re already being generous†). What they want is high-quality data that are accessible to as many researchers as possible as quickly as possible. To put this (very) bluntly, medically important data should not be held hostage by career notions. That is the ethical position.

This is a data generation mission, not a publication mission:

…for the large centers, this is essentially contract work: the funding agency has determined that a certain amount of genomic data is required to aid other scientists in one or more disciplines, and the center is obligated to deliver these data.

That doesn’t mean that there isn’t a recognition by NIH that researchers involved in the project have made investments; there is. But the ‘donators’ have several advantages: they know what questions they want to answer already, they have had early access to the metadata, and the center they’re collaborating with will often assist them bioinformatically**. And NIAID is willing to embargo metadata (but usually not the sequence) in certain situations.

So what about getting scooped? Well, there is the toothless Ft. Lauderdale Agreement, but it’s not like there’s a team of elite strike force genomics commandos (although if there were, I would so sign up for that). But again, if you know the data are coming, you have to be prepared to seize the moment. Your lab has to move quickly***; if your group can’t, then find a collaborator who can. Ultimately, I think we will have to move to a two-part reward structure, where generation of useful data is rewarded by itself.

Anyway, I guess what made me cranky was the idea that these data belong to an individual researcher. The questions we should be working on are too important to be parochial about ownership–and to NIH’s credit, much of the genomics funding reflects this idea.

Update: Proflikesubstance responds.

*It’s a pain in the ass, as the data release processes, especially the raw sequence data files, are heavily automated.

†One point that is often overlooked is that the contractual arrangement is not between you and the funding agency, but between your institution and the funding agency. It was never ‘your’ funding.

**The major centers have basic ‘birth announcement’ informatics pipelines that facilitate getting a basic “here’s what we found” type of paper, so the work that went into DNA extraction, choosing strains, and helping write the white paper results in a quick publication.

***To give you an idea of what “quickly” means, in the Human Microbiome Project, a preliminary analysis of data will often occur days after the raw data come off the machines.

Comments

  1. Prof-like Substance
    August 16, 2010

    Maybe you missed the asterisk that I used in a couple of places, which states:

    *Obviously we are talking about grant-funded projects, so the money is tax payer money not any one person’s. Nevertheless, someone came up with the idea and got it funded, so there is some ownership there.

    I am fully aware of how the system works, but I think you’re taking an overly narrow view here. I’m not talking about data with immediate medical relevance, although I know these are the only data NIHers think of. It is true that lots of people study things that don’t have immediate medical impact. I know, I’ve looked it up.

    For those working on the frontiers of organismal groups where there currently is NO genomic data, the issues of data embargo are enormous, both because there is a lot of data to assess (since much of it is novel, the automated methods of annotation don’t work so well and need to be trained or otherwise worked around) and because of the interest many have in extending a story to include another group of organisms. The analysis takes longer because there is no ‘reference’ to use, and the issue of de novo assembly takes on a different meaning. At the same time, cherry picking a certain story is no harder.

    What I can’t understand is what is wrong with, say, a three month investigator exclusivity to data produced by ‘large’ centers that are directly funded? JGI started with a year embargo, then six months, and now none. One can argue that they did this in response to slow moving projects, but it looks from the outside to be mandated from above. With the decreasing cost of genomics, people can do more ‘in house’ than ever before. What is the motivation going to be in a year or two to have free data you have no control over?
