Mike the Mad Biologist

Should any data, not just genomic data, be held hostage by the grant award process?

Hunh? Let me back up…

By way of ScienceBlogling Daniel MacArthur, I came across this excellent post by David Dooling about, among other things, how different genome centers, based on size, have different release policies (seriously, read his post). Dooling writes (boldface mine):

The more interesting question is: why aren’t all data and research released rapidly and freely available? Since the Bermuda Principles were agreed to in 1996, all genome sequencing centers have submitted their data, from raw sequence data to finished sequence to assemblies to annotation, to public repositories as quickly after generation as possible. These principles were reinforced by the Fort Lauderdale agreement in 2003 which added a provision that protected the production centers’ right to first publication. But as we have seen recently, that provision of the Fort Lauderdale agreement is not always enforced. As sequencing has moved into medical applications, the sequencing centers have taken great pains to release human sequence data in a responsible manner, but still rapidly. What’s more, they now also release the detected variants fully annotated and correlated with phenotypic information in protected access databases available to any researcher. As data that requires more and more analysis and significant human curation are made rapidly available well before publication, the production centers become ever more vulnerable to getting “scooped” on their hard won findings.

As Church and Hillier properly conclude in the above referenced article

Sequence data are now easier to produce, but decisions about timelines for data release, publication, and ownership and standards for assembly comparison and quality assessment, as well as the tools for managing and displaying these data, need considerable attention in order to best serve the entire community. (Emphasis mine)

This conclusion begets many questions. If the rapid release described in the Bermuda Principles still holds true, why does it only apply to large-scale sequencing centers? Many researchers are generating more sequence in a month than the Human Genome Project was able to produce in a year. As they continue to be allowed to perform pre-publication (as opposed to post-generation) data submission, why are they not being held to the same standard as the large-scale sequencing centers?

I agree with Dooling: smaller projects and genome centers should have to release data in a timely manner too. But the double standard exists because of the funding incentives for these different groups (Note: I work at a large sequencing center).

For the large sequencing centers, most of the projects are geared towards genome production. That is, the funding agency assesses whether or not benchmarks for sequence (and assembly and annotation) quality have been met in the time frame expected. To put it more crassly, renewal of funding is not primarily determined by manuscript output. Renewal of funding is determined by genome output. Yes, publications by the center are included in renewals, although ‘prestigious’ publications by other groups not associated with the center can also matter. And Dooling is right: often the sequencing centers are the only groups with the bioinformatics and analytical resources and know-how to make sense of the data, so, in reality, the centers end up publishing papers using the data.

But for the large centers, this is essentially contract work: the funding agency has determined that a certain amount of genomic data is required to aid other scientists in one or more disciplines, and the center is obligated to deliver these data. That’s what pays the bills. There’s none of the all-too-typical R01 (or similar grant) progress update “we didn’t deliver what we said we would, but we found this other thing that’s interesting.” If you’re supposed to deliver X genomes, you deliver those genomes, period. In fact, some of these arrangements aren’t even grants; they’re actually federal contracts. Funding agencies learned this the hard way, as too many early sequencing centers resembled ‘genomic roach motels’: DNA checks in, but sequence doesn’t check out.

The smaller centers often do not have these arrangements. The funding agency treats their projects as typical research grants, with specific aims and hypotheses designed to address a particular research goal. But more importantly, these grants are not structured with the expectation that the funded group will rapidly deliver a set of data to a wider community. The incentive structure is that, by the end of the grant, the researchers will have addressed some specific questions. To be crass again, the ability to renew the grant (or leverage it into another grant) is determined by publication output at the time of grant renewal, which can be several years away. This creates an incentive to not share data, often to the detriment of the field as a whole.

Ideally, genomic data, once it passes quality control, would be released regardless of the size or scale of the center producing it. But the current reality is that university researchers associated with smaller centers support their careers and their universities’ genome centers through grants that are often awarded based on publications related to generated data.

This, of course, returns us to the question that Dooling poses, “[W]hy aren’t all data and research released rapidly and freely available?” Should any data, not just genomic data, be held hostage by the grant award process? As long as U.S. science is structured around small* academic labs engaged in incredibly tough competition for resources, and those resources are allocated based on publication record, I don’t see how this will change. After all, if you work really hard, only to have ‘your’ data scooped by another group, then ‘open’ data release is unfair to you. On the other hand, if researchers were judged by data production, we could have open data release policies. But that leads to a whole ‘nother set of problems, which is how one then gets funding to do analysis….

*The genome center I work at has over 1,000 people. A lab with twenty people is nothing…


  1. #1 David Dooling
    June 9, 2009

    Thanks for discussing my post. One small note (no need to approve this comment), my name is David not Daniel. Thanks.