The NIH Grant Review Process Is Statistically Unsound?

Who woulda thunk it? A recent paper in PLoS One argues that the NIH review process uses far too few reviewers to claim the level of scoring precision that the NIH provides.

NIH grants are scored on a scale from 1.0 to 5.0, with 1.0 being the best; reviewers can grade in tenths of a point (i.e., 1.1, 2.3, etc.). The authors, using some very straightforward statistics, demonstrate that four reviewers could accurately assign whole integer scores (1, 2, 3...), but to obtain reliable scores with a precision of 0.01, a proposal would require 38,416 reviewers.
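The arithmetic behind those numbers appears to be the standard sample-size formula for estimating a mean, n = (z·σ/ε)², assuming a 95% confidence level (z ≈ 1.96) and a reviewer-score standard deviation of about 1; those assumptions are mine, chosen because they reproduce the paper's figures. A quick sketch:

```python
# Rough sketch of the reviewers-per-precision arithmetic.
# Assumes independent reviewers, sd = 1, and a 95% confidence level;
# these are my assumptions, chosen to reproduce the paper's 4 and 38,416 figures.

Z = 1.96   # two-sided 95% confidence level
SD = 1.0   # assumed standard deviation of individual reviewer scores

def reviewers_needed(precision, sd=SD, z=Z):
    """Reviewers needed so the 95% CI half-width of the mean score <= precision."""
    return (z * sd / precision) ** 2

for precision in (1.0, 0.1, 0.01, 0.001):
    print(f"precision {precision}: about {reviewers_needed(precision):,.0f} reviewers")

# precision 1.0   -> about 4
# precision 0.1   -> about 384
# precision 0.01  -> about 38,416
# precision 0.001 -> about 3,841,600
```

The 0.1 row is where the "hundreds of reviewers" figure I mention below comes from.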

Not going to happen. Keep in mind that NIH is considering moving to scores with a supposed precision of 0.001. The authors note:

The disconnect between the needed precision in order to allocate funds in a fair way and the number of reviewers required for this level of precision demonstrates a major inconsistency that underlies NIH peer review. With only four reviewers used for the evaluation of applications, an allocation system that requires a precision level in the range of 0.01 to 0.1 is not statistically meaningful and consequently not reliable. Moreover, the 4 reviewers NIH proposes are not independent which degrades the precision that could be obtained otherwise.

Consequently, NIH faces a major challenge. On the one hand, a fine-grained evaluation is mandated by their review process. On the other hand, for such criterion to be consistent and meaningful, an unrealistically high number of evaluators, independent of each other, need to be involved for each and every proposal.

They also argue that the inappropriately small number of reviewers is stifling novel proposals:

...4 independent evaluators can provide statistical legitimacy only under the circumstance of all evaluators giving essentially the same evaluation. For proposals that are expected to be more controversial, as potentially transformative ideas have been proposed to be, a small number of evaluators would lead to unreliable mean estimates.

In the conclusion, there's some pretty good snark (italics mine):

It is commonly accepted that NIH will not fund clinical trials that do not include a cogent sample size determination. It is ironic that NIH insists on this analysis for clinical studies but has not recognized its value in evaluating its own system of peer review. We posit that this analysis should be considered in the revisions of NIH scientific review.

The NIH peer review structure has not been based in rigorous applications of statistical principles involving sampling. It is this deficiency that explains the statistical weakness and inconsistency of NIH peer review.

My only quibble with this article is that the scores with a realistic shot at funding typically range from 1.0 to 1.4 (although, like high school grades, there has been significant 'grade inflation'), and I'm not sure what that narrower range does to some of the estimates. Granted, needing 'only' hundreds of reviewers isn't comforting either.

One proposed solution is to radically shorten proposals to one or a few pages, so the number of reviewers can increase. Before you think this is crazy, the 'meat' of genomics white papers is typically only a few pages long (the rest is usually a discussion of how the sequencing will be done, which presumably the major sequencing centers have figured out by now).

I've always thought that proposals are like students applying to a 'highly selective' college: you kick out the bottom two-thirds, there is a small number of really qualified students that you obviously want, and the rest are pretty interchangeable (not that you want to tell the customers, er, students that...). My solution would be to keep the current process, triage the bottom sixty percent, and then randomly pick from the remainder, with the exception of any proposal that was scored in the top ten percent by the reviewers assigned to that grant.
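To make that concrete, here is a toy sketch of the triage-plus-lottery idea; the function name, the thresholds, and the assumption that the protected top scorers are funded outright are all mine, purely for illustration, not anything NIH or the paper proposes:

```python
import random

def select_proposals(scores, budget, triage_frac=0.60, protect_frac=0.10, seed=None):
    """Toy triage-plus-lottery funding scheme (illustrative only).

    scores: dict of proposal id -> mean reviewer score (lower is better).
    budget: number of proposals that can be funded.
    The worst `triage_frac` of proposals are dropped, the best `protect_frac`
    are funded outright, and the rest of the budget is filled by lottery.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)                      # best (lowest) score first
    keep = ranked[: int(len(ranked) * (1 - triage_frac))]        # triage the bottom 60%
    protected = keep[: max(1, int(len(ranked) * protect_frac))]  # top 10%, funded outright
    pool = [p for p in keep if p not in protected]
    lottery = rng.sample(pool, k=min(max(budget - len(protected), 0), len(pool)))
    return protected + lottery

# Example: 20 hypothetical proposals scored between 1.0 and 3.0, money for 5 of them.
rng = random.Random(1)
scores = {f"proposal-{i:02d}": round(rng.uniform(1.0, 3.0), 2) for i in range(20)}
print(select_proposals(scores, budget=5, seed=42))
```

Seeding the lottery would at least make the randomness auditable.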

Although I do like the idea of one page proposals....

Cited articles: Kaplan D, Lacetera N, Kaplan C (2008) Sample Size and Precision in NIH Peer Review. PLoS ONE 3(7): e2761. doi:10.1371/journal.pone.0002761

Kaplan D (2007) POINT: Statistical analysis in NIH peer review--identifying innovation. The FASEB Journal 21: 305-308.

There seems to be something missing in that article. They note that the proposal scores are graded in steps of 0.01, but it is unclear whether NIH treats steps that small as hard funding cutoffs. From what I understand (I've never been involved in this process), closely scored proposals are not always chosen in strict numerical order. For example, a proposal in an understudied area might get a small boost despite a slightly lower score.

Second, does anyone claim that all proposals can be ranked by quality in any purely linear manner? Other factors are also involved. Even with 38K reviewers, the scores would depend greatly on the instructions given to the reviewers.

I don't think there will ever be a way to remove all the noise from the system. I see the ultimate goal as making sure all truly amazing studies get funded and designing some unbiased system to send the rest of the money to some subset of good/great proposals. Until you can explain to me exactly how a 1.3 rating is great science while a 1.4 deserves zero money, I don't see the point in worrying about fine numerical accuracy.

Thanks for pointing this out! I can think of a dozen, no wait, a few dozen people who would like to read this (and then chuckle). /applaud

Yet one more reason why meritocracies are impossible in practice.

By Michael Schmidt (not verified) on 28 Jul 2008 #permalink

Not having submitted proposals to NIH in the past (nor, as a marine biologist, am I ever likely to), I don't know whether there's a pre-proposal screen in place. In that scenario, the original RFP calls for one- or two-page pre-proposals, and you then invite only a subset to submit full proposals -- it's yet another way to screen out the avalanche of submissions while keeping the process meritocratic.

By FishGuyDave (not verified) on 29 Jul 2008 #permalink

Reading the article, I have the impression that it is the sour reaction of somebody who didn't get their grant funded by NIH.
Honestly, I was astonished by Figure 3 of that article, especially the lower-right panel: the sample is composed of 40 voters. If you take a "sub-sample" of 10-12 voters, you end up with an average score of 5, whereas if you take all 40 voters (including the ones voting 5) you end up with an average score of 1.
I would really need the authors to explain to me how that could happen: I naively thought that (10 x 5 + 30 x 1) / 40 = 80 / 40 = 2.

On the other side, here is how an NIH grant review panel works: all the members of the panel have access to all the grant proposals and have to review and score a portion of them as primary, secondary or discussant reviewer. Before the meeting, grants that have received unanimously bad scores are eliminated from the process. However, any member of the panel can request that a given grant be brought back to the table for discussion.
Then all the "scored grants" are discussed one at a time: the primary, secondary and discussant reviewers describe the proposal and give their comments. There is a discussion among the panel, and each of the reviewers then gives a revised score. The other members of the panel then score the proposal within the range set by the lowest and highest reviewer scores. It is possible to vote outside this range, but if the disagreement is greater than 0.3 or 0.5 points, the voter needs to publicly justify it, since that shouldn't happen after the discussion process.
The final score is the average of all the votes.
Any review process is very subjective and far from an exact science. A bigger challenge is to have a review panel composed of members whose expertise covers all the science proposed in the applications.
Honestly, I am not so sure that having each grant reviewed by one more person would help much. What every reviewer would definitely appreciate is shorter, well-written, well-structured grants.
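If it helps, here is a very rough sketch of that voting step; the panel size, the scores, and the uniform-vote assumption are all made up for illustration, and the real discussion is obviously not random:

```python
import random
import statistics

def panel_score(reviewer_scores, n_panel=20, seed=None):
    """Rough simulation of the post-discussion vote (my simplification).

    reviewer_scores: post-discussion scores from the assigned reviewers
    (primary, secondary, discussant). The other panel members vote in tenths
    within the range those scores define; the final score is the mean of all votes.
    """
    rng = random.Random(seed)
    low, high = min(reviewer_scores), max(reviewer_scores)
    panel_votes = [round(rng.uniform(low, high), 1) for _ in range(n_panel)]
    return round(statistics.mean(list(reviewer_scores) + panel_votes), 2)

print(panel_score([1.4, 1.6, 2.0], seed=0))  # prints a final score between 1.4 and 2.0
```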

The sad reality is that, in general, there are more grant proposals deserving to be funded than there are funds available. Maybe the best thing to do, after the panel review process, would be to randomly select proposals from among the best-scoring ones.
At least you would have bad luck to blame, rather than a bureaucrat, if your grant didn't get funded...

By Paolo Amedeo (not verified) on 29 Jul 2008 #permalink
