Redefining the Binomial

There’s an interesting post over at Statistical Modeling, Causal Inference, and Social Science on estimating probabilities from counts. Traditionally, if you observe a certain number of events (y) in some number of trials (n), you would estimate the probability (p) of the event as y/n. To calculate the variance of this estimate, you would use this equation: p(1-p)/n.

This leads to two problems. First, if you never observe the event, your estimate of the probability of the event is zero; if you observe the event in every trial, your estimate is one. Either way, the model becomes deterministic even though the outcome you did not observe may still be possible. Second, if p is estimated as zero or one, then the estimated variance is also zero (once again, suggesting a deterministic model).
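A quick way to see both problems is to compute the traditional estimate for a few counts. This is only a minimal sketch in Python; the function name `naive_estimate` and the example counts are made up for illustration.

```python
def naive_estimate(y, n):
    """Traditional estimate: p = y/n with variance p(1-p)/n.

    `naive_estimate` is a hypothetical helper name, not from the original post.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    if not 0 <= y <= n:
        raise ValueError("y must be between 0 and n")
    p = y / n
    variance = p * (1 - p) / n
    return p, variance


# Illustrative counts: never observed, sometimes observed, always observed.
for y, n in [(0, 10), (3, 10), (10, 10)]:
    p, variance = naive_estimate(y, n)
    print(f"y={y}, n={n}: p={p:.3f}, variance={variance:.5f}")
```

For y = 0 and y = n, both the estimate and its variance sit exactly on the boundary, which is the deterministic behavior described above.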

To get around these problems, the formula (y+1)/(n+2) is proposed for calculating p. Using this formula, you can never get a probability of zero or one, and the variance will always be greater than zero. There is further discussion of the implications of this calculation at SMCISS.
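Here is a small Python sketch of the adjusted estimate. The function names are hypothetical, and the variance calculation assumes that the adjusted p is simply plugged into the same p(1-p)/n formula given earlier; the linked post may use a different denominator.

```python
def adjusted_p(y, n):
    """Adjusted estimate p = (y+1)/(n+2); hypothetical helper name."""
    if n < 0 or not 0 <= y <= n:
        raise ValueError("need 0 <= y <= n")
    return (y + 1) / (n + 2)


def adjusted_variance(y, n):
    """Assumption: plug the adjusted p into p(1-p)/n, so n must be positive."""
    if n <= 0:
        raise ValueError("n must be positive to compute the variance")
    p = adjusted_p(y, n)
    return p * (1 - p) / n


# Illustrative counts, including the two boundary cases.
for y, n in [(0, 10), (3, 10), (10, 10)]:
    p = adjusted_p(y, n)
    variance = adjusted_variance(y, n)
    print(f"y={y}, n={n}: p={p:.3f}, variance={variance:.5f}")
```

Because the numerator is at least 1 and always smaller than the denominator, the adjusted p stays strictly between zero and one. With no trials at all (y = 0, n = 0), it comes out to 1/2.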


  1. #1 Janne
    May 16, 2007

    So the chance of me having a three-way with Madeleine Albright and a 17-tentacled alien from Arcturus humming show tunes on the roof of Sogo department store in downtown Osaka is 0.5? Sweet!

  2. #2 Keith Robison
    May 17, 2007

    A related concept is the use of pseudocounts in position-specific weight matrix building — since the matrices are often log-transformed, zero counts are a royal pain — and an overestimate of the improbability of that character appearing at a position. The problem is particularly acute for amino acid matrices — if you have aligned fewer than 20 proteins, then you can’t possibly have seen all amino acids — and at any position at all conserved you need to align many more proteins before you might see them all.
