Redefining the Binomial

There's an interesting post over at Statistical Modeling, Causal Inference, and Social Science on calculating probabilities. Traditionally, if you observe a certain number of events (y) in some number of trials (n), you would estimate the probability (p) of the event as y/n. To calculate the variance around this estimate, you would use this equation: p(1-p)/n.

This leads to two problems. First, if you never observe the event, your estimate of the probability of the event is zero; if you observe the event in every trial, your estimate is one. This leads to a deterministic model even if the unobserved event is possible. Second, if p is estimated as zero or one, then the estimated variance is zero (once again, suggesting a deterministic model).

To get around these problems, the formula (y+1)/(n+2) is proposed for calculating p. Using this formula, you can never get a probability of zero or one, and the variance will always be greater than zero. There is further discussion of the implications of this calculation at SMCISS.
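To make the difference concrete, here is a minimal Python sketch comparing the two estimators. The function names are mine, and I am assuming the adjusted p is simply plugged into the same p(1-p)/n variance formula, which the post does not spell out. It also assumes n is at least 1.

```python
def classical_estimate(y, n):
    """Return (p, variance) using p = y/n and variance p(1-p)/n."""
    p = y / n
    return p, p * (1 - p) / n


def adjusted_estimate(y, n):
    """Return (p, variance) using p = (y+1)/(n+2).

    Assumption: the adjusted p is plugged into the same p(1-p)/n formula.
    """
    p = (y + 1) / (n + 2)
    return p, p * (1 - p) / n


# Compare the two on the boundary cases (no events, all events) and a middle case.
for y, n in [(0, 10), (10, 10), (3, 10)]:
    print(y, n, classical_estimate(y, n), adjusted_estimate(y, n))
```

With 0 events in 10 trials, the classical estimate and its variance are both exactly zero, while the adjusted estimate is 1/12 with a variance above zero.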

So the chance of me having a three-way with Madeleine Albright and a 17-tentacled alien from Arcturus humming show tunes on the roof of Sogo department store in downtown Osaka is 0.5? Sweet!

A related concept is the use of pseudocounts in building position-specific weight matrices. Since the matrices are often log-transformed, zero counts are a royal pain, and they overestimate the improbability of that character appearing at a position. The problem is particularly acute for amino acid matrices: if you have aligned fewer than 20 proteins, you can't possibly have seen all 20 amino acids at a position, and at any position that is at all conserved you need to align many more proteins before you might see them all.
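A rough Python sketch of the pseudocount idea for a single alignment column follows. The function name, the example column, and the pseudocount of 1 per residue are illustrative assumptions, not taken from any particular tool, and the column is assumed to contain only the 20 standard residues (no gaps).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_log_probs(column, pseudocount=1.0, alphabet=AMINO_ACIDS):
    """Log-probability of each residue at one position, with pseudocounts.

    Every residue gets `pseudocount` added to its observed count, so a
    residue that was never observed still has a finite log-probability.
    """
    counts = Counter(column)  # residues not in the column count as 0
    total = len(column) + pseudocount * len(alphabet)
    return {a: math.log((counts[a] + pseudocount) / total) for a in alphabet}


column = "LLLIVLLM"  # hypothetical column from 8 aligned proteins
scores = column_log_probs(column)
print(scores["L"], scores["W"])  # W was never observed but is still finite
```

With a two-letter alphabet and a pseudocount of 1, this reduces to the (y+1)/(n+2) formula from the post. With a pseudocount of 0, any unobserved residue would hit log(0), which is the royal pain in question.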