Statistics, damn statistics and well kept secrets

By revere on December 4, 2009.

Marilyn Mann pointed me to an interesting post by David Rind over at Evidence in Medicine (thanks!). It's a follow-on to an earlier post of his about the importance of plausibility in interpreting medical literature, a subject that deserves a post of its own. In fact the piece at issue, "HIV Vaccines, p values, and Proof," raises lots of interesting questions about different ways to view what we mean by probability (the nub of the frequentist/Bayesian debate), the difference between proof and evidence, the thorny, mental-cramp-inducing question of multiple comparisons, and finally this observation:

I've been struck since the beginning of my medical training that biostatistics is an area of mathematics where people can have remarkably different opinions about what is or is not reasonable. I don't usually think of "opinion" and "mathematics" falling in the same sentence, but perhaps that just shows my lack of expertise in theoretical math. (David Rind, Evidence in Medicine)

Rind gives an essentially correct version of what a p-value is, at least from the frequentist perspective. Well it's close enough for practical statistics, anyway. Here's what he says about a difference between HIV infections in a randomized, placebo-controlled trial of a vaccine that showed a difference in the two groups that had a p-value of .04 (4%):

The analysis that showed the statistically significant benefit (p = 0.04), was a reasonable one. There were 56 people who developed HIV in the vaccine arm, and 76 in the placebo arm, but seven patients were found to be infected with HIV at the time the vaccine was administered, and this analysis excluded those patients. Before we come back to that decision, we should keep in mind what that p value means -- what it is that has a 4% chance of having happened?

Although this causes untold confusion for physicians at all levels of training, that p value says that (excluding any problems with the design or performance of the trial), if vaccine were no better than placebo we would expect to see a difference as large or larger than the one seen in this trial only 4 in 100 times. This is distinctly different from saying that there is a 96% chance that this result is correct, which is how many people wrongly interpret such a p value.

[NB: a reader says these numbers are for intention to treat and give a different p-value, but the point was about p-values, not the vaccine paper, so just assume the .04 number as given in Rind's post is correct or change the p-value to .08; what the p-value means is the same].

It's a little hard to say how Rind thinks of probability, since this is a distinctly frequentist formulation although at other points he talks like a Bayesian. You don't have to follow that difference at this juncture, but I'd like to unpack his version a bit more so you can see some of the machinery under it.

p-values are usually associated with the frequentist perspective. They are the product of some statistical test (I'm not sure which one was used in the paper, but it doesn't matter for purposes of this discussion). Statistical tests these days can be very sophisticated mathematically but most have the same kind of underlying idea. For a frequentist, probability is synonymous with the idea of the long run frequency of an event, hence the term "frequentist." The classic example is flipping a coin and counting the frequency (number of times) it comes up heads or tails. If you flip a fair coin 10 times you might get 6 heads and 4 tails by chance. The frequency of heads is thus 60%, but we wouldn't say from so little data this proves the coin was unfair, with a 60% probability of coming up heads. If we wanted a better idea of the coin's "true" probability of coming up heads we'd flip it many more times, say 1000. Or 10,000. The more times we flip it the better fix we get on the true probability of it coming up heads (which for a coin that was truly "fair" would be 50%). If we imagine flipping a coin over and over and measuring the frequency of heads as we go along, it should approach the true probability as the number of flips gets very, very large. If we perform the mathematical magic of increasing the number of flips without limit and look to see what relative frequency is approached closer and closer, that would be the definition of true probability for a frequentist (Bayesians have a completely different notion of probability based on degree of belief, confidence or, if you want to spell it out, how much they are willing to bet on one outcome or the other as a measure of degree of belief).
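To make the long-run frequency idea concrete, here is a minimal simulation sketch. The function name, the seed and the flip counts are my own choices for illustration, not anything from Rind's post; the only point is that the observed frequency of heads wanders at small numbers of flips and settles near the true probability as the number grows.

```python
import random

def running_frequency(n_flips, p_heads=0.5, seed=0):
    """Flip a simulated coin n_flips times and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    for _ in range(n_flips):
        if rng.random() < p_heads:
            heads += 1
    return heads / n_flips

# The observed frequency should drift toward p_heads as n grows.
for n in (10, 1000, 100000):
    print(n, running_frequency(n))
```

A computer can only ever do a finite number of flips, of course, so this illustrates the limit rather than reaching it.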

It's a little hard to see what the frequentist perspective has to do with the vaccine trial at first. After all, we aren't doing the trial over and over again. We are only doing it once. We have to imagine something like this. We have a giant (actually infinite) population of people at risk of contracting HIV. Let's pull out 16,000 people (the number is from the vaccine trial) and give half the vaccine and leave half without vaccine. We assume that each of these people has a certain probability of contracting HIV in 5 years. So in 5 years we count up the number of HIV infected in each group (which in the case of the vaccine paper was 56 for the vaccinated group and 76 for the unvaccinated group). Imagine each infection as a biased coin flip with the chance of coming up heads equal to the chance of being infected. If the chance of getting HIV were exactly the same in both groups, you might still see a different number of infections by chance, just as you might see a different number of heads and tails by chance in two sets of 10 coin flips. The more people (the more times you flip the coin), the better the true probability of infection will be given in each group, which is why small clinical trials are less informative than large ones.
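The imagined repetition can be sketched in a few lines. Everything here is an assumption made for illustration: 8,000 people per arm (half of the 16,000), a common risk of 66/8000 (the average of the 56 and 76 infections), and independent "biased coin flips" for each person. It is not the analysis used in the paper.

```python
import random

def simulate_arm(n_people, risk, rng):
    """Count infections when each person is an independent biased coin flip."""
    return sum(1 for _ in range(n_people) if rng.random() < risk)

rng = random.Random(1)   # fixed seed so the run is repeatable
N_PER_ARM = 8000         # assumed: half of the 16,000 participants
RISK = 66 / 8000         # assumed common risk in BOTH arms (vaccine does nothing)

# Repeat the imaginary trial a few times and watch the two counts differ by chance.
for trial in range(5):
    vaccine = simulate_arm(N_PER_ARM, RISK, rng)
    placebo = simulate_arm(N_PER_ARM, RISK, rng)
    print(trial, vaccine, placebo, placebo - vaccine)
```

Even though both arms have exactly the same risk by construction, the two counts will rarely be equal.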

Before getting to the p-value, let me emphasize something Rind mentions parenthetically but is crucially important: we must exclude "any problems with the design or performance of the trial." The statistical testing assumes an idealized situation, in particular one where there are no errors in how things were set up or the experiment performed (crudely put, you can't use the equivalent of broken meters and expect sophisticated statistical testing to give you the right answer). The set-up needs to be as much like an idealized coin tossing-like situation as possible, and departures from it will give incorrect answers. Of course the imagined idealization doesn't have to be like coin tossing. It could be like picking different colored marbles from multiple urns or tossing a die with 6 sides (or 12 sides or 13 sides) or more elaborately, things that are based on underlying physical principles that tell you what kind of probabilities to expect. The idealized set-up is referred to as the underlying probability model.

With the help of an underlying probability model we can use mathematics to calculate how frequently we would get a difference at least as large as 56 infected in one group versus 76 in the other if the vaccine really did nothing. We express that frequency as a p-value. As Rind correctly puts it, a p-value of .04 means that if we did this over and over again and the vaccine really did nothing, we would still expect a difference as big as or bigger than 56 vs 76 to occur only about 4% of the time. Since it is a frequentist probability, it requires imagining performing the trial over and over again. We don't need to do this in reality if we have an underlying probability model, because we can calculate what would happen if we did. For example, if we assumed a coin was fair, or biased to come up heads 60% of the time, or whatever we might choose, we could use mathematics to calculate how likely 6 heads out of 10 would be without actually doing the coin flipping. That's what a statistical test is doing in the vaccine trial, but it requires probability model assumptions so that we don't have to do the trial over and over again in the real world to get the probabilities in the two groups (remember what probability means to a frequentist).
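The coin calculation can be written out directly. This is a small sketch using the binomial model, with function names of my own choosing; it computes the chance of exactly 6 heads in 10 flips under a fair coin and under a 60% coin, plus the chance of 6 or more heads under a fair coin, which is the "as big or bigger" idea behind a one-sided p-value.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p):
    """Probability of k or more heads in n flips: an 'as extreme or more' tail."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

print(binomial_pmf(6, 10, 0.5))  # exactly 6 heads, fair coin: 210/1024
print(binomial_pmf(6, 10, 0.6))  # exactly 6 heads, coin biased 60% to heads
print(upper_tail(6, 10, 0.5))    # 6 or more heads, fair coin: 386/1024
```

No coin is flipped anywhere in this code, which is the point: once a probability model is assumed, the long-run frequencies follow by arithmetic.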

Notice how many things are at play, here. There is the correct identification of a population all of whom are at risk of developing HIV and that risk is essentially the same (there are sophisticated ways to adjust for differences in people but let's keep it simple here; while adjustments can be made, they require additional assumptions so you haven't eliminated them, just used different ones). Said another way, you correctly identified the at risk population and the underlying probability model that lets you make the test calculation. In addition, you didn't make any systematic mistakes (something that affected one group more than the other) in whom you chose to study, which ones to get the vaccine, how you gave them the vaccine, how you diagnosed HIV, etc., etc. In other words, performance of the study had no systematic errors (if the errors were random you would just need more people for an equally valid study). Finally, the (frequentist) notion of probability is correct. There are some real intellectual battles here, not the least of which involves the idea of that infinite population you had to imagine. I'll come back to these battles shortly.

The interpretation in this case is that the difference that was seen would be pretty uncommon if the probability of HIV infection were the same in both groups, where uncommon means only 4% of the time. It could happen but isn't likely (a Bayesian would say he/she wouldn't bet on it without some pretty good odds). But as Rind points out there are other complications here. There were those excluded people. This gives rise to all sorts of arguments over whether they should have been included in the analysis (called an "intention to treat" analysis because the intention was to vaccinate and count them until it was discovered they already were HIV positive) and whether doing several analyses where they were excluded or included means that you have to alter the calculation of the p-value because you are now making multiple comparisons and that changes things. For some of these problems there are no clear cut answers which is another reason among many why just because something has been subjected to a randomized clinical trial (RCT) it is not the answer to everything. RCTs are difficult to do and full of pitfalls in execution, analysis and interpretation.

Still, the logic of these studies is fairly simple: either a particular hypothesis is false (e.g., that there is no difference between vaccinated and unvaccinated groups); OR, something unusual has occurred. p-values are used to standardize studies so we can compare different reports. p-values that are small signify unusual events, where the definition of "unusual" is left to convention. In medical studies 5% is the most common definition, although other thresholds such as 1% are also used. If the outcome of the experiment is considered unusual in this sense, we say it is "statistically significant."

Unfortunately the word "significant" in "statistically significant" is frequently misunderstood in the more colloquial sense of "important." It doesn't mean that. It just means that something unusual occurred relative to what you thought would occur, given a particular hypothesis (e.g., that the vaccine doesn't make a difference). It's just the beginning, not the end of the process of putting the evidence from the study into some kind of context. That's where much of Rind's two posts concentrate their attention, and there is much to say about that, too, but this post is already too long. Still, I can't refrain from making one more comment prompted by his excellent piece.

The fundamental difficulties created by things like the meaning of probability or what randomization does in an RCT (we didn't talk about that and it is surprisingly controversial) create all sorts of confusions, some not even recognized by investigators adhering to different views or using statisticians who do. This comes as a surprise to most scientists. If you have studied statistics in textbooks you'd never know it, which is why Rind, too, was surprised:

"I don't usually think of "opinion" and "mathematics" falling in the same sentence, but perhaps that just shows my lack of expertise in theoretical math."

It's not a matter of expertise so much as the fact that statistics being a discipline riven by faction is a well kept secret. This isn't just about frequentists and Bayesians, but Fisherians and Neyman-Pearsonians, logical probabilists, likelihood advocates and more. In his book, Statistical Inference, Michael Oakes says this (h/t D.O.):

It is a common complaint of the scientist that his subject is in a state of crisis, but it is comparatively rare to find an appreciation of the fact that the discipline of statistics is similarly strife-torn. The typical reader of statistics textbooks could be forgiven for thinking that the logic and role of statistical inference are unproblematic and that the acquisition of suitable significance-testing recipes is all that is required of him. (Oakes, Statistical Inference, Epidemiological Resources, Inc., Chestnut Hill 1990)

Oakes goes on to quote a book review of a statistics text in a technical journal (the reviewer is Dusoir):

"A more fundamental criticism is that the book, as almost all other elementary statistics texts, presents statistics as if it were a body of coherent technical knowledge, like the principles of oscilloscope operation. In fact statistics is a collection of warring factions, with deep disagreements over fundamentals, and it seems dishonest not to point this out."

Both of Rind's posts bring up a large number of interesting issues like this. Read them here and here (links repeated for your convenience). At some point I hope I'll have time to take a further look at them.

Good post. I would add one additional point.

It's the problem of multiple parallel experiments. There have been dozens of HIV vaccine trials. All previous such trials have failed. At the p=.05 threshold for publication, you'd expect that even if none of the vaccines worked at all, 1/20th of those trials would show a positive result. So it's not at all surprising that this one trial showed a modest benefit. What *must* happen is that this particular trial be replicated successfully. You can't draw a reliable conclusion otherwise.
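The arithmetic behind this point can be sketched as follows. It assumes the trials are independent, that none of the vaccines works, and that each trial has a 5% chance of a false positive; the trial counts are hypothetical stand-ins for "dozens", not a tally of actual HIV vaccine trials.

```python
def prob_at_least_one_false_positive(n_trials, alpha=0.05):
    """Chance that at least one of n independent null trials crosses the threshold."""
    return 1 - (1 - alpha) ** n_trials

# Hypothetical trial counts, chosen only to illustrate the trend.
for n in (1, 12, 24, 36):
    print(n, round(prob_at_least_one_false_positive(n), 3))
```

With 20 such trials the chance of at least one "positive" is already about 64%, so a single nominally significant result among many attempts is unsurprising on its own.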

Harlan: Well, things start to get a little weird when you think this way, at least from the conventional frequentist perspective. If multiple experiments affect that p-value (which is what you are saying, in essence), then every time you do an experiment you have to go back and correct the p-values of all the other papers. Every study is considered part of one large experiment. Multiple comparisons is one of those things that seem to make sense until you start to think really hard about them. This is one of the subjects that need to be discussed further in Rind's posts.

I think statisticians are more sanguine than you make out. Sure there are different schools, but most of us get along, and are happy to be different. Or we just use whatever approach works for the problem.

P-values are a common cause of concern: we all know how awful they are, but this hasn't filtered down to most biologists (and this is partly our fault: we should probably be more pro-active). To be honest, this isn't a frequentist/Bayesian thing: one can calculate Bayesian p-values, and they suffer the same evils.

I don't think Harlan's comment implies a need to correct previous p-values. The p-value for each experiment is only addressing the statistical significance of that one experiment (assuming each experiment used a frequentist approach, as they almost certainly did).

I think Harlan is moving up a level and asking more about some implied p-value for all the experiments considered as a group. Something like: "If HIV vaccines don't work, what is the likelihood of seeing one positive trial out of dozens of total trials?" Basically a meta-analysis type of view.

Bob: I both agree and disagree. Most practicing statisticians don't think much about the underlying fundamental issues, any more than a lot of physicists think about quantum logic or the meaning of causality. And the textbooks and training aid and abet that. Today's statistics has two parents: the Golden Age of classical inference methods of Student, Fisher, Neyman, Pearson, etc.; and probability theory underpinnings of Kolmogorov and others. They are never melded well in training or in practice. There seems to be little relationship between the (rudimentary) math stats many biostatisticians learn and what they actually do (do they really care about measure theory? I don't think so, because it doesn't have any effect on what they do and they quickly forget it, if they ever understood it in the first place). The use of modern computing has enabled all sorts of sophisticated and complex operations and the lack of theoretical unity is producing cracks.

So while it may be true that for most biostatistical practice (design of experiments or helping some bench scientist do ANOVA with some post hoc tests) this isn't a big deal, for some things it is. We see this more and more in epidemiology where the differences affect high profile clinical trials and in the development of new theoretical methods to underpin epidemiology itself (which has had no theoretical basis), e.g., in so-called Bayesian networks. Physics seemed to be getting along well at the end of the 19th century, too, with only details left to be filled in. But it all came apart in 1900. We already see many discordant results between RCTs and between RCTs and observational studies and some of us think these anomalies are a signal of a theoretical crisis in epidemiology. We'll just have to see how this works out.

Bob O'H: I'm a biologist who teaches a little biostats course, and I try to emphasize to my students that statistics is an area of active research and controversy, with a variety of competing approaches to the same kind of questions. My understanding of most of the areas of controversy is too dim to really explain all sides, however, so maybe you can help me out. In the "modified intention to treat" analysis of the clinical trial under discussion, 74 out of 7325 people in the placebo group became infected with HIV, while 51 out of 7347 in the vaccine group became infected. A simple chi-squared test of independence of these numbers yields P=0.037; the assortment of tests used in the paper yield similar P-values of 0.03 to 0.05. What would be the non-awful, non-P-value way that you would analyze these data?
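For readers who want to check the chi-squared figure, here is a minimal sketch using only the standard library. It assumes the plain Pearson chi-squared test for a 2x2 table with no continuity correction, and it uses the fact that with 1 degree of freedom the upper-tail probability equals erfc(sqrt(chi2/2)). The function name is mine.

```python
from math import erfc, sqrt

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test of independence for the 2x2 table [[a, b], [c, d]].

    No continuity correction. Returns (chi-squared statistic, p-value on 1 df).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-squared with 1 df
    return chi2, p_value

# Rows: placebo, vaccine. Columns: infected, not infected.
chi2, p = chi_squared_2x2(74, 7325 - 74, 51, 7347 - 51)
print(round(chi2, 2), round(p, 3))
```

Run on the modified intention-to-treat counts, this gives a statistic of about 4.34 and a p-value of about 0.037, in line with the figure quoted in the comment.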

I'd calculate the odds ratio: [74/(7325-74)] / [51/(7347-51)] = 1.46. So the placebo group is 1.46 times as likely to become infected. Then present the 95% confidence interval: 1.007 to 2.13.

The question now is whether you think being twice as likely to be infected is important: if you don't, the conclusion is that there is no effect of practical significance. If you do, then you can conclude that there could be an effect of practical significance. I would imagine an odds ratio of 1.007 wouldn't be considered to be of practical significance, so these data nicely span the range of values where there may or may not be an interesting effect.

The Bayesian approach would be almost the same, except possibly adding 0.5 to each cell (or more if one has substantive information), and wouldn't have a big effect here.
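The odds ratio calculation above can be sketched as follows. The interval here is the simple Wald interval on the log odds scale, which is an assumption on my part: the comment does not say which method produced 1.007 to 2.13, and the Wald version comes out slightly narrower (roughly 1.02 to 2.09), so the quoted interval presumably comes from a different method. The `add` argument is a hypothetical parameter illustrating the "add 0.5 to each cell" adjustment.

```python
from math import exp, log, sqrt

def odds_ratio_with_ci(a, b, c, d, add=0.0, z=1.96):
    """Odds ratio (a/b)/(c/d) with a Wald 95% interval on the log scale.

    a, b: infected / not infected in the first group (placebo here)
    c, d: infected / not infected in the second group (vaccine here)
    add:  constant added to every cell (0.5 mimics the adjustment mentioned above)
    """
    a, b, c, d = (x + add for x in (a, b, c, d))
    odds_ratio = (a / b) / (c / d)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

cells = (74, 7325 - 74, 51, 7347 - 51)
print(odds_ratio_with_ci(*cells))           # unadjusted
print(odds_ratio_with_ci(*cells, add=0.5))  # with 0.5 added to each cell
```

With counts this large the 0.5 adjustment barely moves the estimate, which matches the remark that it "wouldn't have a big effect here."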

revere - I know that there can be differences between results with different approaches. But I'm not sure it'll lead to a revolution in statistics: I guess that has already happened (thanks to Intel, and a couple of physicists!). The revolution is more in the use of these techniques, rather than the theoretical underpinnings (well, unless someone can come up with reference priors that make everybody happy). I suspect it'll all get sorted out, after a bit of angst.

An excellent and readable explanation, including one of the better descriptions of frequentism for the nonspecialist I've seen. I can see you fantasizing about a world in which this can actually be explained to the average New York Times reader, who is so often presented with the word "proof" in a completely undefined and often misleading context.

I wonder if you've considered writing a piece like this for Slate, the New Yorker, or the like? The combination of your credentials, your writing ability, and a largely untold story ought to be pretty effective...

(Oh, and I notice the mammogram thing got killed. I can't help but wonder if the reason you haven't posted about this is because you're involved personally. Surely on the right side.)

moldbug: I've declined to get involved with the mammography issue here but it is a subject I know a great deal about from both the clinical and theoretical points of view. Like multiple comparisons it looks clean on the surface until you dig into it. It is one of the things that smart people take both sides on and each has something substantive to say. And some of it is not science, exactly, but more different utility functions to measure gain and loss from making a mistake. Not all of it, though. There are deep questions of science involved (and if you think about it, counterfactuals are key to the difficulties). Thank you for the kind words, though.

Bob O'H: I can see where giving the 95% confidence interval of the odds ratio would be a useful way of presenting the results of the vaccine study, especially if there were a minimum effect size that would be considered "of practical significance". I would hope that the cutoff for practical significance for something like a vaccine would be an explicit economic model of the cost of the vaccine vs. the benefit of the cases prevented, and not just an arbitrary round number like an odds ratio of 2.

Often in biology, we're more interested in knowing whether there is any effect at all; the size of the effect is less important. For example, I'm doing a study where I compare allele frequencies of a marine organism in 1983 with the allele frequencies in the same populations today. I want to know whether the allele frequencies have changed; I'm not as interested in how much they may have changed. For a question like that, would the classic frequentist approach, using P-values, be evil? I could see if the 95% confidence interval of the odds ratio excludes 1, but isn't that more or less equivalent to seeing if the P-value is less than 0.05?

"[A] Either a particular hypothesis is false... OR [B]something unusual [of probability p] has occurred." Perfect. Yet somehow in the minds of many (most?) practicing scientists this becomes "The probability that my result is due to chance alone is p."

There's a sort of laziness, disguised as "pure empiricism" that goes along with saying, "I don't bring any preconceptions; I just report my data, nicely summarized by this p-value." I'm all in favor of going with the data, but that needn't (and shouldn't) mean "only the data collected in the particular study which is the subject of this particular paper." What I'd like to see: without getting enmeshed in battles between competing schools of statistical interpretation, and without compromising their scientific detachment, authors could get in the habit of adding to their papers a few qualitative observations drawn from their experience or from the literature that could help a reader tease out which of the two branches above, [A] or [B], is more likely to be true.

Addressing the many scientists and science journalists who think "the probability the result is due to chance alone is p", I recently wrote a two-part post describing two simple cases, one in which the p-value is wildly different from the probability the result is due to chance alone, and one in which it gets it about right. And like everyone else, I chime in on the Thai HIV vaccine trial. I'll plug my post here:

http://badmomgoodmom.blogspot.com/2009/11/what-is-p-value-of-bullshit-p…

One point of clarification. In the usual set up, subjects do NOT have probabilities of infections. Each subject will or will not go on to get an infection, so infection status is a latent (unobserved at baseline) variable waiting to be observed once the trial starts. The strong null hypothesis is that the probability of infection is the same in each group. Groups have probabilities; subjects do not. The probability stems from the random allocation of subjects to groups.

Also, we can all agree that p-values are horrible. But count me out of that "we". Can somebody please explain to me what is so wrong with p-values per se, as opposed to 1) unthinkingly always using 0.05 as an alpha level and 2) confusing the p-value with the probability that the finding is true? Given that the party with the greatest vested interest in the outcome of the experiment is the party conducting that experiment, and given that discretion is routinely used to swing results towards the desired direction, it is entirely reasonable to require more than the appearance of benefit. Something more is needed, and one way to operationalize this additional hurdle is with the modest and humble p-value.

Vance: Regarding your first comment, note what I said:

The interpretation in this case is that the difference that was seen would be pretty uncommon if the probability of HIV infection were the same in both groups, where uncommon means only 4% of the time

So we agree. I had hoped it was clear.

The question of p-values has been hotly debated in the literature. The usually advanced alternative is the confidence interval (although here again the usual debates appear as to what they mean), but if you only use a 95% confidence interval to see if the null value is included you might as well use p<.05. But the CI also tells you more about the precision of the estimate via its width. However maybe even better is a p-value function. Or not. I know just by suggesting any of these things I should be prepared to duck. While practicing statisticians may say these are theoretical (aka irrelevant) side issues, even an innocuous post like this one is generating heat.

BobOH: I don't think it's statistics that will get revolutionized but epidemiology and other sciences where controlled observation rather than experiment provides the data.

Beware of Simpson's Paradox, the jaws that bite, the claws that catch. Beware Lindley's paradox, and shun the frumious Einstein-Bohr debate.

Yanother: Well both Simpson and Einstein-Bohr have great attraction for me. Lindley, not so much, as I don't get involved much in the frequentist/Bayesian debate. I'm interested in lattices, so I guess that explains my weaknesses.

Bob O'H:
Bayesian p-values do not suffer from the same problem as frequentist ones, mainly because p-values are used to answer the question "how good is the null hypothesis?", whereas in the frequentist approach they tell you how likely it is to observe the data if the model (the null hypothesis) is true, which is obviously not the same thing. Strictly, frequentist p-values tell you how likely it is to observe data at least as extreme as those actually observed, so they do not obey the likelihood principle, which is at the heart of the foundations of statistics. So frequentist p-values do not make a consistent, I mean logical, use of probability.

Often in biology, we're more interested in knowing whether there is any effect at all; the size of the effect is less important. For example, I'm doing a study where I compare allele frequencies of a marine organism in 1983 with the allele frequencies in the same populations today. I want to know whether the allele frequencies have changed; I'm not as interested in how much they may have changed. For a question like that, would the classic frequentist approach, using P-values, be evil? I could see if the 95% confidence interval of the odds ratio excludes 1, but isn't that more or less equivalent to seeing if the P-value is less than 0.05?

If that genuinely is your question, I've got good news for you: getting the answer is cheap and quick. Just ask yourself what is the chance that the allele frequencies are exactly the same as they were in 1983? The probability is negligible. No PCR needed!

All your significance tests do is tell you whether you have enough data to show the difference you know a priori almost certainly exists. I wrote a blog post about this last year ('why p-values are evil').

Esa Läärä had a paper in Annales Zoologici Fennici this year where he argues against the silly null hypothesis. It looks like you're a victim of the problem: null hypothesis thinking is so pervasive in science that it's affecting the scientific questions we ask. Quite frankly, who cares if some allele frequencies have changed slightly? Isn't it more important to ask what factors are influencing the changes - drift, migration, selection? That's where the biology is.

Sorry for not providing any links - I'm replying on my iPhone and my Apple-Fu isn't well enough developed. And I can't work out how to scroll down long comment boxes either.

Could be worse - my wife is reduced to playing mah Jong on her iPhone, rather than blogging. Hopefully the wifi man will turn up in the next hour.

Revere

Thanks for this, this is such an interesting and educational discussion. I see David has now written another post on biostatistics, focusing on confidence intervals.

If it's not too much trouble, can you correct the spelling of my first name? Thx.

Yikes. Sorry. That was really a typo and not a mistake. I know how to spell your name. I was thinking about what I was going to say and not paying attention to what I just said/typed. As for confidence intervals, I'll see. There were a couple of things in the previous posts that could also use comment (the plausibility question is important and rarely discussed) and confidence intervals are another subject where the bricks start flying. I'm not so sure I want to get too deeply enmeshed in a subject that isn't the main focus of this blog, but I'll think about it. I did it this time because I was tired and had time on my hands (in other words I was procrastinating about the huge pile of urgent things in front of me and this was pretty easy to do).

As Bob O'H suggests above in 3, I think revere somewhat overstates the seriousness of factionalism in statistics for its day-to-day operation. Hardcore Bayesians and Frequentists may exist, but I would assume that they generally do so at a safe distance from actual data and data-generating scientists :-)

Granted, on a conceptual level, these are serious differences. However, even there the underlying assumptions and consequences, as well as the zones of agreement and disagreement are quite well mapped out. Reasonable analyses of real data will generally arrive at the same conclusions, regardless of the philosophical underpinnings; the clearer the message and the larger the data set, the closer the agreement. The closer we move to scientific consensus with our aggregated data, the less it matters.

Statistics has been fantastically successful during the last 100+ years (and that includes the p-value, like it or not). That not all of the tools shape up for some of the complex, highly confounded situations found in modern epidemiology, or for the noisy high-dimensionality of gene expression data, is really unsurprising, given that they were developed for very different research problems and data environments (e.g. agricultural experimentation in the 1920s). It's more surprising how far we actually get with the old tools, and how a proper understanding of the foundations of statistics allows their adaptation to new situations.

One deeply felt frustration with this process is that it always seems to take forever. It's a classical software/hardware gap: by the time we have figured out how to deal with new kinds of data, technology has moved on, and there are even bigger and more complex data sets waiting. Personally, I blame cultural differences - starting with review times in statistics journals, but also the apparently unconquerable urge of statisticians to invent their own method, regardless of how equivalent it is going to be to existing ones; getting everybody's favorite pet published and sorted out in the greater context of things takes serious time. Maybe this pig-headed insistence on having many different paths to the same destination is a consequence (and a hidden cost) of working in a field resting on such varied fundaments? Of course, :-)

I'd also like to suggest that there is no active conspiracy to keep these things from non-statisticians (or at least I have not been invited). That introductions to statistics often present the field as monolithic and divinely derived from first principles may partly be explained by the early mathematical training of the practitioners - it's a hard habit to shake. The other side is that time and money are limited: so are statistics courses (and the supporting literature), but also the interest of the subject-matter scientists, like biologists and even some epidemiologists. Most people are quite happy to work with a handful of methods that they are comfortable with, and would rather throw a couple of extra cell lines on the grill (so to speak) than optimize their statistics. [And if we think about it in terms of software/hardware, this may be a totally legitimate strategy, corresponding to buying a bigger computer instead of re-writing someone else's software]. These people are generally not so interested in the foundational aspects of likelihood approaches etc.

Naturally, there are also those who are deeply invested in these questions. These can be among the most satisfying collaborations for a statistician.

"If we wanted a better idea of the coin's "true" probability of coming up heads we'd flip it many more times, say 1000. Or 10,000. The more times we flip it the better fix we get on the true probability of it coming up heads (which for a coin that was truly "fair" would be 50%)."

Not true.

If you took 256 perfect coins and flipped each 10000 times you would have 256 realizations of a 10000 step random walk. Many of the realizations would have an almost equal number of H/T. Some would have H - T > 0 (or H - T < 0) by a considerable amount (maybe on average |H - T| = 100). One might imagine using the value of H - T for each coin as a measure of the fairness of the coin?

A second run of the 256 x 10000 test would reveal a result (set of values of H - T) quite similar to that of the first. But the coins (realizations) responsible for the outliers would very likely be different in the two runs.

It is possible for a single realization to be a very poor measure of fairness of the coin involved. You need a more sophisticated test.
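
RAG's thought experiment is easy to try. What follows is a minimal Python sketch, not anything posted in the thread: the function name `head_tail_differences` and the seeds are made up for illustration, and the coins are assumed to be exactly fair. It reports both |H - T|, which stays large (its standard deviation is sqrt(10000) = 100, so the average of |H - T| comes out near 80), and the deviation of the observed frequency H/n from 0.5, which equals |H - T| / (2n) and is therefore small.

```python
import random
import statistics

N_COINS = 256
N_FLIPS = 10_000

def head_tail_differences(n_coins=N_COINS, n_flips=N_FLIPS, seed=0):
    """Return H - T for each of n_coins simulated fair coins flipped n_flips times."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_coins):
        # Each bit of a random n_flips-bit integer is one fair flip (1 = heads).
        heads = bin(rng.getrandbits(n_flips)).count("1")
        diffs.append(2 * heads - n_flips)  # H - T = H - (n - H)
    return diffs

# Two independent runs of the 256 x 10000 experiment (seeds are arbitrary).
run1 = head_tail_differences(seed=1)
run2 = head_tail_differences(seed=2)

for name, diffs in (("run 1", run1), ("run 2", run2)):
    mean_abs = statistics.fmean(abs(d) for d in diffs)
    worst = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
    freq_error = abs(diffs[worst]) / (2 * N_FLIPS)  # |H/n - 0.5| for that coin
    print(f"{name}: mean |H-T| = {mean_abs:.1f}, "
          f"largest |H-T| = {abs(diffs[worst])} (coin #{worst}), "
          f"its |H/n - 0.5| = {freq_error:.4f}")
```

Under these assumptions the raw difference H - T does not shrink as n grows, while the relative frequency H/n does settle toward 0.5, and the coin with the largest excess will typically be a different one in each run.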

Alexander: Granted, on a conceptual level, these are serious differences. However, even there the underlying assumptions and consequences, as well as the zones of agreement and disagreement are quite well mapped out.

I agree there isn't an active conspiracy to keep these from consumers of statistical method (like David Rind and myself). The problem I was pointing to is that the fact that there are differences is not known and even for a very sophisticated user like Rind, "surprising." And it's not just Bayesianism and frequentism but likelihood and now causal graphical methods and Bayesian networks and a lot of other stuff. You are absolutely correct that the problems that require statistical methods have changed greatly since Fisher and experimental design in agriculture, but the methods haven't changed that much and the ones that have are the ones where there are controversies.

I don't let epidemiologists off the hook. We are highly skilled in thinking about sources of bias but many of my colleagues come close to being non-numerate. The smart ones let statisticians do their statistical work for them but professional statistics (especially biostatistics) has another problem. It is severely understaffed and overworked. Whether they are writing up methods sections for new grants or "servicing" existing ones, most of the academic biostatisticians I know are working long hours and not getting any of their own work done -- assuming they even have research work of their own.

This post was meant to add some observations to Rind's post. There was really much more to say about everything he talked about and I just picked some low hanging fruit.

RAG: No. The coin flipping example was to illustrate the frequentist notion of what probability means. For a frequentist it is a limiting value, which for a fair coin is .5. You need to let n go to infinity and see if the frequency converges to a limit. If that limit exists it is the probability of the coin landing H or T, which for a fair coin is .5. That's just for a frequentist. There are a number of other notions of what probability could mean, including the strict Bayesians and the Popperian propensity advocates and others. It is an idealization that can't be realized in the real world, so your examples are not relevant to the point being made.

Revere, before saying RAG's points are not relevant, you may want to note that, for those of us not sophisticated in statistics RAG is in fact elucidating one position quite helpfully. I am wondering if you would mind giving some Bayesian (and, if time, perhaps Popperian or other) example(s)/explications, preferably parallel to RAG's coin, so we may see how the reasoning/concepts differ among these models. My own studies were mostly group theory and philosophy, so some of this seems very strange (that is, my response to RAG's example becomes: If they come up .76, say, then we'd probably look for the coin's imperfections--or else: Well but the next 10000 tosses could still come up all T. The real-world finding of probabilities thus drops out). But it would be good, for those of us not statisticians, to get a clearer sense of the substance of this debate.

Paula: I probably didn't spell it out enough. The issue with the frequentist notion of probability is that it isn't realizable. It supposes that the "true" probability is a limiting value as n goes to infinity. So it's an abstraction. It's not the only notion of probability you could have. You might, for example, use the Bayesian notion of subjective probability (what odds would you be willing to give in a fair bet?) or the idea that it represents a propensity to come up one way or another. All those are different from the frequentist perspective. So there is no actual realization involved. That's why I said it wasn't relevant what any particular realization would do. We are talking about the meaning of probability, an abstract notion.
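
For a Bayesian parallel to the coin, here is a minimal Python sketch. It assumes a uniform Beta(1, 1) prior on the probability of heads, which is only one of many priors a Bayesian might pick; the helper names and the batches of flips are hypothetical. With that prior the posterior after each batch is again a Beta distribution, so "probability" here is a degree of belief that gets revised as data arrive, not a limiting frequency.

```python
import math

def beta_update(a, b, heads, tails):
    """Conjugate update: a Beta(a, b) prior plus observed flips gives a Beta posterior."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Assumed starting belief: every value of P(heads) equally plausible.
a, b = 1, 1

# Hypothetical batches of (heads, tails), invented for illustration.
for heads, tails in [(7, 3), (52, 38), (495, 505)]:
    a, b = beta_update(a, b, heads, tails)
    print(f"after {heads} H / {tails} T more: "
          f"belief about P(heads) = {beta_mean(a, b):.3f} +/- {beta_sd(a, b):.3f}")
```

The spread of the belief narrows as flips accumulate, but at every stage the answer is a statement about what the observer believes given the prior and the data seen so far.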

Thanks, Revere. Yes, the point's very clear re frequentist concept. Guess I should have read the whole discussion more carefully before posting.

There need not be reference to any limit to define a valid notion of frequentist probability, and I would humbly suggest that we are using the wrong paradigm anyway. No data set ever did, or ever will, have a normal distribution, so let us put that fantasy notion aside and, as Dr. Phil might say, get real. What is real is the random allocation of subjects to treatment groups. This finite set of possible realizations of the randomization process actually used allows for valid frequentist probability statements. In other words, use randomization as the basis for inference: exact design-based permutation tests rather than unjustifiable approximations to them in the form of parametric analyses.
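
The randomization-based idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `exact_randomization_p` is hypothetical, the outcome values are invented, the test statistic is the absolute difference in group means, and every way of splitting the subjects into groups of the observed sizes is assumed to have been equally likely.

```python
from itertools import combinations

def exact_randomization_p(treated, control):
    """Two-sided exact p-value: the share of all possible allocations whose
    absolute difference in group means is at least as large as the observed one."""
    pooled = list(treated) + list(control)
    n_t, n_c = len(treated), len(control)
    total = sum(pooled)
    observed = abs(sum(treated) / n_t - sum(control) / n_c)

    allocations = 0
    as_extreme = 0
    # Enumerate every way the n_t "treated" labels could have been assigned.
    for idx in combinations(range(len(pooled)), n_t):
        t_sum = sum(pooled[i] for i in idx)
        diff = abs(t_sum / n_t - (total - t_sum) / n_c)
        allocations += 1
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            as_extreme += 1
    return as_extreme / allocations

# Hypothetical outcomes for a tiny two-arm trial (made-up numbers).
treated = [12, 15, 11, 18, 14]
control = [9, 10, 13, 8, 11]
print("exact randomization p-value:", exact_randomization_p(treated, control))
```

No distributional assumption enters here; the only probability model is the allocation mechanism itself. Full enumeration is feasible only for small groups, so larger trials would usually sample allocations at random instead.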

Vance: True, there are different flavors of frequentists just as there are different kinds of Bayesians, etc. But the limit concept is the meaning of probability from the frequentist point of view. It is an abstraction, like a real number (which also doesn't exist in the real world). As for randomization, that's done with some apparatus (if by a computer, it is a pseudorandom number) so that's also an idealization. Whether you have a "need" to reference limits will depend on what else you believe or even just what you are willing to settle for.

Revere, what you say is certainly true. Still, I do not in any way equate 1) treating our randomization as truly random with 2) treating the data as normally distributed. That the randomization is actually "pseudo" is true, but almost a technicality. This fact can be completely ignored, except for the possibility of selection bias arising from the prediction of future allocations. This selection bias is an important (and all-too-often ignored) consideration, but to my mind it is not the same issue that we are discussing. Maybe it is, but I am having a hard time seeing the connection.

Vance: I guess there's a lot of metaphysics here to contend with (in the sense of what exists). Real numbers don't exist in the real world, either, at least I haven't seen any. But as for how much the normal distribution "exists" in the real world, there is the little matter of the Central Limit Theorem which suggests why we would expect normal distributions and also the Boltzmann Distribution. The question of pseudorandom numbers is, I agree, rarely of practical significance (although there have been instances where it was suspected of being a problem when the seed didn't change). But if the question is one of existence or just approximations to it, it's all or none. It's not a technicality. But I know where you're coming from and I am discussing things that are rarely determinative. But I do think epidemiology is in theoretical trouble and a deep re-thinking may be coming.

I love these posts, thank you so much. I think these questions are extremely relevant to how we think of the world, the questions we think we should ask when doing science and even how these relate to our ethical axioms or foundations. At least it makes me think about it...
