Thinking about confidence intervals

Like David Rind over at Evidence in Medicine I'm a consumer of statistics, not a statistician. However as an epidemiologist my viewpoint is sometimes a bit different from a clinician's. As a pragmatic consumer, Rind resists being pegged as a frequentist or a Bayesian or any other dogmatic statistical school, which is wise. Let the record show most practicing statisticians are similarly pragmatic, as was the great R. A. Fisher, who thought there was a place for different viewpoints in different contexts, even though Fisher was famous for his argumentative and contentious manner (people are complicated). Having made those disclaimers (and without claiming any statistical allegiance of my own), the picture of confidence intervals Rind presents in another excellent post, "Interpreting Confidence Intervals and Grading Recommendations," is thoroughly frequentist and within that, clearly identifiable with the Neyman-Pearson school. Having thrown all those names around, I guess I'll have to explain them (frankly, his post is just a pretext for me to make some observations about things that interest me or I want to think more about by explaining them in writing).

First, Neyman-Pearson, the names of two great pioneers in statistics, Jerzy Neyman and Egon Pearson (Egon was Karl Pearson's son; the father, too, was a great pioneer of statistics). If this post isn't going to be of inordinate length and complexity, I'll have to simplify this, but in essence, Neyman-Pearson viewed probability in dogmatic frequentist terms (see our previous post) and, in addition, thought that the practice of statistics was for decision making. Here's the part of Rind's post that pegs him as using the Neyman-Pearson framework:

Again using the HIV vaccine example from my last post, we can look at the same parameter that had a p value of 0.04, but now examine the point estimate of efficacy (31.2%) and its 95% CI (1.1 to 52.1%). What is it that has a 95% chance of being true, given that CI of 1.1 to 52.1%?


However, despite recognizing what the CI [confidence interval] really does and does not tell us, consumer of biostatistics that I am, I (and others) approach CIs operationally: we choose to interpret the CI as a range of values with which the data are reasonably compatible, and to interpret values outside the CI as reasonably incompatible with the data. So, other things being equal, I would say that a vaccine efficacy of 5% was compatible with the results of the NEJM study, while a vaccine efficacy of 60% was not. This does not mean that I think the study has excluded the possibility of the vaccine having 60% efficacy, just that this would be unusual under the plays of chance.

This operational definition works as I decide how to write recommendations about whether to administer such a vaccine. If the vaccine truly had an efficacy of 31% I would likely recommend wide use in high risk patients. If the high end of the CI were true (52% efficacy), I might recommend universal vaccination. If the low end were true (1% efficacy) I would probably recommend leaving it on the shelf. Looking at this, I can quickly realize that if this trial were the only information I had about vaccine efficacy then I have inadequately precise data to support whatever recommendation (or set of recommendations) I might want to make about administering HIV vaccine. (David Rind, Evidence in Medicine)

Notice that Rind is explicitly using the confidence interval (CI) for decision making. This goes beyond the p-value, although it sounds the same on the surface. The logic of the p-value was that either the null-hypothesis was false OR something very unusual happened. But in the Neyman-Pearson world a scientist looks at the evidence (the data) for the purpose of making one of two decisions: accept the hypothesis under consideration (e.g., that the vaccine makes no difference) or reject it (that the vaccine is effective in preventing HIV infection). p-values allow you to reject a hypothesis but not accept one. But in the Neyman-Pearson framework you do accept and reject and when you do you can make a mistake: accept a hypothesis when it is false or reject a hypothesis when it is true. That's how Rind winds up talking elsewhere in his post about sensitivity and specificity, which are measures of the same kinds of mistakes in the screening world (not exactly the same but convertible to them by subtracting from one; this isn't important but I know someone will call me on it in the comments so I'm being pro-active). The point of Neyman-Pearson statistical methods is to balance those two kinds of mistakes to get the optimum, where what is optimum has to be specified in terms of the kinds of trade-offs you are willing to make. Neyman-Pearson try to maximize the probability of rejecting a false hypothesis once you have stipulated what probability you are willing to tolerate for rejecting a true one (usually a small number like 5%).

The reason this is a balancing act is the same reason that balancing sensitivity and specificity of a screening test is a balancing act. Sensitivity is the proportion of true cases your test picks up. You can always make it 100% by saying all people tested have the disease. That way everyone who has the disease will be picked up and the sensitivity will be 100%. Likewise you could make it certain (probability = 100%) that every true hypothesis is labeled true by your statistical test just by accepting all data as valid evidence of the truth of the hypothesis. But you see the problem. In these cases you'll either be saying people who don't have disease do or that hypotheses which are not true, are. Unless the evidence is overwhelming and obvious (in which case you don't need a statistician), you have to compromise. What you have to do is name some kind of level of certainty you want to get to (e.g., pick up 95% of people with disease or 95% of true hypotheses) and then see what that means for how many people without disease you falsely pick up or how likely it is you will say an alternative hypothesis is false when it isn't. To do that you also have to stipulate one or more alternative hypotheses, which isn't required for a p-value. For a p-value you don't have to say how the null hypothesis is false. But to minimize the probability of accepting a false hypothesis you do, and this requires you to make some assumptions about the alternatives. So the result is that when you do your statistical test you are sorting hypotheses into one of two mutually exclusive regions, acceptance or rejection, based on all the alternative hypotheses of interest.
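To make the balancing act concrete, here is a minimal sketch in Python. It assumes a setting that is not in Rind's post: a one-sided z-test of "the mean is zero" against one stipulated alternative, with a known standard deviation. The function name and the numbers (effect 0.5, sigma 1.0, n 25) are made up for illustration. The only point is that once you name the alternative, each choice of alpha (the probability of rejecting a true hypothesis) fixes beta (the probability of accepting a false one).

```python
from statistics import NormalDist


def power_one_sided_z(alpha, effect, sigma, n):
    """Probability of rejecting 'mean = 0' when the true mean is `effect`.

    Hypothetical helper for illustration: one-sided z-test, known sigma.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)      # rejection threshold on the z scale
    shift = effect * n ** 0.5 / sigma    # how far the alternative moves the test statistic
    return 1 - z.cdf(critical - shift)


# Tightening alpha (fewer false rejections) raises beta (more false acceptances).
for alpha in (0.10, 0.05, 0.01):
    beta = 1 - power_one_sided_z(alpha, effect=0.5, sigma=1.0, n=25)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}")
```

Note that the calculation cannot even be started without the `effect` argument, which is the code's way of saying you have to stipulate an alternative.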

None of this seems to have much to do with confidence intervals, but it is the framework that underlies Rind's explanation. Like his version of p-value, he gets it exactly right, which is impressive as I doubt that 1 in 100 clinicians or even 1 in 10 epidemiologists can give the correct version of a confidence interval in terms of a coverage probability, which is what he manages to do. He comments that it is hard to explain, and one of the reasons is that the confidence interval is about probabilities but for a frequentist probabilities are "long run" (as in forever) frequencies of repeated trials. Which brings us to the real point of confidence intervals. They are used principally for estimating something, i.e., using the data to infer what some unknown quantity (like vaccine efficacy) might be. That's some number (say 31.2%). It doesn't have a probability nor does it make sense to say its probability of being in a certain interval (in this case between 1.1% and 52.1%) is 95%. Either it is (100%) or it isn't (0%). It's like if you were to ask a shopkeeper for the price of something and she said it was somewhere between 10 cents and a dollar with 95% probability. It has a price, not a probability. So what is the 95% referring to?

For the Neyman-Pearson statistical framework (which is what is overwhelmingly used), it means that if you were to draw repeated samples from the population of people we want to protect with a vaccine and subject them to a randomized clinical trial (and assuming no glitches like a bad lot of vaccine or people starting out already infected or people dropping out or the vaccine being administered wrongly, etc., etc.) and you looked at each set of data generated that way and applied the appropriate statistical procedure to generate a confidence interval, then 95% of the time (in the long run) the zillions of intervals would include the true (the actual) vaccine efficacy (this is frequentist talk; Bayesians don't think the outcome of the experiment is a random variable but the parameters of the process that generate it are). In other words, you have to imagine zillions of confidence intervals being generated by zillions of (imaginary) clinical trials just like the one you have in front of you. Since you can't do these zillions of trials in the real world, you have to make some assumptions about the idealized random mechanisms that are generating the data (e.g., that it's like coin flipping or tossing a biased die or whatever). That allows you to make calculations with the help of some data from your trial. Not only calculations, but decisions, and that's how Rind is using it. Neyman-Pearson would call that inductive behavior.
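You can't run zillions of trials, but a computer can fake them. The sketch below is a toy version of the thought experiment, under assumptions of my own choosing: the "trial" is reduced to estimating a single proportion (not vaccine efficacy), the data really are coin flips with a known true value, and the interval is the simple Wald interval. The names `wald_interval` and `coverage` and the numbers are illustrative only.

```python
import random
from statistics import NormalDist


def wald_interval(successes, n, level=0.95):
    """Simple (Wald) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - z * se, p_hat + z * se


def coverage(true_p, n, trials, seed=0):
    """Fraction of simulated trials whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One imaginary trial: n coin flips with success probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        lo, hi = wald_interval(successes, n)
        hits += lo <= true_p <= hi
    return hits / trials


# Should land near 0.95, though not exactly on it.
print(coverage(true_p=0.3, n=500, trials=2000))
```

Each simulated interval either contains 0.3 or it doesn't. The 95% describes the procedure across all the repetitions, not any one interval.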

This is a long explanation and it comes with two provisos. The first is both defensive and for purposes of warning everyone. This is tough stuff and it's not only hard to understand but also easy to make a mistake when explaining it. I've more than once in my long explaining career (and that's a good part of being a Professor) reversed things either accidentally or because I got muddled, so this one comes with no warranties. One of my colleagues is fond of saying, "Real peer review comes after publication." Consider this publication. The other is that there are lots of ways to explain it and statisticians are a contentious lot who either take umbrage at what you included or failed to include or fasten on technicalities that aren't of much consequence to the ideas for most consumers of statistical methods. I hesitated to even post about this, knowing the high probability (in this case in the Bayesian sense) that I would get burned in some way. But I like to write about it and I do it because writing is also thinking, so it helps me, not just when I get it right but when I get it wrong.

That's an open invitation to criticize and/or correct me (or feel free to tell me I'm a genius). In the meantime, take a look at David Rind's post to get a clinician's take on confidence intervals.



Hi Revere, thanks for this and all the other posts, in particular those on statistics. I've been reading them with great interest and they helped me clear up some of the muddle of a non-specialist :-).
On the matter of the p-value, you've made it clear that one of the most frequent mistakes is using the p-value as an estimate of the "importance" of a result. Is it not another, and related, mistake to think that the p-value is a statement about the hypothesis while it is in fact a statement about the sample? That it cannot be directly used to reject an explanation, but only to compare one explanation to the other possible ones, and then decide?

When I'm teaching my Bayesian course (with the acronym BAfFLs), I sometimes use the precise definition of a frequentist confidence interval as part of my argument for why the Bayesian approach to statistics is conceptually simpler. For me, the key point is that frequentists don't estimate the parameters, but only statistics, and have to make a further assumption that these statistics can be equated to the underlying parameter. It makes me wonder about the uncertainty in the parameter: frequentism doesn't seem to allow for that.

The point of Neyman-Pearson statistical methods is to balance those two kinds of mistakes to get the optimum, where what is optimum has to be specified in terms of the kinds of trade-offs you are willing to make.

I would disagree with this: it's the approach a decision theory approach would take, but the N-P approach doesn't specify any loss function. It just focusses on the null hypothesis, and doesn't take into account the risk of an error with the alternative hypothesis.

this is frequentist talk; Bayesians don't think the outcome of the experiment is a random variable but the parameters of the process that generate it are

It's more subtle than this: Bayesians do think it's a random variable (in the sense that it's a stochastic process), but that it's known, so it is conditioned on.

I really should look into the likelihood school more: they accept the likelihood principle, and seem to be somewhere between frequentist and Bayesian thought.

P.S. you're a genius.

Great post. I'm biased, but I think we need to discuss more openly and more often the concepts that underlie both medical and public health decision-making.

I do think it's extremely important to be able to say "these are the values compatible with the data," because, like it or not, most consumers of epidemiologic data aren't trained in the methods of data analysis.

I'd only like to add one thing: the 95% CI only tells us what conclusions are compatible with the data in terms of random chance; given the underlying assumptions about how the data are generated, here is what is compatible. Some of the misplaced hubris in medicine and public health (and science in general) certainly comes from the mistaken leap that a 95% CI can be interpreted in the more generous, subjectivist way. To get there we need to add sensitivity analyses, and see what happens when our underlying assumptions break down. People take this for granted when assessing model results, but forget that all data analysis is a form of modeling.

silphion: Well, p-values are about hypotheses (most cleanly, the null). They say that either the null is false (you reject it) or your sample is really unusual, where unusual means what you say it is (conventionally in biology, less than a 5% probability of happening). If you want to do hypothesis testing, then you have a hypothesis and alternatives. You accept or reject depending on what region the test statistic falls in. The biggest mistake with p-values is to mistake failure to reject the null for acceptance of the null, an egregious error.

BobOH; Fisher's complaint about N-P (one of them, anyway) is that it regarded things in terms of quality control or decision theory. That's historical, perhaps, but that's the origin. The loss function is not explicit but is embodied in the choice of alpha and beta. At least that's the way I see it.

You brought up an interesting point that made me think. I tend to think of frequentist methods as assuming there is a real, true value for the parameter "out there" and the job is to guess it (estimate it) as well as possible. It is fixed. It isn't a random variable. While for Bayesians, the parameter is not fixed but is a random variable. It's the data that are fixed. You phrased this as conditioning on the outcome, which amounts to the same thing mathematically but not metaphysically. Interesting.

As for the "PS: You're a genius," first, so as not to be ungracious (even though I know you are kidding), thanks. Second, as a proposition it exists but has measure zero, at least based on the evidence.

Ryan: I agree with what you say. I very frequently talk about this with biostatisticians and they are mostly not interested as they are so busy doing statistics as a service that the questions that interest me don't help them do their job. But the consumers of statistics need to be made more aware that statistics isn't just pushing a button on a computer. It takes real skill, judgment and there are options and controversies. That was mainly the point of my first post on p-values, I guess. Although I think the real point was I like to write about it so as to help me to think better about what I am doing.

As for likelihood methods, they are very attractive. By personal historical accident, the first probability course I ever had was taught by a very green (perhaps his first year as an Assistant Professor) Richard Royall. Long time ago, and I was already an MD and didn't appreciate what I was getting (although I think the course was pretty conventional).

Another good one. I had no idea there was a name for this:

But in the Neyman-Pearson framework you do accept and reject and when you do you can make a mistake: accept a hypothesis when it is false or reject a hypothesis when it is true. [...] The point of Neyman-Pearson statistical methods is to balance those two kinds of mistakes to get the optimum, where what is optimum has to be specified in terms of the kinds of trade-offs you are willing to make.

One of the arguments made often by corporate stooges who are critical of our responsible and scientific public-health authorities is the strange way in which current law requires FDA to approach this problem.

Consider the question of efficacy testing for a new cancer drug. Say a Type 1 mistake means an effective drug is not approved, and a Type 2 mistake means an ineffective drug is approved.

It is immediately clear that the Type 1 mistake results in great suffering, whereas the Type 2 mistake results in at most economic waste. Furthermore, most of the economic waste is the result of the high cost of the drug, which is the result of the onerous approval process - designed to prevent Type 2 errors.

But we see that the Neyman-Pearson analysis is strongly biased in the opposite direction. If the Type 2 error is relatively harmless whereas the Type 1 error is lethal, the confidence interval should preclude any Type 1 errors at the expense of tolerating a few Type 2 errors. Instead, we see a process that appears to tolerate Type 1 errors and rigorously preclude Type 2 errors.

In plain English, the dials on this process appear set to allow absolutely zero approved quackery, at the expense of systematically preventing many successful treatments. This seems to match the professional interests of doctors and medical regulators quite well; the therapeutic interests of patients, less well. An optimality analysis would suggest that, instead, questionable cancer drugs should be banned if they can be proven ineffective - n'est ce pas?

Now, for efficacy confirmation of a cancer drug, the Neyman-Pearson dials appear to be set quite wrong. For safety confirmation of an OTC painkiller - they appear set quite right. My question is why they are not set differently in these cases. Do you attribute this to some rational process, or is it just an artifact of Congressional history?

wouldn't it be better to give deviations instead of confidence-intervals ?

wouldn't it be better to give expectation value and
deviation of the author's subjective estimate after
considering all factors and not that one special
study design, which often is poorly communicated
or ambiguous or prone to bias ?

I've been searching for an answer to a question about confidence intervals that no one seems to discuss: "do confidence intervals only address random sampling error or do they address other random measurement error?" Given the way CIs are usually calculated it would seem to me that they generally only address random sampling error, yet there is often a lot of other random measurement error in epidemiological studies. Am I missing something here?

Rod: I think it depends on how the CI is calculated. We haven't talked at all about the error structure, i.e., what is the distribution of the error term and what goes into it. There are many ways to calculate a CI. We addressed what it meant once calculated. As a coverage probability it means that 95% of the calculated intervals cover the true value during repeated sampling (the frequentist perspective).

revere: thanks for the rapid response. I am however still confused. Let's say we calculate a CI for a proportion (or for a difference between two proportions) using p(1-p)/n to calculate the SE(p). Does this approach only address random sampling error or does it also address random error in the measurement of the outcome and, in the case of a CI for the difference between two proportions, does it address random error in the allocation of people to the two populations being compared? I hope my question makes sense.
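For readers who haven't seen the calculation being asked about, here is a minimal sketch in Python. It assumes the usual Wald standard error, the square root of p(1-p)/n, and for a difference of two independent proportions the square root of the sum of the two such terms. The function names and the counts (30 of 100 versus 45 of 100) are hypothetical.

```python
from statistics import NormalDist


def proportion_ci(events, n, level=0.95):
    """Wald interval for a single proportion: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = events / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - z * se, p_hat + z * se


def difference_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Wald interval for the difference of two independent proportions."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_a, p_b = events_a / n_a, events_b / n_b
    # The variances of the two independent estimates add.
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    diff = p_a - p_b
    return diff - z * se, diff + z * se


print(proportion_ci(30, 100))
print(difference_ci(30, 100, 45, 100))
```

The only inputs are the observed counts and the sample sizes. Nothing about how the outcome was measured enters the formula separately, which is what makes the question a fair one.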

Rod; Like you, I am a consumer of statistics and not a statistician. But here is my provisional answer. The idea is that you are estimating an unknown quantity (this is the frequentist perspective). It is fixed but in the population sample the measurement varies for lots of reasons: random perturbations from the environment, measurement error, etc. That variability shows up as sample variability. So the CI is measuring the sum total of all those random effects that obscure the underlying parameter you are trying to fix. Others should feel free to chime in.

Revere; that was originally my understanding as well - that CIs measure the sum total of these random effects. But whereas I can understand how a SE derived from actual data could be describing the sum total of random errors, it is less obvious when the SE is derived from p(1-p). It is the sort of question an inquisitive student asks and I have no explanation.

Rod; I would say it differently. Statistical inference is based not only on estimating p but taking into account variability. Without that we don't have much statistical inference.

Hi again revere. Sorry to labor this issue but I have just been re-reading Doug Altman's chapter on CIs in the 3rd edition of 'EBM: how to practice and teach EBM' by Straus and colleagues. Doug is a highly respected biostatistician from Oxford and he makes the following statement on page 264 of the book.
'The CI is based on the idea that the same study carried out on different samples of patients would not yield identical results, but that their results would be spread around the true but unknown value. The CI estimates this sampling variation. The CI does not reflect additional uncertainty due to other causes; in particular CIs do not incorporate the impact of selective loss-to-follow-up, poor compliance with treatment, imprecise outcome measurements, lack of blinding, and so on. CIs thus always underestimate the total amount of uncertainty.'
It was his comment about 'imprecise outcome measurements' that led me to believe that he counted all non-sampling causes of random error among the 'other causes' not included in the CI. I have emailed him to clarify this.

Rod: What he is saying is exactly the idea of a coverage probability as I explained it in the post. He is also making the point -- which we also made in the post -- that the CI does not reflect all sources of variation, in particular, sources of systematic (i.e., non random) variation, usually called bias. Note that all his examples are examples of bias. So there is no conflict with what we have said. It is exactly the same thing. But the idea of "sampling variation" includes all the sources of random variation in a population. That variation may have a particular structure to it (error structure) that needs to be taken into account.

Hi again. I understand that a CI cannot account for systematic error. I also understand how the CI for continuous data can take account of some measurement error given that the SE for a mean takes the SD of the data into account. However I still don't understand how this works for a proportion that only has p(1-p) in the SE equation. Consider a situation where the measurement of outcomes in an RCT is done either with an unbiased but "blunt instrument" like the clinical judgement of multiple doctors or alternatively is done using an unbiased but more accurate standardised algorithm. The first example has a lot more random measurement error than the latter but the calculation of a CI could be identical.