After a detour through the meaning of causation and the need to find a substitute for what can’t, in principle, be observed (the counterfactual), we are now ready to consider what many of you might have thought would be the starting point: randomization. It’s a surprisingly difficult topic and this post will probably be more challenging for non-statisticians, but I feel confident you don’t need to be an expert to understand it.
First a quick recap. If you want to know if mammography screening will prolong the life of a woman under the age of 40 (Jane Doe), ideally you would first screen Ms. Doe, observe the outcome, and then turn the clock back and repeat without screening. Any difference in time of death is due to mammography (even one minute). Since we can’t do this, and can’t find an exact substitute for Jane Doe either, we consider some workarounds. The first would be a whole population of Jane Does. Can we find a substitute for them? We want to make sure that all of the women in the population are Jane Does, at least in the minimal sense that they are at risk for getting breast cancer. In randomized clinical trials there is usually some attempt to make the study group representative of the target population (in this case, at-risk women under the age of 40), but it is rare to try to make them representative by random sampling of the population. To do this you would need a list of all Jane Does to start from and this isn’t available. But it turns out it isn’t necessary for validity of the trial, although it may affect how generalizable the results are.
Representativeness isn’t where randomization comes in. So let’s move on. If you have a group of candidate Jane Does you might consider splitting them in two, screening one group and leaving the other group unscreened. Is the unscreened group an adequate substitute for what you can’t have, the population of screened Jane Does with the clock turned back so you can see what happens when they are unscreened? Maybe, but unlikely. You have no guarantee that the newly formed unscreened subgroup will be an adequate substitute. Nothing will make the group the same as the screened group, not even randomization. People differ in all sorts of ways from each other even within the screened group, which means you also can’t get a group, no matter how, that is identical to the screened group. This is obvious once you think about it, but there is a common misunderstanding that randomization makes the two groups the same. It doesn’t, nor is it required that it do so. Here is how statistician Paul Rosenbaum puts it:
First, experiments [here, randomized clinical trials] do not require, indeed cannot reasonably require, that experimental units be homogeneous, without variability in their responses. Homogeneous experimental units are not a realistic description of factory operations, hospital patients, agricultural fields. Second, experiments do not require, indeed, cannot reasonably require, that experimental units be a random sample from a population of units. Random samples of experimental units are not the reality of the industrial laboratory, the clinical trial, or the agricultural experiment. Third, for valid inferences about the effects of a treatment on the units included in an experiment, it is sufficient to require that treatments be allocated at random to experimental units – these units may be both heterogeneous in their responses and not a sample from a population. Fourth, probability enters the experiment only through the random assignment of treatments, a process controlled by the experimenter. A quantity that is not affected by the random assignment of treatments is a fixed quantity describing the units in the experiment. (Rosenbaum, P, Observational Studies, 2nd Ed.)
So if randomization isn’t making the two groups “the same,” what is its function? We’ll take Rosenbaum’s approach and follow the ideas of the originator of randomization in experimental design, R.A. Fisher, one of the giants of 20th century applied statistical theory. We’ll change Fisher’s subject (an English lady who claimed she could tell by tasting if milk or tea had been added first), but use Fisher’s numbers. For variety we’ll change the example once again and consider 8 people being treated for low back pain with acupuncture (I like to be provocative). We will randomly assign 4 low back sufferers to acupuncture done according to acupuncture theory and 4 to sham acupuncture (the needles are in inappropriate places for back pain). We are going to make only one assumption: that acupuncture has absolutely no effect on back pain. This is an assumption most of you won’t have a problem with, but it is a surprisingly potent hypothesis, something statisticians call the “null hypothesis.” If it turns out to be an unlikely assumption given the results, it will imply that acupuncture has some effect. In some views you could substitute any alternative hypothesis for the null, but for Fisher the null has a privileged position. We’ll explain why, shortly, and in the next post discuss some pertinent differences of Fisher’s view from other views.
To continue, at the end of a week we score each person in a way that indicates whether they have improved. Assume this is done blind to treatment and by some reliable scoring measure for evaluating lower back pain. Let’s say we find that four improved and four didn’t and we want to know if there is a relationship between improvement and proper acupuncturing. Don’t worry that this trial is small. We are illustrating a principle and, as noted, these are the same numbers Fisher used. The statistical test we will explain is widely used for small samples and is called the Fisher Exact Test.
Let’s pause to summarize what we are working with. We have 8 people, all presumably candidates for having their low back symptoms improved by acupuncture. We have randomly assigned 4 of them to treatment, 4 to placebo. They don’t know which treatment they got and the person who is scoring their symptom improvement doesn’t either. The question is whether the results are likely, given the null hypothesis.
The results are summarized in the form of a statistic. A statistic is a number derived from the data. A mean value (also called an average) is a statistic. So is a standard deviation. But so are simpler numbers, like the number of treated back pain patients who improved. That is the statistic Fisher used. Where does chance or probability come into this scheme? As Rosenbaum notes, only in one place: the random procedure that assigned treatments to four subjects and placebos to four others. It was completely under the control of the investigator and the probabilities of any possible treatment assignment can be calculated (or approximated) in advance.
After the randomization, everything is determined, at least if the null hypothesis is true. This is a key point. Whether a person improves may depend on a lot of things but the one thing it doesn’t depend on is the treatment, assuming the null hypothesis. Why? The assumption says that for any particular person the outcome is the same (and fixed) whether they got real or sham acupuncture. This is the way Fisher’s randomization together with the null hypothesis works around the counterfactual. It is a type of thought experiment that turns the clock back by saying that if the only thing that is different is the treatment, but the treatment doesn’t do anything, then nothing is different.
Maybe you understand this immediately, but it is a hard thing for many of us to get our heads around, so let me belabor the point with another way to understand it. Suppose whether an acupunctured (sham or real) person improved or not was decided by the flip of a coin (where the sides were labeled improved or unimproved, not heads or tails). Each person has his or her own coin but the coins might differ for all sorts of reasons, just as people do. Since flipping is accomplished by the treatment but the treatment doesn’t do anything, flipping the coin doesn’t change the outcome. The null hypothesis is like giving a person a coin that has improved on both sides or unimproved on both sides. Flipping it (acupuncturing, sham or otherwise) can’t change the outcome. If you are tempted to say that just by virtue of being in the trial a person’s outcome might change, you are really saying that the treatment (sham or otherwise) does have an effect, no matter how small or insignificant, which is contrary to the null hypothesis. Fisher’s null is sharp and strict. No effect really means no effect.
Once you grasp this, Fisher’s analysis makes sense and is really quite ingenious. Everything but the randomization was fixed independently of the study. You have 8 people and they have labels on them, “improved” or “not improved” according to the examination by the blinded evaluator. What wasn’t fixed ahead of time was the result of the randomization, which produces another label on each of them: “treated” or “placebo.” Those treatment assignment labels were where randomization came into it. We are interested in how often the two sets of labels “line up,” that is, how many times did a treated person improve and an untreated one not improve? There are a lot of possible statistics we could use here, even in just counting the different categories, because once you fix the number treated and untreated and the number improved and not improved, you only need to determine one other category, like the number of correct label line-ups (treated-improved plus placebo-unimproved) or even just the number of treated who are improved. Either determines all the rest of the categories (for those more experienced with data tables, you can satisfy yourselves this is true by making a 2×2 table with 4 fixed marginals: 4 treated, 4 placebos, 4 improved, 4 unimproved; you have four empty boxes in the middle; fixing just one of the interior boxes is enough to fix all the others). We’ll use the number of correct assignments as our statistic. The maximum number is 8, that is, a perfect line-up. How likely is that to happen under a random assignment of 4 treated and 4 placebo?
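For readers who would rather check the 2×2 table claim by machine than by pencil, here is a minimal sketch in Python. The function name `fill_table` and its arguments are my own inventions for illustration; the only things taken from the example are the four fixed marginals of 4.

```python
def fill_table(treated_improved, n_treated=4, n_placebo=4, n_improved=4):
    """Complete a 2x2 table from one interior cell and the fixed marginals."""
    a = treated_improved          # treated and improved
    b = n_treated - a             # treated, not improved
    c = n_improved - a            # placebo, improved
    d = n_placebo - c             # placebo, not improved
    if min(a, b, c, d) < 0:
        raise ValueError("cell value is incompatible with the marginals")
    # "lined_up" is the statistic used in the text: treated-improved
    # plus placebo-unimproved.
    return {"treated": (a, b), "placebo": (c, d), "lined_up": a + d}

for k in range(5):
    print(k, fill_table(k))
```

With these marginals the line-up count always comes out as twice the number of treated who improved, so the only possible values are 0, 2, 4, 6 and 8.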
The answer is a simple problem in combinatorics, “eight choose four,” or 70 distinct ways to label 8 people as 4 treated and 4 placebo. Think of this as a big fish bowl full of label assignments, each one a strip of paper giving the labels of 8 people. There will be 70 different strips (label line-ups; an example of one such assignment: Jim/T, Joe/P, Susie/P, Alice/P, Jane/T, Bill/T, Harold/T, Jennifer/P). You reach into this bowl and pull one out “randomly.” What is the chance you got the single one that was the perfect match to your results? One chance in 70. Suppose, however, the line-up wasn’t perfect but you got 6 out of 8? (You can’t get 7 out of 8, because if you make one misalignment you automatically make a second.) It’s fairly easy to calculate that there are 16 assignments in the fish bowl where there are 6 of 8 correctly aligned with the treatment. 16/70 is about 23%. That’s close to the chance of flipping a coin twice and getting 2 heads in a row (25%), which is fairly high. You conclude that if the null hypothesis is correct, then a perfect line-up is pretty unlikely, so seeing one might make you want to rethink your assumption of “no effect.” On the other hand, getting 6 out of 8 right could happen pretty easily if the assumption of no effect were true, so you are left unsure. Acupuncture could still be having an effect but you can’t tell your apparent success from a chance event.
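If you don’t trust the counting, you can empty the fish bowl by brute force. The sketch below lists every strip and tallies the line-ups. The names come from the sample strip above, but which four people improved is purely hypothetical (I picked four arbitrarily); under the null hypothesis those outcomes are fixed, so the tally is the same whichever four you pick.

```python
from collections import Counter
from itertools import combinations

people = ["Jim", "Joe", "Susie", "Alice", "Jane", "Bill", "Harold", "Jennifer"]
# Hypothetical outcomes, fixed under the null: assume these four improved.
improved = {"Jim", "Jane", "Bill", "Harold"}

def lined_up(treated):
    """Count treated-and-improved plus placebo-and-unimproved people."""
    treated = set(treated)
    return sum((p in treated) == (p in improved) for p in people)

# Every strip in the fish bowl: each way to choose 4 of the 8 for treatment.
strips = list(combinations(people, 4))
counts = Counter(lined_up(s) for s in strips)

print(len(strips))  # 70
for score in sorted(counts):
    print(score, counts[score], f"{counts[score] / len(strips):.3f}")
```

The tally should show 1 strip with 8 line-ups, 16 with 6, 36 with 4, 16 with 2 and 1 with none, which adds up to the 70 strips.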
You can find more details of the Fisher Exact Test in any elementary statistics book, but in this instance we just want to emphasize that the argument rests on two things. The first is that the randomization method is known and controlled by the investigator. He or she can thus calculate the probabilities of the various combinations for the eight patients. The second is the assumption that the treatment has no effect whatsoever. Rosenbaum calls randomization with a null hypothesis the most “uncluttered” experiment because it requires the fewest assumptions (only one, but it is a pretty stringent one; more on that in the next post). If you want to allow for an effect and estimate its size and precision (how effective is acupuncture, for example), then you have to make some additional assumptions. Remember, too, that real experiments have all sorts of technical and logistical flaws, such as loss to follow-up, measurement error, etc., so this really is the minimal and ideal case. It is also small enough that we can get an exact number for how many possible treatment assignments there are. As sample sizes increase we are forced to use approximations, although the approximations are frequently very, very good.
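To give a feel for what an approximation can look like, here is one simple kind: instead of listing every strip in the fish bowl, draw a large number of strips at random and see how often the line-up is at least as good as the one observed. This is only a sketch of one possible approach, not necessarily the approximation a statistics package would use, and the function names, the number of draws and the seed are all my own choices. I run it on the 8-person example only because there we know the exact answer (16 + 1 = 17 strips out of 70) to compare against.

```python
import random

def lined_up(treated, improved, n):
    """Count treated-and-improved plus placebo-and-unimproved among n people."""
    treated = set(treated)
    return sum((i in treated) == (i in improved) for i in range(n))

def approx_tail(n, n_treated, improved, observed, draws=20000, seed=0):
    """Estimate the chance of at least `observed` line-ups under the null
    by sampling random treatment assignments instead of listing them all."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(draws):
        treated = rng.sample(range(n), n_treated)
        if lined_up(treated, improved, n) >= observed:
            hits += 1
    return hits / draws

# People are numbered 0..7; assume (hypothetically) that 0..3 improved.
estimate = approx_tail(n=8, n_treated=4, improved={0, 1, 2, 3}, observed=6)
print(estimate, 17 / 70)
```

The estimate should land close to the exact 17/70, and the same function works unchanged when the number of people is far too large to enumerate every assignment.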
That’s quite enough for one post. We’ll finish up randomization in the next one and then conclude by returning to our challenge example.