Randomized trial versus observational study challenge, VI: randomization, first part

[Previous installments: here, here, here, here, here]

After a detour through the meaning of causation and the need to find a substitute for what can't, in principle, be observed (the counterfactual), we are now ready to consider what many of you might have thought would be the starting point, randomization. It's a surprisingly difficult topic and this post will probably be more challenging for non-statisticians, but I feel confident you don't need to be an expert to understand it.

First a quick recap. If you want to know if mammography screening will prolong the life of a woman under the age of 40 (Jane Doe), ideally you would first screen Ms. Doe, observe the outcome, and then turn the clock back and repeat her life without screening. Any difference in time of death is due to mammography (even one minute). Since we can't do this, nor can we find an exact substitute for Jane Doe, we consider some workarounds. The first would be a whole population of Jane Does. Can we find a substitute for them? We want to make sure that all of the women in the population are Jane Does, at least in the minimal sense that they are at risk for getting breast cancer. In randomized clinical trials there is usually some attempt to make the study group representative of the target population (in this case, at-risk women under the age of 40), but it is rare to try to make them representative by random sampling of the population. To do this you would need a list of all Jane Does to start from, and this isn't available. But it turns out it isn't necessary for the validity of the trial, although it may affect how generalizable the results are.

Representativeness isn't where randomization comes in, so let's move on. If you have a group of candidate Jane Does you might consider splitting them in two, screening one group and leaving the other group unscreened. Is the unscreened group an adequate substitute for what you can't have, the population of screened Jane Does with the clock turned back so you can see what happens when they are unscreened? Maybe, but unlikely. You have no guarantee that the newly formed unscreened subgroup will be an adequate substitute. Nothing will make it the same as the screened group, not even randomization. People differ in all sorts of ways from each other, even within the screened group, which means you can't get a group, no matter how you form it, that is identical to the screened group. This is obvious once you think about it, but there is a common misunderstanding that randomization makes the two groups the same. It doesn't, nor is it required that it do so. Here is how statistician Paul Rosenbaum puts it:

First, experiments [here, randomized clinical trials] do not require, indeed cannot reasonably require, that experimental units be homogeneous, without variability in their responses. Homogeneous experimental units are not a realistic description of factory operations, hospital patients, agricultural fields. Second, experiments do not require, indeed, cannot reasonably require, that experimental units be a random sample from a population of units. Random samples of experimental units are not the reality of the industrial laboratory, the clinical trial, or the agricultural experiment. Third, for valid inferences about the effects of a treatment on the units included in an experiment, it is sufficient to require that treatments be allocated at random to experimental units - these units may be both heterogeneous in their responses and not a sample from a population. Fourth, probability enters the experiment only through the random assignment of treatments, a process controlled by the experimenter. A quantity that is not affected by the random assignment of treatments is a fixed quantity describing the units in the experiment. (Rosenbaum, P, Observational Studies, 2nd Ed.)

So if randomization isn't making the two groups "the same," what is its function? We'll take Rosenbaum's approach and follow the ideas of the originator of randomization in experimental design, R.A. Fisher, one of the giants of 20th century applied statistical theory. We'll change Fisher's subject (an English lady who claimed she could tell by tasting if milk or tea had been added first), but use Fisher's numbers. For variety we'll change the example once again and consider 8 people being treated for low back pain with acupuncture (I like to be provocative). We will randomly assign 4 low back sufferers to acupuncture done according to acupuncture theory and 4 to sham acupuncture (the needles are in inappropriate places for back pain). We are going to make only one assumption: that acupuncture has absolutely no effect on back pain. This is an assumption most of you won't have a problem with, but it is a surprisingly potent hypothesis, something statisticians call the "null hypothesis." If it turns out to be an unlikely assumption given the results, it will imply that acupuncture has some effect. In some views you could substitute any alternative hypothesis for the null, but for Fisher the null has a privileged position. We'll explain why, shortly, and in the next post discuss some pertinent differences of Fisher's view from other views.

To continue, at the end of a week we score each person in a way that indicates whether they have improved. Assume this is done blind to treatment and by some reliable scoring measure for evaluating lower back pain. Let's say we find that four improved and four didn't and we want to know if there is a relationship between improvement and proper acupuncturing. Don't worry that this trial is small. We are illustrating a principle and, as noted, these are the same numbers Fisher used. The statistical test we will explain is widely used for small samples and is called the Fisher Exact Test.

Let's pause to summarize what we are working with. We have 8 people, all presumably candidates for having their low back symptoms improved by acupuncture. We have randomly assigned 4 of them to treatment, 4 to placebo. They don't know which treatment they got and the person who is scoring their symptom improvement doesn't either. The question is whether the results are likely, given the null hypothesis.

The results are summarized in the form of a statistic. A statistic is a number derived from the data. A mean value (also called an average or expected value) is a statistic. So is a standard deviation. But so are simpler numbers, like the number of treated back pain patients who improved. That is the statistic Fisher used. Where does chance or probability come into this scheme? As Rosenbaum notes, only in one place: the random procedure that assigned treatments to four subjects and placebos to four others. It was completely under the control of the investigator and the probabilities of any possible treatment assignment can be calculated (or approximated) in advance.

After the randomization, everything is determined, at least if the null hypothesis is true. This is a key point. Whether a person improves may depend on a lot of things but the one thing it doesn't depend on is the treatment, assuming the null hypothesis. Why? The assumption says that for any particular person the outcome is the same (and fixed) whether they got real or sham acupuncture. This is the way Fisher's randomization together with the null hypothesis works around the counterfactual. It is a type of thought experiment that turns the clock back by saying that if the only thing that is different is the treatment, but the treatment doesn't do anything, then nothing is different.

Maybe you understand this immediately, but it is a hard thing for many of us to get our heads around, so let me belabor the point with another way to understand it. Suppose whether an acupunctured (sham or real) person improved or not was decided by the flip of a coin (where the sides were labeled improved or unimproved, not heads or tails). Each person has his or her own coin but the coins might differ for all sorts of reasons, just as people do. Since flipping is accomplished by the treatment but the treatment doesn't do anything, flipping the coin doesn't change the outcome. The null hypothesis is like giving a person a coin that has improved on both sides or unimproved on both sides. Flipping it (acupuncturing, sham or otherwise) can't change the outcome. If you are tempted to say that just by virtue of being in the trial a person's outcome might change, you are really saying that the treatment (sham or otherwise) does have an effect, no matter how small or insignificant, which is contrary to the null hypothesis. Fisher's null is sharp and strict. No effect really means no effect.
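If code is easier for you than coins, here is a minimal sketch of the same idea (in Python, with made-up names and outcomes, so purely an illustration). Each person carries a pair of potential outcomes, one for real acupuncture and one for sham; the sharp null says the two entries of each pair are identical, so whatever the randomization happens to pick, the observed results never change.

```python
# A minimal sketch of the "two-sided coin" idea: under the sharp null each
# person's outcome is the same whether they get real or sham acupuncture,
# so re-randomizing the treatment labels never changes what we observe.
# The names and outcomes below are invented for illustration.
import random

# potential outcomes: (outcome if treated, outcome if given sham)
# the sharp null says the two entries are identical for every person
people = {
    "Jim": ("improved", "improved"),
    "Joe": ("unimproved", "unimproved"),
    "Susie": ("improved", "improved"),
    "Alice": ("unimproved", "unimproved"),
    "Jane": ("improved", "improved"),
    "Bill": ("unimproved", "unimproved"),
    "Harold": ("improved", "improved"),
    "Jennifer": ("unimproved", "unimproved"),
}

names = list(people)
treated = set(random.sample(names, 4))   # the only random step in the experiment

observed = {
    name: people[name][0] if name in treated else people[name][1]
    for name in names
}
# Because both potential outcomes are equal, `observed` comes out the same
# no matter which 4 names the randomization happens to pick.
print(observed)
```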

Once you grasp this, Fisher's analysis makes sense and is really quite ingenious. Everything but the randomization was fixed independently of the study. You have 8 people and they have labels on them, "improved" or "not improved," according to the examination by the blinded evaluator. What wasn't fixed ahead of time was the result of the randomization, which produces another label on each of them: "treated" or "placebo." Those treatment assignment labels are where randomization came into it. We are interested in how often the two sets of labels "line up," that is, how many times did a treated person improve and a placebo person not improve? There are a lot of possible statistics we could use here, even just counting the different categories, because once you fix the number treated and untreated and the number improved and not improved, you only need to determine one other category, like the number of correct label line-ups (treated-improved plus placebo-unimproved) or even just the number of treated who improved. Either determines all the rest of the categories (for those more experienced with data tables, you can satisfy yourselves this is true by making a 2x2 table with 4 fixed marginals -- 4 treated, 4 placebos, 4 improved, 4 unimproved; you have four empty boxes in the middle; fixing just one of the interior boxes is enough to fix all the others). We'll use the number of correct line-ups as our statistic. The maximum is 8, that is, a perfect line-up. How likely is that to happen under a random assignment of 4 treated and 4 placebo?

The answer is a simple problem in combinatorics: "eight choose four," or 70 distinct ways to label 8 people as 4 treated and 4 placebo. Think of this as a big fish bowl full of label assignments, each one a strip of paper giving the labels of 8 people. There will be 70 different strips (label line-ups -- example of one such assignment: Jim/T, Joe/P, Susie/P, Alice/P, Jane/T, Bill/T, Harold/T, Jennifer/P). You reach into this bowl and pull one out "randomly." What is the chance you got the single one that is a perfect match to your results? One chance in 70. Suppose, however, the line-up wasn't perfect but you got 6 out of 8 (you can't get 7 out of 8, because one misalignment automatically creates a second)? It's fairly easy to calculate that there are 16 assignments in the fish bowl where 6 of the 8 are correctly aligned with the treatment. 16/70 is about 23%, roughly the chance of flipping a coin twice and getting 2 heads in a row, fairly high. You conclude that if the null hypothesis is correct, then a perfect line-up is pretty unlikely, which might make you want to re-think your assumption of "no effect." On the other hand, getting 6 out of 8 right could happen pretty easily if the assumption of no effect were true, so you are left unsure. Acupuncture could still be having an effect, but you can't tell your apparent success from a chance event.
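For those who want to check the arithmetic, here is a short sketch (Python, standard library only) that brute-forces the fish bowl: it enumerates all 70 possible treatment assignments and counts how often the line-up is perfect or 6 out of 8, with the null having fixed which 4 people improved.

```python
# Check Fisher's arithmetic by brute force: enumerate all C(8,4) = 70 ways to
# label 4 of the 8 people "treated", then count how often the labels line up
# with the fixed improvement results.
from itertools import combinations

people = range(8)
improved = set(range(4))        # under the null, who improved is fixed: 4 people

assignments = list(combinations(people, 4))     # all possible treatment groups
print(len(assignments))                         # 70

def correct_lineups(treated):
    treated = set(treated)
    placebo = set(people) - treated
    # correct line-ups: treated-and-improved plus placebo-and-unimproved
    return len(treated & improved) + len(placebo - improved)

counts = [correct_lineups(t) for t in assignments]
print(sum(c == 8 for c in counts) / len(assignments))   # 1/70  ≈ 0.014
print(sum(c == 6 for c in counts) / len(assignments))   # 16/70 ≈ 0.229
```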

You can find more details of the Fisher Exact Test in any elementary statistics book, but in this instance we just want to emphasize that the keys to the argument rest on two things. The first is that the randomization method is known and controlled by the investigator. He or she can thus calculate the probabilities of the various combinations for the eight patients. The second is the assumption that the treatment (here, acupuncture) has no effect whatsoever. Rosenbaum calls randomization with a null hypothesis the most "uncluttered" experiment because it requires the fewest assumptions (only one, but it is a pretty stringent one; more on that in the next post). If you want to allow for an effect and estimate its size and precision (how effective is acupuncture, for example), then you have to make some additional assumptions. Remember, too, that real experiments have all sorts of technical and logistical flaws, such as loss to follow-up, measurement error, etc., so this really is the minimal and ideal case. It is also small enough that we can get an exact count of how many possible treatment assignments there are. As sample sizes increase we are forced to use approximations, although the approximations are frequently very, very good.

That's quite enough for one post. We'll finish up randomization in the next one and then conclude by returning to our challenge example.


The quote from Rosenbaum "Fourth, probability enters the experiment only through the random assignment of treatments, a process controlled by the experimenter..." lays the groundwork for a discussion between Bayesians and classical statisticians. Revere makes a lot of this sentence, but there is an alternative interpretation of the role of randomization.

To say that probability enters only as a result of the randomization claims that, given the results of the randomization, everything else is deterministic. At the moment of the coin toss, we are ignorant of the outcome of Sally's acupuncture. I believe Rosenbaum would say that her response is determined, we just don't know it yet. As we think about how we might model our beliefs about her (as yet unknown) response, we are led to set a probability distribution on it, generated completely by our ignorance and our beliefs. The probability distribution thus generated on the outcome of the total experiment gives us the probability apparatus we need to analyze it, even if it were not randomized at all. Of course, the resulting analysis is different from Fisher's Exact Test.

I don't mean to imply that the randomization is not useful, after all it helps to prevent bias. However, I think it goes too far to think of it as providing THE probability structure of the experiment.

Tom S: I'm not sure if we are disagreeing or not. When you speak of the probability generated by the total experiment, you are referring to the probability distribution of the statistic, R. Rosenbaum does say that the response statistic, R = Z^T r, is a random variable because it depends on Z, the treatment allocation. r is fixed, however. It is Z that varies "randomly." For a sharp null, the coin toss does nothing. It might as well not have been tossed.

Hi Revere,

I have really enjoyed this series, but I have a dumb question. How does one category determine all the rest? If I know # of treated who improved, then obviously I know # of treated who did not improve, but how do I know anything about the placebo group?

Preliminary googling did not elucidate what data tables or fixed marginals are.

Thanks!

pd,

It's easiest to see if you do it yourself. Make a 2x2 table. Label the columns "Improved" and "Unimproved." Label the rows "Treated" and "Untreated." From revere's numbers, each column and each row must add to 4 (4 treated, 4 untreated, 4 improved, 4 unimproved). Those are the "fixed marginals" (the row and column sums that you put around the margins of the table).

Now put a number between 0 and 4 into any box, and try to figure out what numbers could then go into the other 3 boxes, while ensuring that rows and columns all add to 4. You'll find that there's only one way to do so.

pd: Not a dumb question at all, nor a self-evident one, which is why I sketched (inadequately) the answer to it in the post. Let me expand on it a bit. You'll need paper and pencil.

Make a table that has 2 rows and 2 columns. The two rows are treatment and placebo, the 2 columns are improved and not improved. So you have 4 boxes for the various combinations. Now at the end of each row enter the number of treated and placebo (4 and 4 in this case, decided on ahead of time) and at the bottom of each column the number of improved and not improved (again 4 and 4 here, but they could be any two numbers that add up to 8, depending on the outcome of the experiment). This is what you start with in your analysis.

Now pick any empty box in the middle, let's say the top left, which we'll call treated and improved. Fill in a number (let's say it's 3 people who got real acupuncture and improved). Look at your table after putting 3 in the box at upper left. That forces all the other boxes. For example, the one on the upper right now must be 1 (because the sum in the row is 4) and the box under it must also be 1 (because the sum of the column is also 4) and that forces the last box, the one on the lower right also to be 3.

To recap, you start out with the two row totals (4 treated and 4 placebo, fixed by the randomization) and you do the study and obtain the two column totals (in this case 4 improved and 4 unimproved, but it could be 6 and 2 or 5 and 3 or whatever, as long as the numbers add up to 8). As long as the row and column totals are fixed, it only takes filling in one of the inner boxes to get the other 3 (or, in the case of the statistic I used, the sum of the upper left box and the lower right box will also work, although with these marginals they will always be equal).

The easiest way to see this is to make some tables with fixed (marginal) rows and columns and try it for yourself. You'll see that filling in a single box in the middle is sufficient (which means you could also use as a statistic any combination of the boxes).
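If you'd rather let a few lines of code do the pencil work, here is a minimal sketch (the numbers are just the ones from the post) showing that one interior box plus the marginals forces the rest:

```python
# Given the fixed marginals and any one interior cell of the 2x2 table,
# the other three cells are forced.
def fill_table(treated_improved, n_treated=4, n_placebo=4, n_improved=4):
    a = treated_improved      # treated & improved (the one cell you choose)
    b = n_treated - a         # treated & unimproved
    c = n_improved - a        # placebo & improved
    d = n_placebo - c         # placebo & unimproved
    return a, b, c, d

print(fill_table(3))          # (3, 1, 1, 3)
```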

Hope that helps.

qetzal & revere:

Thanks very much! I was puzzling over how there could possibly be a general solution, and the answer is clearly that there isn't - you do actually need to know the results (4 improved, 4 unimproved). You said it clearly in the post and I did reread it but I still managed to miss it. Sorry for taking your time on such a silly question!

pd: No, it wasn't silly and others probably were wondering, too, so I'm glad qetzal and I answered you (although if I'd known qetzal was going to do it, I'd have just let his shorter but good explanation suffice). What Fisher saw was that if you randomly allocate treatments 4 and 4 (thus fixing the row totals) and the treatment does nothing (this fixes the column totals, since people improve or not independent of which row they are in and you might as well not have acupunctured them, real or sham), then the way the interior boxes vary follows something called the hypergeometric distribution (sampling without replacement) and can be calculated ahead of time. The probability is easy to calculate for small cases, but for large samples the numbers get big very fast (there are 9 factorials involved), so we use good approximations. The chi-squared test for 2x2 tables is the best known.

If you want to figure out the probability of a particular table for yourself according to the hypergeometric distribution, take the product of the factorials of the four marginals and divide it by the product of the factorials of the grand total and each of the four inner boxes. In the case of 3 in the upper left box this is (4! x 4! x 4! x 4!) / (8! x 3! x 1! x 3! x 1!), which is about 0.23.
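If you'd rather not push the factorials through a calculator, here is a quick sketch that gets the same number two ways (the hypergeometric call assumes you have scipy installed; the first part needs nothing beyond base Python):

```python
# Check the factorial formula for the table with 3 in the upper left box,
# done two ways. Marginals are all 4, grand total 8, as in the post.
from math import factorial as f
from scipy.stats import hypergeom

# product of the marginal factorials over the grand total and cell factorials
p = (f(4) * f(4) * f(4) * f(4)) / (f(8) * f(3) * f(1) * f(3) * f(1))
print(p)                          # about 0.23 (exactly 16/70)

# hypergeom.pmf(k, M, n, N): k = 3 treated-improved, M = 8 people in all,
# n = 4 improved in total, N = 4 people drawn as "treated"
print(hypergeom.pmf(3, 8, 4, 4))  # same answer
```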

Quite by chance today I read Richard Wright's purportedly true account in Harper's (12/1942) of . . . uh . . . modifications to several randomly assigned medical experiments at Chicago's Michael Reese Hospital one winter day in 1932.

http://www.harpers.org/archive/1942/12/0020407

Though written for humor (and circa Reid's formative eras), it should give us a bit of pause. I have observed a few variations on the theme, generally done to PIs who undervalued, and underestimated, underlings.

Sorry, Revere, I don't intend to interrupt your excellent CME. The article seemed just too apropos . . . . though definitely tangential . . .

I still can't agree that the only source of probability is the randomization. Even given the result of the randomization, there is still uncertainty in what the response will be, and that uncertainty gives rise to a probability distribution itself. If we toss a coin and conceal the result, the chance of a head is still 1/2, even though the event has already happened. Our uncertainty about it (as long as we have it concealed) generates the probability distribution.

In the acupuncture study, a perfectly legitimate statistical analysis can be generated, albeit different from Fisher's Exact Test, even if treatment assignment was not random at all (important qualifier below). This is a Bayesian analysis and is valid regardless of how treatments were assigned.

Qualifier: Randomization prevents bias. If you don't randomize, then I might choose to construct a different "model" of the experiment, which would lead to a different analysis. And that leaves open the question, maybe to be dealt with in the next post, what do you do if the 4 sickest patients get randomized to one treatment?

Tom S: I think the reason you are having a problem is that you have chosen a particular concept of probability (subjective). I discussed this a bit in another post (not in this series) on Neyman-Pearson statistics. In the next post I will get at this obliquely by discussing the difference between Fisher's idea of "no effect" and Neyman's.

However, I didn't want to mix in too many things in this series. The problem is that once you pull a thread like "what does probability mean?" the whole thing starts to unravel, and I originally had the specific challenge example in mind, so I wanted to discuss things from the usual viewpoint of EBM people. These ideas are frequentist in origin and, while I tend to be a Bayesian with respect to my notions of probability, I decided that this approach lends itself to trying to deconstruct the clinical trial challenge example for the benefit of EBM adherents. There are probably other ways to do it, maybe even better ones, but this is the way I've chosen, by following the thread of the argument starting with the usual critiques of the challenge example.

You raise interesting questions, which, I regret to say, I must put aside for the moment.

I understand, Revere, continue with your excellent series, and we'll have an opportunity to return to this another day.

We need to look at random versus fixed effects now, before we generalize the results;
See
http://www.meta-analysis.com/

Topic of the month
Fixed vs. random effects in meta-analysis

• What are fixed effect and random effects models?
• How do they differ from each other?
• Which model should you be using?
• What is the most common mistake people make when selecting a model?
Overview
One goal of a meta-analysis will often be to estimate the overall, or combined effect.
If all studies in the analysis were equally precise we could simply compute the mean of the effect sizes.  However, if some studies were more precise than others we would want to assign more weight to the studies that carried more information. This is what we do in a meta-analysis.  Rather than compute a simple mean of the effect sizes we compute a weighted mean, with more weight given to some studies and less weight given to others.
The question that we need to address, then, is how the weights are assigned. It turns out that this depends on what we mean by a "combined effect". There are two models used in meta-analysis, the fixed effect model and the random effects model. The two make different assumptions about the nature of the studies, and these assumptions lead to different definitions for the combined effect, and different mechanisms for assigning weights.

spinoff of
http://www.cochrane.org/index.htm

These are not lightweight or in-the-weeds considerations. They are fundamental, because if you cannot generalize the results past the immediate patients or group or conditions, the study is not as valuable.

The misunderstanding of fixed versus random effects is a problem right now with FAA for their environmental impact studies, the Florida department of environmental protection, and others, like in the state Florida and county health departments.

Dwight Hines


Dwight: Not there yet. I'm taking another path, although I suspect we'll wind up where you would wind up, but, I hope, without the technical details. At any rate, that's the way I've decided to do it for this audience.