[Previous installments: here, here, here, here, here, here]

Last installment was the first examination of what “randomized” means in a randomized controlled trial (RCT). We finish up here by calling attention to what randomization does and doesn’t do and under what circumstances. The notion of probability here is the conventional frequentist one, since that’s how most RCTs are interpreted by users. Before we launch into the details, let’s stop for a minute to see where we’ve been and why.

We began with a challenge to you, our readers. In the first post of this series we described an uncontrolled, unblinded, small convenience-sample trial of clinical efficacy conducted (and later published in a peer-reviewed medical journal) by a practicing cardiologist using her patients with refractory hypertension. On the basis of solid mechanistic data she conjectured that an anti-epilepsy drug approved by the FDA for safety and anti-convulsive efficacy would have a beneficial effect on the blood pressure of her patients who so far had not responded to any other treatment. A series of 29 patients with blood pressure measured before and after a short course of treatment showed a clinically and statistically significant reduction in both systolic and diastolic blood pressure. We asked you to tell us why and for what this study or case series (we didn’t stipulate what to call it) should be relied upon as producing valid scientific information. What we described, while highly plausible and perhaps even typical of clinical literature, was entirely fictitious and deliberately constructed so as not to have any of the features considered not only desirable but necessary by many Evidence Based Medicine (EBM) advocates. Sophisticated EBM advocates do not represent this extreme, but many EBM consumers do. The idea that unless something has been subjected to an RCT it remains scientifically unproved is a tacitly assumed and prevalent position among clinicians and scientifically savvy non-clinicians. Our ultimate claim is that it is wrong twice: once in rejecting systematically collected reliable information from non-RCTs, including the extreme opposite pole as in our challenge example; and once in giving undue weight to the RCT as a source of evidence.

In order to support these claims we have resorted to a fine-grained analysis of what the objective of any clinical study is and the problems in meeting this objective. After a brief examination of some of your initial responses in part 2 of this series, we moved on in part 3 to ask what we mean by any treatment being effective. This is a much deeper question about what we mean by causation and we spent parts 3, 4 and 5 discussing various aspects of this knotty question. It seems to produce an insoluble problem involving a comparison with something that didn’t happen and can never happen, the counterfactual. In part 6 we began our discussion of how randomization provides one kind of workaround for this problem, although it is not the only reasonable workaround. The clinical case series of our challenge example is another, this one involving a cross-over design.

It is not possible to solve the counterfactual problem unless you make some fundamental assumptions, and last time (part 6) we discussed how making one key assumption allows randomization to provide a clean interpretation of results in terms of a frequentist notion of probability. Statisticians call that key assumption the null hypothesis, but it turns out there is more than one way to interpret what that means. Two of the giants of the foundations of applied statistics, R.A. Fisher (who introduced randomization into experimental design in his text of 1935) and Jerzy Neyman, disagreed about what it meant and we need to discuss this disagreement because it is highly pertinent to our challenge question and indeed how we view RCTs in general. In part 6 we gave Fisher’s view (embodied in the much used Fisher Exact Test) that the null hypothesis of “no effect” of the treatment *literally meant “no effect.”* What this means is that if you give the treatment (vaccine, drug, acupuncture, prayer, etc.) there is no effect of any kind on the unit of treatment (usually a person). It is as if they got nothing.

Neyman’s view was that “no effect” meant “no effect, on average.” It could be that some people were helped and some people hurt by the drug, but when all was said and done the pluses and minuses canceled out. What this leaves open is that there might be subgroups that benefit but whose benefits are submerged in a population unselected for those who might benefit (or for those who might be hurt, for that matter). To take an extreme example, it would be as if you were testing a prostate cancer drug with survival as an outcome in a population that might include males and females, most of whom didn’t have prostate cancer. Neyman’s perspective makes a great deal of sense when it comes to making general recommendations about an unselected population. But Fisher’s idea is closer to the intuitive idea of causation and might make more sense in terms of a doctor and her individual patient, especially as there might be additional information about the patient not considered in the RCTs that could modify the response to treatment.

If you are unpersuaded by this, let us try another illustration of the difference. Everyone knows that even legitimate casinos or state lotteries are tilted in favor of the House. Let’s say, though, that there is a casino that is just there for fun and that the odds of winning and losing in the long run are the same. If you keep playing, in the long run you are almost certain to break even (assuming you and the House always have sufficient funds so that neither of you can ever be ruined). This is the null hypothesis and Fisher and Neyman agree on this. But Fisher goes much further. He doesn’t just say you break even in the long run. He says you can never win or lose, even on a single play, while Neyman says that sometimes you’ll win and sometimes you’ll lose. Fisher’s casino isn’t any fun. Neyman’s might be, especially if you quit when you are ahead. [**Afterthought** added later: Perhaps a better way to visualize this is to imagine a chain of break-even casinos with tens of thousands of players each night. In a Neyman casino you may win or lose or break even, but the casino’s take for the night, averaged over the tens of thousands of players, will be zero and the average of individual winnings will be zero, although some players will have won and some will have lost. In a Fisher casino, every game ends in a draw with the amount wagered equal to the amount won on each and every game for each and every player. Average results are the same in each casino. It’s clear that no one would go to a Fisher casino, but an awful lot of people go to sub-Neyman casinos where on average they are guaranteed to lose. It obviously makes a difference whether you conceive of an RCT as a Fisher or Neyman situation.]
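For readers who like to see such things run, here is a minimal sketch of the two casinos. Everything in it is our own illustration: the function names `neyman_casino` and `fisher_casino`, the unit stake, the 50,000 players and the fixed random seed are all assumptions chosen for convenience, not anything Fisher or Neyman wrote down.

```python
import random
import statistics

def neyman_casino(n_players, stake=1.0, rng=None):
    """Hypothetical Neyman casino: each player wins or loses the stake
    with equal probability, so winnings are zero only on average."""
    rng = rng or random.Random()
    return [stake if rng.random() < 0.5 else -stake for _ in range(n_players)]

def fisher_casino(n_players):
    """Hypothetical Fisher casino: every game is a draw, so each
    player's net winnings are exactly zero."""
    return [0.0] * n_players

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
neyman = neyman_casino(50_000, rng=rng)
fisher = fisher_casino(50_000)

for name, winnings in [("Neyman", neyman), ("Fisher", fisher)]:
    print(name,
          "| mean winnings per player:", round(statistics.mean(winnings), 3),
          "| players who won or lost:", sum(w != 0 for w in winnings))
```

The mean winnings should come out close to zero in both houses, while the count of players who actually won or lost something is the whole crowd in one and nobody in the other. The averages alone cannot tell you which casino you walked into.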

But we can’t differentiate Fisher’s world from Neyman’s on the basis of an RCT because an RCT can only display *average* responses of the randomized groups. Since the average response under the null hypothesis is the same for Fisher and Neyman, it provides no information useful in telling them apart. But in Neyman’s world, randomization has one more advantage. Simple uniform randomization guarantees that estimates of average treatment effects are unbiased (in the statistical sense that, averaged over all possible random assignments, the estimated difference between the groups equals the average of the individual treatment effects). So Neyman, who is interested only in averages, gets a bonus that Fisher, who is not, does not.
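The unbiasedness claim can be checked by brute force on a toy example. The sketch below is ours and rests on loud assumptions: six made-up units, invented potential outcomes (`y0` if untreated, `y1` if treated), and three hypothetical worlds, one Fisher-null, one Neyman-null and one with a real average effect. It enumerates every way of assigning three of the six units to treatment and averages the resulting difference in group means.

```python
from itertools import combinations
from statistics import mean

# Made-up potential outcomes for six units: y0 if untreated, y1 if treated.
y0 = [10, 12, 9, 15, 11, 13]
fisher_y1 = list(y0)                                            # no effect on anyone
neyman_y1 = [y + d for y, d in zip(y0, [3, -3, 2, -2, 1, -1])]  # effects cancel out
shifted_y1 = [y + d for y, d in zip(y0, [4, 0, 2, 6, -1, 1])]   # average effect of 2

def all_estimates(y0, y1, n_treated=3):
    """Difference in group means for every possible choice of treated units."""
    units = range(len(y0))
    estimates = []
    for treated in combinations(units, n_treated):
        treated_mean = mean(y1[i] for i in treated)
        control_mean = mean(y0[i] for i in units if i not in treated)
        estimates.append(treated_mean - control_mean)
    return estimates

for name, y1 in [("Fisher", fisher_y1), ("Neyman", neyman_y1), ("Shifted", shifted_y1)]:
    true_average_effect = mean(b - a for a, b in zip(y0, y1))
    estimates = all_estimates(y0, y1)
    print(name,
          "| true average effect:", true_average_effect,
          "| mean estimate over all assignments:", round(mean(estimates), 10),
          "| spread of estimates:", round(max(estimates) - min(estimates), 3))
```

In each world the estimate, averaged over all twenty equally likely assignments, matches the true average effect up to floating-point rounding, which is what unbiased means here. Note also that the Fisher and Neyman worlds share the same true average (zero), so a single trial reporting only group means has nothing with which to separate them.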

What happens if you don’t assume there is no difference, even on average? Then you have to make additional assumptions about the form of the difference. Some of these assumptions aren’t always harmless. For example, even a simple assumption that treatment results in a constant difference (plus or minus) from the untreated state will likely be tractable only if you assume there is no effect of one treatment pattern compared to another (what Rosenbaum calls no interference between units and Rubin calls the Stable Unit Treatment Value Assumption). In words, this means that if you treat one person, that person’s outcome is not affected by who else is treated. This is OK for the blood pressure example but wouldn’t work for a vaccine trial in a specific population because of herd immunity effects.
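To make the interference point concrete, here is a deliberately crude sketch. The risk model, the function name `infection_risk` and every number in it (`base_risk`, `direct`, `herd`) are hypothetical inventions of ours, picked only to show the shape of the problem, not to describe any real vaccine.

```python
def infection_risk(unit, treated, base_risk=0.30, direct=0.6, herd=0.5):
    """Hypothetical risk model: a unit's risk falls if it is vaccinated
    (direct) and also as the share of the OTHER units vaccinated rises (herd)."""
    others = [u for u in range(len(treated)) if u != unit]
    share_others = sum(treated[u] for u in others) / len(others)
    own_factor = (1 - direct) if treated[unit] else 1.0
    return base_risk * own_factor * (1 - herd * share_others)

# Unit 0 is unvaccinated in both patterns; only the treatment of others changes.
nobody_else = [0, 0, 0, 0, 0]
everyone_else = [0, 1, 1, 1, 1]

print("vaccine-like, nobody else treated:  ", infection_risk(0, nobody_else))
print("vaccine-like, everyone else treated:", infection_risk(0, everyone_else))

# With herd=0 there is no interference, as we assume for the blood pressure drug.
print("no interference, nobody else:  ", infection_risk(0, nobody_else, herd=0.0))
print("no interference, everyone else:", infection_risk(0, everyone_else, herd=0.0))
```

When `herd` is above zero, the same untreated person has a different outcome depending on who else was treated, so there is no single "untreated outcome" for that person and the no-interference assumption fails. Set `herd` to zero and the outcome depends only on the person's own treatment, which is the situation we are assuming for the blood pressure example.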

The bottom line here is that even simple randomization requires some assumptions, which may or may not be true. It is one effective workaround for the counterfactual problem, but there’s no free lunch. Next time we’ll finish with the challenge by wrapping up our example and making one additional point, our zinger.

At least we think it’s a zinger. You may think it’s an anti-climax. It’s a long way to Tipperary.