Randomized trial versus observational study challenge, VII: randomization, second part

[Previous installments: here, here, here, here, here, here]

Last installment was the first examination of what "randomized" means in a randomized controlled trial (RCT). We finish up here by calling attention to what randomization does and doesn't do and under what circumstances. The notion of probability here is the conventional frequentist one, since that's how most RCTs are interpreted by users. Before we launch into the details, let's stop for a minute to see where we've been and why.

We began with a challenge to you, our readers. In the first post of this series we described an uncontrolled, unblinded, small convenience sample trial of clinical efficacy conducted (and later published in a peer-reviewed medical journal) by a practicing cardiologist using her patients with refractory hypertension. On the basis of solid mechanistic data she conjectured that an anti-epilepsy drug approved by the FDA for safety and anti-convulsive efficacy would have a beneficial effect on the blood pressure of her patients who so far had not responded to any other treatment. A series of 29 patients with blood pressure measured before and after a short course of treatment showed a clinically and statistically significant reduction in both systolic and diastolic blood pressure. We asked you to tell us why and for what this study or case series (we didn't stipulate what to call it) should be relied upon as producing valid scientific information. What we described, while highly plausible and perhaps even typical of clinical literature, was entirely fictitious and deliberately constructed so as not to have any of the features considered not only desirable but necessary by many Evidence Based Medicine (EBM) advocates. Sophisticated EBM advocates do not represent this extreme, but many EBM consumers do. The idea that unless something has been subjected to an RCT it remains scientifically unproved is a tacitly assumed and prevalent position among clinicians and scientifically savvy non-clinicians. Our ultimate claim is that it is wrong twice: once in rejecting systematically collected, reliable information from non-RCTs, including studies at the extreme opposite pole like our challenge example; and again in giving the RCT undue weight as a source of evidence.

In order to support these claims we have resorted to a fine-grained analysis of what the objective of any clinical study is and the problems in meeting that objective. After a brief examination of some of your initial responses in part 2 of this series, we moved on in part 3 to ask what we mean by any treatment being effective. This is a much deeper question about what we mean by causation, and we spent parts 3, 4 and 5 discussing various aspects of this knotty question. It seems to pose an insoluble problem: a comparison with something that didn't happen and can never happen, the counterfactual. In part 6 we began our discussion of how randomization provides one kind of workaround for this problem, although it is not the only reasonable one. The clinical case series of our challenge example is another, one involving a cross-over design.

It is not possible to solve the counterfactual problem unless you make some fundamental assumptions, and last time (part 6) we discussed how making one key assumption allows randomization to provide a clean interpretation of results in terms of a frequentist notion of probability. Statisticians call that key assumption the null hypothesis, but it turns out there is more than one way to interpret what it means. Two of the giants of the foundations of applied statistics, R.A. Fisher (who introduced randomization into experimental design in his 1935 text, The Design of Experiments) and Jerzy Neyman, disagreed about what it meant, and we need to discuss this disagreement because it is highly pertinent to our challenge question and indeed to how we view RCTs in general. In part 6 we gave Fisher's view (embodied in the much-used Fisher Exact Test) that the null hypothesis of "no effect" of the treatment literally meant "no effect." That is, if you give the treatment (vaccine, drug, acupuncture, prayer, etc.) there is no effect of any kind on the unit of treatment (usually a person). It is as if they got nothing.
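To make the logic behind the Fisher Exact Test concrete, here is a minimal sketch in Python using scipy. The 2x2 counts are invented for illustration, not taken from any study discussed in this series.

```python
# A minimal sketch of Fisher's exact test on a hypothetical 2x2 table.
# Under Fisher's sharp null, treatment changes no one's outcome, so the
# only randomness is in which patients the randomization assigned where.
from scipy.stats import fisher_exact

# Rows: treated, control. Columns: improved, not improved. Invented counts.
table = [[12, 8],
         [5, 15]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

The p-value here is the probability, over all the ways the randomization could have come out with the table's margins fixed, of seeing a split at least as extreme as the one observed.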

Neyman's view was that "no effect" meant "no effect, on average." It could be that some people were helped and some people hurt by the drug, but when all was said and done the pluses and minuses canceled out. What this leaves open is that there might be subgroups that are benefited, but the benefits are submerged in a population unselected for those who might benefit (or for those who might be hurt, for that matter). To take an extreme example, it would be as if you were testing a prostate cancer drug with survival as an outcome in a population that might include males and females, most of whom didn't have prostate cancer. Neyman's perspective makes a great deal of sense when it comes to making general recommendations about an unselected population. But Fisher's idea is closer to the intuitive idea of causation and might make more sense in terms of a doctor and her individual patient, especially as there might be additional information about the patient, not considered in the RCTs, that could modify the response to treatment.
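The distinction is easy to see in a simulation. Here is a hedged sketch (all numbers invented) in which both nulls produce the same group average but very different individual experiences:

```python
# Sketch contrasting Fisher's sharp null with Neyman's average null.
# All parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
untreated = rng.normal(120, 10, n)            # e.g., untreated blood pressure

# Fisher's null: treatment changes no one's outcome at all.
fisher_effects = np.zeros(n)

# Neyman's null: real individual effects that cancel on average.
neyman_effects = rng.choice([-5.0, 5.0], n)   # half helped, half hurt

for name, tau in [("Fisher", fisher_effects), ("Neyman", neyman_effects)]:
    print(f"{name}: average effect = {tau.mean():+.3f}, "
          f"individuals actually affected = {(tau != 0).sum()}")
```

Both worlds show an average effect of (essentially) zero; they differ only in what happens to individuals, and that difference is exactly what a comparison of group averages cannot see.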

If you are unpersuaded by this, let me try another illustration of the difference. Everyone knows that even legitimate casinos or state lotteries are tilted in favor of the House. Let's say, though, that there is a casino that is just there for fun and that the odds of winning and losing in the long run are the same. If you keep playing, in the long run you are almost certain to break even (assuming you and the House always have sufficient funds so that neither of you can ever be ruined). This is the null hypothesis, and Fisher and Neyman agree on this. But Fisher goes much further. He doesn't just say you break even in the long run. He says you can never win or lose, even on a single play, while Neyman says that sometimes you'll win and sometimes you'll lose. Fisher's casino isn't any fun. Neyman's might be, especially if you quit when you are ahead. [Afterthought added later: Perhaps a better way to visualize this is to imagine a chain of break-even casinos with tens of thousands of players each night. In a Neyman casino you may win or lose or break even, but the casino's take for the night, averaged over the tens of thousands of players, will be zero, and so will the average of individual winnings, although some players will have won and some will have lost. In a Fisher casino, every game ends in a draw, with the amount wagered equal to the amount won on each and every game for each and every player. Average results are the same in each casino. It's clear that no one would go to a Fisher casino, but an awful lot of people go to sub-Neyman casinos where on average they are guaranteed to lose. It obviously makes a difference whether you conceive of an RCT as a Fisher or a Neyman situation.]

But we can't differentiate Fisher's world from Neyman's on the basis of an RCT, because an RCT can only display average responses of the randomized groups. Since the average response under the null hypothesis is the same for Fisher and Neyman, it provides no information useful in telling them apart. But in Neyman's world, randomization has one more advantage. Simple uniform randomization guarantees that the estimate of the average treatment effect is unbiased (in the statistical sense that, averaged over all the ways the randomization could have come out, the estimated group difference equals the average of the individual treatment effects). So Neyman gets a bonus, because he is only interested in averages; Fisher isn't.
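Here is a hedged sketch of what that unbiasedness means, using invented potential outcomes (which we can see only because it's a simulation): over many hypothetical replays of the randomization, the mean of the difference-in-means estimates matches the true average treatment effect, even though individual effects vary widely.

```python
# Sketch: unbiasedness of the difference-in-means under simple randomization.
# Potential outcomes and parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100
y0 = rng.normal(120, 10, n)              # outcome if untreated
tau = rng.normal(-3, 6, n)               # heterogeneous individual effects
y1 = y0 + tau                            # outcome if treated
true_ate = tau.mean()

estimates = []
for _ in range(20_000):                  # replay the randomization many times
    treated = rng.permutation(n) < n // 2
    estimates.append(y1[treated].mean() - y0[~treated].mean())

print(f"true average effect = {true_ate:+.3f}")
print(f"mean of randomization estimates = {np.mean(estimates):+.3f}")
```

Any single randomization can still be off by bad luck; only the average over randomizations is guaranteed, which is why this bonus matters to Neyman and not to Fisher.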

What happens if you don't assume there is no difference, even on average? Then you have to make additional assumptions about the form of the difference, and some of these assumptions aren't harmless. For example, even the simple assumption that treatment results in a constant difference (plus or minus) from the untreated state will likely be tractable only if you also assume that one person's outcome is unaffected by which other people are treated (what Rosenbaum calls no interference between units and Rubin calls the Stable Unit Treatment Value Assumption). In words, this means that if you treat one person, their outcome is not affected by who else is treated. This is fine for the blood pressure example but wouldn't work for a vaccine trial in a specific population, because of herd immunity effects (see the sketch below).
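To make the vaccine point concrete, here is a hedged sketch in which each person's infection risk depends on overall coverage; the risk function and all numbers are invented for illustration.

```python
# Sketch of interference (a SUTVA violation): with herd immunity, one
# person's outcome depends on how many others were treated.
# The risk model and parameters are invented.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

def infection_risk(vaccinated, coverage):
    base = 0.30 * (1 - coverage)                  # exposure falls as coverage rises
    return base * (0.2 if vaccinated else 1.0)    # vaccine also protects directly

for coverage in [0.1, 0.5, 0.9]:
    vaccinated = rng.random(n) < coverage
    risk = np.where(vaccinated,
                    infection_risk(True, coverage),
                    infection_risk(False, coverage))
    infected = rng.random(n) < risk
    print(f"coverage {coverage:.0%}: attack rate unvaccinated = "
          f"{infected[~vaccinated].mean():.3f}, vaccinated = "
          f"{infected[vaccinated].mean():.3f}")
```

The unvaccinated do better and better as coverage rises, so "the outcome of person i if untreated" isn't a fixed quantity: it depends on the whole treatment pattern, which is exactly what the no-interference assumption rules out.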

The bottom line here is that even simple randomization requires some assumptions which may or may not be true. It is one effective workaround for the counterfactual problem, but there's no free lunch. Next time we'll finish with the challenge by wrapping up our example and making one additional point, our zinger.

At least we think it's a zinger. You may think it's an anticlimax. It's a long way to Tipperary.


[Fisher] says you can never win or lose, even on a single play, while Neyman says that sometimes you'll win and sometimes you'll lose.

I'm not sure Fisher would say that - it would be equivalent to saying that the casino was deterministic.

I think the distinction you're trying to make is between being identical and being (in technical jargon) exchangeable. Fisher's casinos are all exactly the same. Neyman's are different, but the differences are random and you can't tell which is good or bad (well, until you've lost your shirt...).

Hm. Let's try a different analogy - Fisher's an average lottery player, Neyman's an average poker player.

I don't know that your casino example is entirely accurate. Under the null distribution, your outcome is the same (on average) whether or not you receive the treatment. There is still some variability possible, and you could win (or lose) just by luck. With Neyman, you might benefit (or be harmed), on average, if you get the treatment vs. not, but that benefit might not be seen at the study population level, since only a few individuals may see that benefit or loss. A better casino analogy might be a two-armed slot machine, one arm representing treatment and the other representing no treatment. Under Fisher, you have the same odds of winning under the null whether you pull the left or the right arm, no matter what machine you go to. But under Neyman, some slot machines may favor one arm over the other, while on average there is no advantage either way.
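This slot-machine picture can be simulated directly. Here is a minimal sketch (the per-machine biases are invented): each machine may favor one arm, but across machines the advantage averages out.

```python
# Sketch of the two-armed slot machine picture under Neyman's null:
# individual machines may favor one arm, but the average edge is zero.
# All parameters are invented.
import numpy as np

rng = np.random.default_rng(3)
machines, pulls = 1_000, 1_000
edge = rng.normal(0, 0.05, machines)     # per-machine bias toward the left arm

left_wins = rng.random((machines, pulls)) < (0.5 + edge[:, None])
right_wins = rng.random((machines, pulls)) < (0.5 - edge[:, None])

gap = left_wins.mean(axis=1) - right_wins.mean(axis=1)
print(f"average left-minus-right gap across machines = {gap.mean():+.4f}")
print(f"machines with a gap bigger than 5 points = {(np.abs(gap) > 0.05).sum()}")
```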

Not sure what the zinger may be, but one issue I wonder about is the assumption that randomization balances all potential confounders, measured and unmeasured. Analytical methods commonly seen (and expected) in observational studies, mainly multiple regression models of one sort or another, are not commonly seen in RCTs.

So the issue of statistical model building is of some interest to me. While this issue may be peripheral to this series of postings, I would welcome some kind of discussion of how you remove confounders in both observational studies and RCTs (other than assuming that the problem does not exist in the latter).

I was not quite following your description of Fisher's interpretation of the null hypothesis. It made it sound as if the effect is zero in each subject in the sample. But Fisher could never have invented ANOVA if he had not compared within-group variation with between-group variation. If the null hypothesis means that no one is affected by the exposure, then within-group variation and between-group variation would both be zero.

I think that the characterization of Fisher's concept of the null hypothesis needs a bit of clarification. Mighty interesting stuff, though!

By Ed Whitney on 27 Jan 2010

Ed: No, you understood me. That is what Fisher thought. It doesn't mean there is no variation. Even in the Lady Tasting Tea experiment there is variation, caused by who knows what. The variation may still exist for all sorts of reasons, but it is unrelated to treatment assignment. Fisher believed that "no effect" really meant "no effect" on each individual unit, i.e., that treatment assignment didn't change the outcome. For Neyman it was "no average effect." Rosenbaum is particularly good on this and I recommend his text, Observational Studies.

I figured it must have something to do with the word "effect." The Lady Tasting Tea book is at the bookstore and the Rosenbaum is not, so I will pick up the former on the way home.

So how's about them assumptions concerning confounding in RCTs? Do they really all wash out in the randomization? Should RCTs be doing multivariable regression? Do regression models have anything to do with reality as it is found in nature? I don't need a big discussion of these when you have lots on your plate, but the first 2 questions seem relevant to the discussion of randomization issues.

By Ed Whitney on 27 Jan 2010

Eek. I wrote a post here that was held up in moderation, and now hasn't appeared. Is revere still going through the moderation queue, or is there another problem?

Bob: Just checked the moderation queue and it was empty but there were 3 legit comments (including yours) in the publisher's spam filter. Apologies to all. I took care of them. I'll try to check but I am distracted by the grant writing.

Ed: Here's the relationship between randomization and confounding. If you think of confounding as the hidden or unrecognized differences between the treated and control groups, you can still have differences after randomization. The larger the groups, the smaller the chance of big differences, but they can still exist. The value of randomization is that you can use statistical tests to get a handle on how big the chance is that you had "bad luck" in the randomization, which also allows you to compare it to the effect of treatment. In fact, that's why we can use these tests to say the results "are not likely due to chance."
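A hedged sketch of that first point (prevalence and thresholds invented): the probability of a sizable chance imbalance in a binary confounder falls quickly as the arms grow.

```python
# Sketch: randomization doesn't guarantee balance in any single trial,
# but big imbalances in a confounder become rare as the groups grow.
# The confounder's prevalence and the imbalance threshold are invented.
import numpy as np

rng = np.random.default_rng(4)
sims = 10_000

for n in [20, 100, 1000]:                      # patients per arm
    trait = rng.random((sims, 2 * n)) < 0.5    # a 50%-prevalent confounder
    treated, control = trait[:, :n], trait[:, n:]
    imbalance = np.abs(treated.mean(axis=1) - control.mean(axis=1))
    print(f"n = {n:4d} per arm: "
          f"P(imbalance > 10 points) = {(imbalance > 0.10).mean():.3f}")
```

In this toy setup, a double-digit imbalance is more likely than not with 20 patients per arm, but essentially never happens with 1,000 per arm.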