[Previous installments: here, here, here, here]
We’d like to continue this series on randomized versus observational studies by discussing randomization, but upon reviewing comments and our previous post we decided to come at it from a slightly different direction. So we want to circle back and discuss counterfactuals a little more, clarifying and adapting some of what we said for the coming randomization discussion.
Let me change the example to a more recent controversy, screening mammography for breast cancer. Should women under 40 get routine screening given that there is said (on the basis of RCTs) to be a small benefit on the one hand and, on the other, putative or real risks? I don’t want to get into the actual details of this dispute but just look at the logic without concerning ourselves with the size of the benefit. Let’s simplify to asking if there is any benefit to mammography at all. This is like the question about our blood pressure drug: does it work? I’m changing the example because I think what is involved is easier to see when we talk about screening.
Consider what this means in the case of Jane Doe, who is considering whether to get a mammogram or not. She wants to know if it will prolong her life in quantity and quality. Let’s do a thought experiment. The ideal experiment would be this. We screen Jane Doe, follow her for a number of years and then use an appropriate measure of outcome, let’s say age at death (we are including not only breast cancer death but death from any cause). Then we turn the clock back, do not screen Jane Doe, follow her in the same way and compare the two outcomes. If they are different, most of us would say that’s what it means for “mammography to work.”
In the real world, though, only one of these scenarios can happen. The thing that gives it meaning for causality, the comparison with the impossible anti-trial, can’t happen. This is a conundrum. We need a work around. It won’t be perfect but it’s the best we can do. We will claim that there isn’t just one work around but many and you use what you can in terms of feasibility, ethics and resources.
So what are the work arounds for what seems an insoluble situation? Jane Doe can’t be both screened and not screened. They are mutually exclusive. For screening, unlike our blood pressure example, there is no possibility of using a cross-over design, i.e., first screening her (or not screening her) to see what happens for a few years and then not screening her (or screening her) and then seeing what happens for several more years. Her risk changes with time so the later Jane Doe is clearly not equivalent to the earlier one. We don’t have an identical Jane Doe. Even if she had an identical twin, Alice Doe, Jane and Alice will have the same genome but different histories, making them differ in many ways (e.g., one might be a radiologist and the other an accountant; one might live in Denver, the other in Charleston, SC). Not even their genetics will be identical because the genome becomes modified after birth. These “epigenetic” changes are like changing the Preferences on a software program. The underlying program is the same but two users might set things up quite differently after opening the shrinkwrap. The problem with a whole bunch of Jane Does (a population) is no different than one Jane Doe. We can’t both screen them and not screen them. At this point many of you will want to have two populations, one screened and one unscreened and compare them. That’s obviously where we are heading, but before we do, let’s stay with the counterfactual problem just a bit longer.
What if we could turn the clock back for a population of Jane Does? What would we look at? We want something that measures the relevant (for our purposes) differences between the screened and unscreened population. Epidemiologists are adept at finding these measures. It might be total mortality after a suitable follow-up period (incidence proportion), survival after screening, breast cancer mortality per person year of observation (incidence density), etc. Which one we choose may be subject and setting specific and let’s not worry about which one we settle on. We are only interested in the difference in the measure between the screened and unscreened population after turning the clock back. That difference is called a causal contrast (or effect measure(!) or causal parameter). Say we are using the arithmetic difference as the causal contrast. Let’s choose risk of dying from breast cancer and call the risk when the population is screened R1, and R0 the risk when it isn’t screened.
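For readers who like to see the arithmetic, here is a minimal sketch of the measures just named. The Person record, the function names and the two tiny cohorts are all made up for illustration; nothing here comes from real screening data.

```python
from dataclasses import dataclass


@dataclass
class Person:
    years_followed: float  # person-time this person contributes
    died: bool             # True if the outcome (death) occurred during follow-up


def incidence_proportion(cohort):
    """Deaths divided by the number of people who started follow-up (a risk)."""
    return sum(p.died for p in cohort) / len(cohort)


def incidence_density(cohort):
    """Deaths divided by total person-years of observation (a rate)."""
    return sum(p.died for p in cohort) / sum(p.years_followed for p in cohort)


def risk_difference(r1, r0):
    """Arithmetic causal contrast: risk if screened minus risk if not screened."""
    return r1 - r0


# Hypothetical follow-up records, invented for this example only.
screened = [Person(10, False), Person(10, False), Person(10, False), Person(5, True)]
unscreened = [Person(10, False), Person(10, False), Person(4, True), Person(6, True)]

r1 = incidence_proportion(screened)
r0 = incidence_proportion(unscreened)
print("R1 =", r1, "R0 =", r0, "R1 - R0 =", risk_difference(r1, r0))
print("rate if screened:", incidence_density(screened), "per person-year")
```

Note that the sketch quietly treats both cohorts as if they were the same people with the clock turned back. In the real world only one of them could exist.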
We can observe only one of these, however, because in the real world we can’t turn the clock back. While we want to measure (R1 – R0), we can only observe one of R1 or R0, not both together, so we can’t measure what we want, the causal contrast (the measure of effect). Faced with this, we try to do the next best thing: find a substitute for the unobservable counterfactual population (the one that was unscreened). So far we haven’t said anything about randomization and for good reasons. This is a general framework for all kinds of studies about whether something works or causes disease (etiologic studies). In experimental situations, like clinical trials, the investigator gets to assign the treatment (screened or not screened, drug or no drug). In observational studies the assignments are given to us or the contrast is with some prior experience (e.g., Rind’s HIV or rabies examples mentioned in our last post or our fictitious blood pressure trial). Study design differences are related to how we choose the original population for the study question (the Jane Does), how we choose an appropriate substitute population for the causal contrast (the pseudo Jane Does), how well these decisions represent the world of Jane Does and then all sorts of ancillary decisions related to costs, study time and other technical factors.
The sampling problem is an added complication. The study group of screened Jane Does is meant to represent the world of Jane Does for whom we want to know if screening is a good thing. The way we have chosen them might or might not be representative (and here that means “good stand-ins”) for that larger group. The same goes for the unscreened pseudo Jane Does. Let’s call the larger group the target population. So there is a double substitution going on here. One is the pseudo Jane Does (unscreened) for the real Jane Does (screened). The other is both of these populations for the bigger external target population. None of this (so far) involves randomization. It just involves the notion of substitution of one population for another so we can observe something unobservable (in one case the entire population, in the other the counterfactual), thus allowing us to get a causal contrast. If either or both of these substitutes don’t represent the target population or the counterfactual pair, then we run the risk of misreading the contrast. Epidemiologists call the non-comparability of the counterfactual substitute confounding (or we say the causal contrast is confounded; this is a more general notion of confounding than seen in many textbooks but amounts to the same thing). What it means in plain language is that when we used an imperfect substitute we weren’t really getting an accurate picture of what we would have seen if we’d been able to turn the clock back. We aren’t seeing what we want to (but can’t) see. Our “work around” was faulty.
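A toy calculation may make the faulty work around concrete. It is only a sketch under loud assumptions: the baseline risks, the 0.8 risk ratio for screening and the make-up of the two groups are all invented, and the code works with expected risks rather than counting actual deaths. It compares the contrast we would get by turning the clock back on the Jane Does with the contrast we get by substituting a group of pseudo Jane Does who happen to have fewer high-risk women.

```python
# Every number below is invented for illustration; none comes from real data.
RISK_UNSCREENED = {"high": 0.10, "low": 0.02}  # risk of breast cancer death, by baseline risk
ASSUMED_RISK_RATIO = 0.8                       # assumption: screening cuts each risk by 20%


def risk(group_counts, screened):
    """Average risk in a group described by head counts per baseline-risk stratum."""
    total = sum(group_counts.values())
    factor = ASSUMED_RISK_RATIO if screened else 1.0
    expected_deaths = sum(
        n * RISK_UNSCREENED[stratum] * factor for stratum, n in group_counts.items()
    )
    return expected_deaths / total


# The screened Jane Does: half are high risk (say they sought screening for that reason).
jane_does = {"high": 500, "low": 500}
# The unscreened stand-ins: mostly low risk, so not comparable to the Jane Does.
pseudo_jane_does = {"high": 100, "low": 900}

r1 = risk(jane_does, screened=True)                     # observable
r0_counterfactual = risk(jane_does, screened=False)     # needs the clock turned back
r0_substitute = risk(pseudo_jane_does, screened=False)  # what we can actually observe

causal_contrast = r1 - r0_counterfactual
observed_contrast = r1 - r0_substitute
confounding_bias = observed_contrast - causal_contrast

print(f"causal contrast   (R1 - R0, clock turned back): {causal_contrast:+.3f}")
print(f"observed contrast (R1 - substitute R0):         {observed_contrast:+.3f}")
print(f"difference due to the imperfect substitute:     {confounding_bias:+.3f}")
```

With these made-up numbers the clock-back contrast is about -0.012 (screening helps) while the observed contrast is about +0.020 (screening appears to hurt). The only thing that changed is the substitute.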
Let’s consider our blood pressure trial (see previous posts here, here, here, here). The substitute for the same patient but without treatment by the drug was in fact the same patient at a previous time when they were being treated with the usual therapies and not responding. This would seem to be a pretty good substitute, although one can conjure up reasons why there might be confounding (see the comment threads and also the post by David Rind in Evidence in Medicine). Most of those problems could probably be remedied with a technical fix for this unblinded single-arm trial and wouldn’t require randomization. What is the target population? It could be all refractory hypertensives or just refractory hypertensives in this doctor’s practice or none at all, just a report of what happened to these patients. So the double substitution is visible here, too.
A randomized clinical trial (RCT) is another kind of “work around” for the counterfactual problem. You still have to worry about how good the substitute for the counterfactual is and how representative a substitute the study population is for the target population. Thus when you study seasonal flu vaccine effectiveness in the elderly with an RCT, they are a substitute for the elderly in general. If you extend that to swine flu vaccine in the young, you are changing the nature of the substitute. That may or may not be a reasonable thing to do. It requires justification. You have to check what the target population is and that the study population provides a fair substitute for it. Thus even a pristine RCT isn’t a pure gold standard, but an alloyed gold standard. You always have to check how much base metal there is.
Although randomization doesn’t deal with the target population substitution, there is a reasonable expectation that it will make the counterfactual substitute (the pseudo Jane Does for the screened Jane Does) roughly comparable, i.e., that the substitute will be a good one. Even here, however, complications arise independent of randomization gone bad (i.e., that by odd chance the two groups will not be comparable on some factor of importance). This requires a longer discussion, though, and is best delayed to the next installment. How many installments will there be? I have no idea. I am just following the argument as it goes. I’m surprised it has taken this many.
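Here is a rough sketch of that expectation, again with invented numbers (a single high/low baseline risk factor, made-up risks, an assumed 0.8 risk ratio for screening) and expected risks in place of counted deaths. It assigns people to the two arms by shuffling and splitting, then looks at how far apart the arms end up on the risk factor. The function names are hypothetical, and this is one simple way to randomize, not a description of how any particular trial did it.

```python
import random

# Every number below is invented for illustration; none comes from real data.
RISK_UNSCREENED = {"high": 0.10, "low": 0.02}
ASSUMED_RISK_RATIO = 0.8  # assumption: screening cuts each person's risk by 20%


def randomize(n, p_high, rng):
    """Draw n people with a baseline risk factor, then split them at random into two arms."""
    people = ["high" if rng.random() < p_high else "low" for _ in range(n)]
    rng.shuffle(people)
    half = n // 2
    return people[:half], people[half:]


def mean_risk(arm, screened):
    """Average expected risk in an arm, with or without screening."""
    factor = ASSUMED_RISK_RATIO if screened else 1.0
    return sum(RISK_UNSCREENED[s] * factor for s in arm) / len(arm)


def share_high(arm):
    """Fraction of an arm that carries the high-risk factor."""
    return sum(s == "high" for s in arm) / len(arm)


def trial(n, p_high=0.3, seed=0):
    rng = random.Random(seed)
    screened, unscreened = randomize(n, p_high, rng)
    everyone = screened + unscreened
    return {
        # How different the two arms are on the risk factor.
        "imbalance": share_high(screened) - share_high(unscreened),
        # What the trial lets us observe: screened arm versus unscreened arm.
        "estimated_contrast": mean_risk(screened, True) - mean_risk(unscreened, False),
        # Clock-back contrast for the whole trial population (unobservable in practice).
        "causal_contrast": mean_risk(everyone, True) - mean_risk(everyone, False),
    }


large = trial(20000)
small_trials = [trial(40, seed=s) for s in range(200)]
worst_small_imbalance = max(abs(t["imbalance"]) for t in small_trials)

print("large trial:", large)
print("worst imbalance among 200 small trials:", worst_small_imbalance)
```

In a large trial the arms should come out close on the risk factor and the estimated contrast should sit near the clock-back one. In small trials an unlucky split can leave the arms quite different, which is the "randomization gone bad" case. The other complications are a separate matter and are not modeled here.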
If you are interested in counterfactuals, this article is useful: Estimating causal effects. Maldonado G, Greenland S. Int J Epidemiol. 2002 Apr;31(2):422-9. It’s a deeper subject than most people give it credit for. As for the next installment, you don’t have to wait. The comment threads are open.