How likely are doping test false positives?

As I mentioned earlier, the current issue of Nature has a perspectives article by Donald Berry, a statistician at the MD Anderson Cancer Center, that addresses problems with the current system for testing athletes. I more or less agree with the article's overall conclusion: there needs to be more detailed scientific study of the tests that are used to determine whether athletes are using banned substances. Unfortunately, I don't think the rest of the article is as solid, or as well justified, as that conclusion. In particular, I think Berry seriously overstates the likelihood of false positives.

Berry writes:

Landis seemed to have an unusual test result. Because he was among the leaders he provided 8 pairs of urine samples (of the total of approximately 126 sample-pairs in the 2006 Tour de France). So there were 8 opportunities for a true positive -- and 8 opportunities for a false positive. If he never doped and assuming a specificity of 95%, the probability of all 8 samples being labelled 'negative' is the eighth power of 0.95, or 0.66. Therefore, Landis's false-positive rate for the race as a whole would be about 34%. Even a very high specificity of 99% would mean a false-positive rate of about 8%. The single-test specificity would have to be increased to much greater than 99% to have an acceptable false-positive rate. But we don't know the single-test specificity because the appropriate studies have not been performed or published.

More important than the number of samples from one individual is the total number of samples tested. With 126 samples, assuming 99% specificity, the false-positive rate is 72%. So, an apparently unusual test result may not be unusual at all when viewed from the perspective of multiple tests. This is well understood by statisticians, who routinely adjust for multiple testing. I believe that test results much more unusual than the 99th percentile among non-dopers should be required before they can be labelled 'positive'.
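Berry's arithmetic itself is easy to reproduce. Here's a minimal sketch (the specificity figures are his illustrative assumptions, not measured values, and independence between tests is assumed):

```python
# Chance of at least one false positive across n tests of a clean athlete,
# given a per-test specificity (true-negative rate) and assuming independence.
def false_positive_rate(specificity, n_tests):
    return 1 - specificity ** n_tests

print(false_positive_rate(0.95, 8))    # ~0.34: Landis's 8 sample-pairs at 95%
print(false_positive_rate(0.99, 8))    # ~0.08: the same 8 pairs at 99%
print(false_positive_rate(0.99, 126))  # ~0.72: all 126 Tour sample-pairs
```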

This is misleading, because it ignores both laboratory procedures for identifying false positives and the use of multiple tests before a final positive result is confirmed.

Anyone who works in a lab is aware of the dangers of false positives and false negatives. Everything you do to a set of samples can go wrong - and if you do it enough times, it probably will. Competent scientists keep this in mind when running experiments, and include procedures that can help identify bad results. The simplest and most widely used of these is the inclusion of positive and negative controls.

A positive control is a sample that you know contains whatever it is you're trying to find. A negative control is a sample that you know does not contain that substance. Each time you run tests on real samples, you run the same tests on both a positive and a negative control. If the results say that the positive control doesn't contain the substance, you know that you screwed up in a way that's likely to result in false negatives. If the negative control comes back with a positive result, you know that something went wrong that's likely to cause false positives. In either case, you discard the results from the entire run and try again.
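As a rough sketch of that gating logic (a toy model; every name here is hypothetical, and real accredited-lab procedures are far more elaborate):

```python
# Toy model of batch gating with positive and negative controls.
POSITIVE_CONTROL = {"substance_present": True}
NEGATIVE_CONTROL = {"substance_present": False}

def assay(sample):
    # Stand-in for the actual chemical test; here it just reads a flag.
    return sample["substance_present"]

def run_batch(samples):
    """Report results only if both controls behave as expected."""
    if not assay(POSITIVE_CONTROL):
        # Known positive not detected: this run is prone to false negatives.
        raise RuntimeError("positive control failed; discard run and retest")
    if assay(NEGATIVE_CONTROL):
        # Known negative detected: this run is prone to false positives.
        raise RuntimeError("negative control failed; discard run and retest")
    return [assay(s) for s in samples]
```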

The World Anti-Doping Agency (WADA) accredits the labs that do the drug testing for elite athletes. These labs must follow a set of international standards established by WADA when they carry out tests. It should come as no surprise that these standards require positive and negative controls in every test batch, including controls containing levels near the testing thresholds. They also require other good laboratory procedures: chain of custody must be maintained on the samples, specific handling procedures must be used, and the labs are tested (at least 20 times a year) with samples from WADA.

What this means is that, as far as the athlete is concerned, the relevant probability is not the probability that the test will return a false positive; it's the probability that the test will return a false positive that isn't caught by lab procedures. Even then, there are still additional mechanisms in place to protect athletes from false positives. This is good, because Berry is right on one point: if you repeat tests often enough, you'll probably get a false positive (or negative) sooner or later.

If the initial screening test comes back positive, the sample is not immediately declared positive and the athlete punished. Instead, a second test is run, using a different and more specific detection method if one is available. If this second test comes back negative, the presumption is that the first result was a false positive. The probability that the "A" sample as a whole will return a false positive is therefore not the probability that a single test will yield one; it's the probability that false positives will occur on two consecutive tests, typically using two separate detection methods. This won't simply equal the product of the two tests' individual false-positive probabilities, because the two tests are not entirely independent. But it will still be much lower than the probability of a false positive on any single test.

Even then, the athlete is not convicted of doping. There's yet another check for the possibility of a false positive. When the athlete is tested, the sample is immediately split into two components, both of which are sealed at the test site while the athlete watches. If the "A" test comes back positive, the second, "B" sample is tested. B sample tests are not conducted on the same day as A sample testing, and any testing step where the container is open must be carried out by a different technician from the one who performed the A sample tests. This does not make the B sample test completely independent of the A sample, but it does eliminate a number of the possible errors that might have led to a double false positive on the A sample.

This means that the probability that an athlete will be punished based on a false positive is not going to be remotely close to the probability that a single test will yield a false positive. Instead, it will equal the probability that at least three consecutive tests of the same sample will all yield undetected false positives.
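To make that concrete, here's a toy simulation (every rate in it is invented purely for illustration). A shared, sample-level problem makes consecutive tests of the same sample correlated, so the chance that all three stages return undetected false positives sits well below the single-test rate, but above the naive product of three independent rates:

```python
import random

P_SAMPLE_PROBLEM = 0.001  # hypothetical problem baked into the sample itself
P_TEST_ERROR = 0.01       # hypothetical independent per-test error

def clean_sample_flagged():
    # One clean sample, three consecutive tests (screen, confirmation, B sample).
    shared_problem = random.random() < P_SAMPLE_PROBLEM
    return all(shared_problem or random.random() < P_TEST_ERROR
               for _ in range(3))

N = 1_000_000
rate = sum(clean_sample_flagged() for _ in range(N)) / N
print(rate)  # ~0.001: well below the ~0.01 single-test rate, far above 0.01**3
```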

To put it another way, if the A and B samples both test positive for the same substance, there's very, very little chance that it's the result of anything other than something that is actually in the sample. At this point, the question becomes somewhat different - are the markers that the test looks for conclusive proof that a banned substance has been used? If they're not, they shouldn't be used in tests that can break someone's career.

That last is a harder question, and it's one where there really is the need for much more scientific examination of the testing procedures, as well as much more openness on the part of the testing authorities. That concern is very valid, and should be addressed. But on the whole, things are not as grim for athletes as Berry's article implies.

IIRC, the Landis case wasn't just a simple positive or negative to a specific banned compound but was rather that the ratio of two metabolites, which are expected to be present, was outside the "normal" range. I'm far from an expert here, but wouldn't such a test result be suspect if it didn't also compare the results to the normal range for the individual and/or consider environmental factors that could impact the ratio?

Welcome to the party.

It is true that the author put a worst-case spin on a very complicated set of situations - there are several major classes of doping products to which this analysis would apply. His quantitative analysis may be off (by orders of magnitude?). But the fundamental observation remains: there will be false positives. Without knowing more about the methods and validations, those of us on the outside cannot possibly be convinced that this is really scientific. From the things that leak out with respect to un-blinding of samples, arbitrary thresholds, refusal to discuss various kinds of false-positive and miss rates, possible lack of GLP-level quality control, and the general inability for multiple labs to evaluate the same samples... well, confidence that this "science" would meet the standards for a peer-reviewed journal article is very low.

None of my analysis, anyway, should be taken to suggest I think Landis or any other doper is being unfairly railroaded. I simply do not have enough evidence to know. The little bit I do know about analytical chemistry means that I do not just take it at face value that lab values are correct.

The labs are supposed to always follow protocols exactingly, but since everybody knows that a test result is likely never to be actually verified, the monetary incentive is to keep cutting corners until you get caught repeatedly and are finally forced to return to doing things right - at least while you're being watched. When the watchers leave, the corner-cutting starts anew, and now it's based on experience.

We also know that the people who subject their subordinates to testing would never accept the same treatment of themselves.

If you are given a urine test today, and six weeks from now are told you failed the test and must be disciplined, how can you defend yourself against a witness you cannot confront in court?

Test results are taken at face value as a rule, because that's cheaper, it's easier, and the people in charge will never be subjected to the same crapshoot themselves.

Officials have athletes tested, faculties have students tested, management has workers tested, officers have the boots tested, but never is it the other way around. This is reckless exercise of power and it will never be fixed because the test methods will never be scientifically tested for sensitivity and specificity.

Yes, but therefore the allowed range is usually set in such a way that almost everybody falls into it. Of course, some top athletes are top athletes because they have naturally strange levels of something. In that case it is possible to get an exception, which is of course a great way to cheat. Riccardo Riccò, for example, had one for a naturally high hematocrit value, and 50% of professional cyclists claim to suffer from asthma, which conveniently allows them to use certain substances on the doping list.

This is misleading, because it ignores both laboratory procedures for identifying false positives and the use of multiple tests before a final positive result is confirmed.

You're being unfair here. No matter how well the tests are done in the lab, they are still from two samples taken at the same time. Berry wrote:

Detecting a banned foreign substance in an athlete's blood or urine would seem to be clear evidence of guilt. But as with testing for synthetic testosterone, such tests may actually be measuring metabolites of the drug that are naturally occurring at variable levels.

(emphasis mine)
We need to know the natural variability in the athlete. If we don't know that, it doesn't matter how well our tests perform - we don't have an adequate baseline.

Just for clarity: even Berry didn't quite properly state the nature of the test that formed the sole basis of Landis' Adverse Analytical Finding. The test (isotope ratio mass spectrometry) actually measures the ratio of carbon-13 to carbon-12 in one metabolite versus the same ratio in another metabolite (where the latter is not affected by use of exogenous testosterone).

In Landis' case, the screening test (which measures the ratio of testosterone to epitestosterone) and the confirmation of the screening test were both "thrown out" because of inadequate performance by the lab.

By swimyouidiot (not verified) on 08 Aug 2008 #permalink

I don't believe that the Landis T:E ratio (the screen) and the analysis for exogenous testosterone were 'thrown out' as evidence, as Landis's suspension was recently upheld at the CAS.

Yes, but therefore the allowed range is usually set in such a way that almost everybody falls into it.

With the keyword being "almost." Isn't that the point of the original question? How much is "almost"? For example, if the range is set to cover 99.5%, that still means that one in 200 will naturally be outside of the range. And you have to admit that 99.5% is "almost" everybody.

You can talk about how robust that range is, but given thousands of athletes that get tested (and assuming they have the same distribution as the normal population), that one who has values outside of the normal range won't take much consolation in it.
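A quick back-of-the-envelope version of that point (assuming, as the commenter does, that tested athletes follow the same distribution as the reference population):

```python
# Expected number of clean athletes falling outside a range covering 99.5%
# of the population. The pool sizes are hypothetical.
coverage = 0.995
for pool in (1_000, 5_000, 10_000):
    print(pool, pool * (1 - coverage))  # 5.0, 25.0, 50.0 expected outliers
```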

And of course, as you note, there's the question of whether the range is selected based on what is normal for elite athletes, which does not have to be the same as for the general population.

After reading your comments, it seems that Berry's analysis is much more than misleading, it's downright wrong.

My understanding of Landis' number is: T/E ratios in normal folks are about 1:1. It takes a ratio of 4:1 to be guilty (so figure that includes 99.5% of folks). Landis was 11:1.

If 4:1 is 3 standard deviations, 11:1 is in the neighborhood of 10 standard deviations. To get this far away from normal without cheating would be a medical 'miracle'.

I'm not saying that it couldn't happen but you would think that something so extreme would have shown up in at least one of the numerous other samples he's given over the years.
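Checking that arithmetic under the commenter's own assumptions (a linear scale and a 1:1 mean, neither of which T/E distributions necessarily satisfy):

```python
# If a 4:1 ratio sits 3 standard deviations above a 1:1 mean, the implied
# standard deviation is 1.0, putting an 11:1 ratio 10 deviations out.
mean, cutoff, cutoff_sd, landis = 1.0, 4.0, 3.0, 11.0
sd = (cutoff - mean) / cutoff_sd
print((landis - mean) / sd)  # 10.0
```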

By David C. Brayton (not verified) on 11 Aug 2008 #permalink

Landis' T/E tests were thrown out in the first (AAA) hearing for irredeemable procedural problems, and not considered at all in the second (CAS) hearing. T/E ratios of 10:1 are not uncommon and, by themselves, rarely result in sanctions anymore as this has become understood; there is a need for individual, longitudinal studies to make such a case. Even USADA did not attempt to argue strongly for the T/E result in either the AAA hearing or the CAS appeal.

The CAS hearing concluded that all the procedural problems in the IRMS testing were not relevant to the analytic conclusions, and that the criteria by which certain measured delta-delta values were deemed sufficient to conclude doping were valid.

Berry was arguing, among other things, that the criteria may be selected incorrectly, and are likely to admit false positives that are not accounted for by the protocol. WADA standards do not seem to pay particular attention to false positives, other than to say they are bad. They do not require studies to determine the rate, and the studies supporting the testing done by the lab, if they can be called that, are not designed to determine false-positive rates. To date, CAS has been unwilling to look at the quality of the WADA rules in that respect, and no one seems much interested as long as dirty dopers appear to be getting caught.

For the record, Landis's IRMS values are odd compared to most of the positives recorded by the lab in question because of the large differences between the 5aA and 5bA values, which are usually much closer. Landis argued in both hearings that this was indicative of something out of the ordinary, and perhaps reflective of a methodology error. USADA argued it was just different doping. Neither panel was interested in resolving the underlying issues. Working from the "presumption of doping" present in the WADA code, both panels found doping had occurred.

Berry is saying that the current system is going to nail some innocent at some point. Maybe Landis was or wasn't that innocent victim. However, at some point it will happen, and the system is completely insensitive to that in a way that would not be considered acceptable in other contexts.

TBV

Landis had a TUE for cortisone injections into his hip. Cortisone is synthetic cortisol. Both cortisol and testosterone are made in the body (endogenously) from cholesterol. Our bodies can get cholesterol from our diet (unless we are strict vegans), but they also make their own cholesterol. It does not seem to me to be impossible that, due to interacting metabolic pathways, some of the carbons from exogenous cortisone could end up in cholesterol and/or testosterone molecules made endogenously. Can anyone show me any scientific evidence that this CAN'T happen?

I can find NO evidence in the scientific literature that anyone has ever done any study to determine whether cortisone injected into a joint affects the IRMS or any other test for testosterone or its metabolites. This should be an easy study to do, since the required samples would be easy to come by - there are lots of arthritic guys who get intra-articular cortisone injections. One of the possible causes of false positive lab results is INTERFERENCE FROM DRUG THERAPY, which of course includes LEGAL drug therapy.

WADA needs to show that cortisone treatment does not affect testosterone testing if it is to LEGITIMATELY use a positive result for testosterone metabolites to ruin the life of someone who has a TUE for cortisone use. If anyone has information that rules out the possibility of a false positive test for exogenous testosterone due to the use of exogenous cortisone, please direct me to that information. Maybe WADA has determined that it can't occur, or maybe they haven't. We have no way to tell, since they hide so much information.

Aside from that, it seems to me that there were so many irregularities in the IRMS testing procedure as performed by LNDD on Landis' samples that the results cannot be trusted, and we have no way to judge whether Landis is guilty or innocent. The arbitration decision was made by lawyers who apparently do not have a very good understanding of good laboratory procedure, since they seem to think that poor laboratory practices can't affect results. They sure as Hell can.

Even a good lab occasionally gets erroneous results. I work in a diagnostic lab. Once in a while (even when the control samples gave the expected results), we get a result on a patient that is not compatible with life. Yet the patient is alive, and maybe not even in bad shape. Because we are willing to question our results, we usually are able to find an explanation for why the result is erroneous. Often it is due to some mishandling of the sample or the presence of some interfering substance. A simple example of the former would be improper sample collection, such that serum to be used for biochemical analysis is contaminated with the potassium EDTA anticoagulant used to collect blood for hematologic analysis, causing the serum calcium result to come out too low to be compatible with a patient who is not showing signs of hypocalcemia, and the serum potassium to come out way higher than it takes to kill someone, or at least have serious effects on heart rate. A simple and well-known example of treatment causing erroneous results would be KBr treatment for seizures causing an unbelievably high (erroneous) result for chloride.

If it comes to that, we do not hesitate to see if our odd or questionable results are repeatable at another lab. That can help us find possible errors and figure out why they occurred. We want to identify and explain odd or erroneous results, not cover them up or pretend they don't matter. It seems the anti-doping organizations would just put the patient in a body bag and be done with it, since they think (or just pretend) that erroneous or misleading results do not occur. If the labs say you're dead, well, shut up - you're dead.

The primary thing that Mike Dunford gets wrong in the original posting is that false positives are not about lab mistakes; they're about normal natural variation that crosses the arbitrary threshold set by WADA.

The Landis case has been so highly focused on the issue of lab mistakes, that it's easy to get distracted by that. Mike Dunford is hardly alone in this mistake. I made the same mistake in a recent comment, even though I've known better for a long time.

The T/E test isn't that relevant here. The test at issue is the CIR test, which compares the amount of carbon-13 in testosterone metabolites to the carbon-13 in other metabolites. The assumption is that the two should be similar if they're produced in the same body. But in reality they're never exactly the same, and the farther apart they get, the less likely it is that the difference is natural.

Berry's main point is that WADA cannot back up the number they've picked as a threshold. They have no idea whether the false positives (from natural variation) at that threshold would be 1 in 50 or 1 in 10,000.

I looked into this, though. The threshold for too big a difference is 3 per mil (three parts per thousand). In one study WADA ran on dietary influences on carbon-13 ratios, they collected 125 samples from 5 non-doping subjects. Four samples contained pairs of metabolites at or above a 3 per mil difference, and two of those could have been interpreted as doping positives.

In 2005, WADA tested 12,000 samples from cyclists. So really, the question isn't whether you want to catch innocent athletes; it's how many innocents you're willing to catch. And Berry's primary concern is that WADA has no idea.
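Taking the commenter's numbers at face value, the implied scale is easy to compute (a crude extrapolation that ignores the confirmation and B-sample safeguards discussed above):

```python
# Apparent false-positive rate from the small diet study, scaled to the
# 2005 testing volume. Purely illustrative arithmetic.
study_positives, study_samples = 2, 125
rate = study_positives / study_samples  # 1.6%
print(rate * 12_000)                    # ~192 potentially flagged samples
```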

tom

By Thomas A. Fine (not verified) on 13 Aug 2008 #permalink