# Surprising and interesting things about diagnostic tests

The Center for Infectious Disease Research and Policy (CIDRAP) is a resource for all manner of information on infectious diseases and especially avian influenza. At their website one can find a technical overview which compiles a lot of bird flu information scattered over many sources. But it is a technical overview (although not overly specialized), and some of the entries may not be self-evident even to physicians. We’ve selected one example because it interests us and we think it might interest others. It’s just a couple of sentences but will seem counterintuitive to many.

Laboratory tests do not need to be conducted on all patients with suspected influenza. Factors that influence the decision to test or not test patients with signs and symptoms of influenza include:

[snip]

Level of influenza activity in the community: The positive predictive value of influenza tests, especially rapid assays, increases with prevalence of influenza in the community; therefore, if the prevalence of influenza is low, the utility of the tests decreases. As influenza prevalence increases, the predictive value of clinical diagnosis without laboratory testing also increases and laboratory confirmation may not be necessary. (cite omitted)

Translated, this says the positive predictive value (PPV) of influenza tests is better when there is more flu in the community than when there is less flu. If the disease is rare in the community (as is H5N1 in the absence of a pandemic, even in hard hit countries like Indonesia), then the predictive value of the test is greatly decreased. In an outbreak setting, however, you probably don’t need the test. Seems like a paradox. How does this work? It is really an elementary application of Bayes’ Theorem in probability theory but it doesn’t take advanced mathematics to understand it. It’s basically arithmetic.

Test performance has two dimensions, accuracy and reliability, common English words with technical definitions in epidemiology. Accuracy measures how well your test reflects the true value of whatever it is you are measuring. Reliability, on the other hand, is a measure of repeatability. If you perform exactly the same measurement again, will you get the same result? This is sometimes called precision.

Accuracy and reliability are different, as you can easily see. A broken meter is quite reliable because it gives the same reading each time you measure something, but it isn’t accurate, since it isn’t related to what it is measuring. It is reliably wrong. On the other hand, a meter whose readings aren’t very repeatable (its measurements bounce around a lot) may still be sufficiently accurate for some purposes if on average it gives the right answer. If you think of shooting arrows at a target, accurate shots are grouped around the bullseye, although they may be scattered, while precise or reliable shots are grouped closely together (but not necessarily on the bullseye). You would like both accuracy and reliability, of course, but how much of each you need depends on the purpose.

Let’s make this easier and talk only about accuracy and make it easier still by considering a test that has only two readouts, one positive and one negative, like a flu test that tells us if the person has influenza or not. Accuracy now means that when a person has the flu the test correctly tells us so, and when they don’t have the flu, the test correctly tells us they don’t. While this is a simple case on the surface, you can also see there are two ways for a test to fail. It can say someone has the flu when they don’t, or say they don’t have the flu when they do. So there are two corresponding measures of accuracy, one called sensitivity, the proportion of those who have the flu the test correctly identifies, and the other called specificity, which is the proportion of those that don’t have the flu the test correctly records as not having the flu. Note that sensitivity and specificity are related to, but not identical with “false positives” and “false negatives.” A false positive is (1 – specificity) while a false negative is (1 – sensitivity) [NB: Correct versions, as per correction, bottom of post.]. We will stick with sensitivity and specificity in this analysis, although they can be directly converted to false negatives and false positives. One way to remember what “sensitivity” refers to is to think of a sensitive test as one that is sensitive at picking up a disease when it’s there.
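To make the two definitions concrete, here is a minimal sketch that computes them from counts. The counts are made up for illustration (they are not from any real flu test), and the function names are our own.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of people WITH the disease that the test calls positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of people WITHOUT the disease that the test calls negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical results: 100 people with flu, 200 people without.
tp, fn = 90, 10    # of the 100 with flu: 90 test positive, 10 are missed
tn, fp = 180, 20   # of the 200 without flu: 180 test negative, 20 are flagged

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

# The two error rates are the complements of the two accuracy measures.
false_negative_rate = 1 - sens   # misses among those who have the flu
false_positive_rate = 1 - spec   # false alarms among those who don't

print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
print(f"false negative rate={false_negative_rate:.2f} "
      f"false positive rate={false_positive_rate:.2f}")
```

Note that each measure is computed within one group only: sensitivity never looks at the people without flu, and specificity never looks at the people with it.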

We need one more concept and term, predictive value. There is both a positive and negative predictive value but we will only consider positive predictive value, or PPV. The PPV of a test is the proportion of people the test says have the flu that really do have the flu. You need to stop and think about this for a moment. It sounds like sensitivity, but it isn’t. PPV is the thing most people want to know. Here’s why. Sensitivity asks the question, if you have the flu, how likely will my test be able to tell me so. PPV, on the other hand, answers this question: if I have a positive test, how likely am I to have the flu? These are drastically different questions and provide the clue to why the amount of circulating flu in the community affects the performance of a test as measured by the PPV.

Let’s recap, using a cancer screening test as an example instead of flu (it is easier to visualize). The sensitivity of the test (say some new blood test) is the probability the test will pick up a true case of cancer. Fine. Important question. But if a patient gets the test and it is positive, he or she wants to know what that means for them, i.e., does it mean they are likely to have cancer? The PPV is the probability that you actually have cancer if the test says you do.

Here’s an example. You go to the doctor and she gives you a highly sensitive and highly specific new cancer test. Let’s say it’s 99% sensitive and 99% specific. In other words, highly accurate, a lot better than most rapid flu tests. The test comes back positive. Oh, oh, you think. I have cancer. Better make my will.

Not so fast. For most cancers your chance of actually having cancer if this very accurate test says you do is less than 10%, and usually much less. The proportion of people in the general population with any particular kind of cancer (e.g., lung cancer) is very small, typically less than one in 10,000. Let’s work this out using one in 10,000. If you give the test to one million people, 100 of them will have lung cancer (one in ten thousand times one million people). Your test is highly sensitive so it will correctly pick out 99 out of these 100 cancers. So far, so good. Your test is also highly specific, so it will correctly identify 99% of those without cancer as free of the disease. But since most people don’t have cancer, the remaining 1% of a large population is a lot of people, i.e., it will also misidentify many as well. In this example, 999,900 people out of a million don’t have lung cancer and of these, the test correctly labels 99% of them as not having cancer, or .99 times 999,900. But 1% will be misidentified as having cancer when they don’t. That’s 9,999 people the test said had cancer when they didn’t. In all, the test identified 99 + 9,999 = 10,098 people as having cancer, of whom only 99 actually did, or just under 1%. What this means is that even with an extraordinarily accurate test (99% sensitive and 99% specific), if the doctor tells you you have a positive test, your risk of actually having lung cancer is still only about 1%. The next step would be to run more expensive or invasive tests to confirm or disconfirm the initial screening result.
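If you would rather let a computer do the arithmetic, the same head count can be sketched in a few lines. The function name and its parameters are our own invention; the inputs are the ones used in the example above.

```python
def screen_population(population, prevalence, sensitivity, specificity):
    """Count true and false positives when everyone in a population is tested."""
    diseased = population * prevalence
    healthy = population - diseased
    true_pos = diseased * sensitivity          # cases the test catches
    false_pos = healthy * (1 - specificity)    # healthy people wrongly flagged
    ppv = true_pos / (true_pos + false_pos)    # share of positives that are real
    return true_pos, false_pos, ppv


# One million people, lung cancer in 1 of every 10,000,
# and a test that is 99% sensitive and 99% specific.
tp, fp, ppv = screen_population(1_000_000, 1 / 10_000, 0.99, 0.99)
print(f"true positives:  {round(tp):,}")
print(f"false positives: {round(fp):,}")
print(f"PPV: {ppv:.2%}")
```

The false positives swamp the true positives by roughly a hundred to one, which is the whole story in one line of output.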

The culprit here is easy to identify. Sensitivity and specificity are features of the test, but the PPV also involves how common the condition is in the population. If 50% of the population had the condition, then things change drastically. With the same 99% sensitive and 99% specific test, the PPV is now 99%, not 1%. Thus the PPV is much higher (and thus more informative) with a higher proportion of the population affected than if the condition is rare in the population. This is the source of the innocent sounding statement in the CIDRAP overview:
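For those who like to see it in symbols, this is Bayes’ Theorem applied to a positive test. The letters are our own shorthand, not CIDRAP’s: $p$ is the prevalence, $Se$ the sensitivity and $Sp$ the specificity.

$$
\mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)\,(1 - p)}
$$

The numerator is the true positives and the second term in the denominator is the false positives, each as a fraction of everyone tested. With $Se$ and $Sp$ fixed by the test, the only thing left to move the PPV is $p$. As $p$ shrinks toward zero the false positive term dominates and the PPV falls toward zero with it.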

Level of influenza activity in the community: The positive predictive value of influenza tests, especially rapid assays, increases with prevalence of influenza in the community; therefore, if the prevalence of influenza is low, the utility of the tests decreases. As influenza prevalence increases, the predictive value of clinical diagnosis without laboratory testing also increases and laboratory confirmation may not be necessary.

This is one of the main reasons confirmatory tests are needed when testing is done before an outbreak is underway, and why testing is often not done at all when there is an outbreak. While quick and cheap tests are usually the least accurate, they are useful to reduce things to a higher yield subpopulation for more expensive and time consuming tests. Taking the example above, the initial screening test reduced the original population of 1,000,000 to 10,098 with a cancer prevalence of 1% instead of .01%. Now the PPV for a 99% sensitive and specific test (now a different one) is about 50%. Notice however that unless your test is 100% sensitive you will miss some cases, i.e., you will have some false negatives.
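The two-step logic can be sketched as follows. The key observation is that the prevalence among those who test positive on the first test is just the first test’s PPV. The sketch assumes the second test’s errors are unrelated to the first test’s errors (given true disease status), which is our simplifying assumption; the function name is also ours.

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value via Bayes' Theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Stage 1: screen the general population (1 case per 10,000 people).
after_first = ppv(1 / 10_000, 0.99, 0.99)

# Stage 2: give a second, different 99%/99% test only to the first-stage
# positives. Their prevalence is the PPV of the first test.
after_second = ppv(after_first, 0.99, 0.99)

print(f"PPV after screening test:    {after_first:.1%}")
print(f"PPV after confirmatory test: {after_second:.1%}")
```

Each positive result raises the odds that the next positive is real, which is why a cheap screen followed by a better (and costlier) confirmation is such a common design.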

In practice the sensitivity and specificity of a test can be adjusted up or down by changing the threshold for what is called a positive test, but when you do so you usually trade one off for the other, i.e., if you lower the threshold for what is called a positive you increase your sensitivity but you will likely decrease your specificity. How you balance the two is one of the arts of diagnosis and screening and will depend on the costs in money and public health terms of false positives versus false negatives. Remember you can always devise a test that is 100% sensitive (just say every tested subject is positive) or 100% specific (just say every subject is negative), but usually not both at the same time, although in some cases there are means for a definitive diagnosis (e.g., an autopsy). Such tests are 100% sensitive and 100% specific. For routine diagnostic testing that is unusual, however.
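Here is a toy illustration of that trade-off. The scores are invented (imagine some lab readout where higher means more likely diseased), and a result counts as positive when the score is at or above the threshold.

```python
# Hypothetical test readouts; the two groups overlap, as they do in real life.
diseased_scores = [4, 5, 6, 7, 8, 9]
healthy_scores = [1, 2, 3, 4, 5, 6]


def sens_spec(threshold):
    """Sensitivity and specificity when 'positive' means score >= threshold."""
    sens = sum(s >= threshold for s in diseased_scores) / len(diseased_scores)
    spec = sum(s < threshold for s in healthy_scores) / len(healthy_scores)
    return sens, spec


for threshold in (1, 4, 5, 7, 10):
    sens, spec = sens_spec(threshold)
    print(f"threshold {threshold:>2}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

At the lowest threshold everyone is called positive (perfect sensitivity, no specificity) and at the highest everyone is called negative (the reverse). In between, every gain in one is paid for with a loss in the other because the two groups’ scores overlap.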

This is probably more than most of you wanted to know about this subject, although it will be just right for some of you. We only hope there are enough in the latter category to have made this long post worthwhile.

Correction: Amico, in the comments, offers two corrections, the first of which I have now made to the text so it is correct. I inadvertently reversed the false positive and false negative expressions. The former is (1-specificity) and false negatives are (1-sensitivity).

The other correction Amico characterized as a quibble. He would prefer I say that sensitivities and specificities are probabilities instead of proportions. Technically this is correct. In practice we use the proportion as a way to estimate the underlying probability. I wrote it the way I did because it is difficult enough to keep these terms straight for most students without using the word probability and I think for these purposes the easiest way to think of it is as a proportion, although strictly speaking Amico is correct. However nothing relies on this distinction here, which is why I assume he said it was a quibble.

June 22, 2006

This is a very clear review of these concepts. Thank you, Revere, for posting it.

2. #2 Daniel Millstone
June 22, 2006

Good essay. These are tough concepts, clearly explained. I only hope I can lay my hands on it when I need it. Thanks.

June 22, 2006

Thank you, Revere. Yes it was WELL worth it.

4. #4 Marissa
June 22, 2006

Revere, if the PPV varies so much with prevalence it doesn’t say much good about the accuracy of the H5N1 test itself. Most of the PCR tests in the other diseases I work with have PPVs around 94-96% and no variance with prevalence at all.

5. #5 revere
June 22, 2006

Marissa: Unless sensitivity and specificity are both 100% or both zero, the PPV MUST vary with prevalence. This is forced by Bayes’ Theorem. The accuracy of a test is measured by the sensitivity and specificity. If you have such high PPVs it says, at the least, that you have high prevalence in your sample universe and high specificity in your test. It has no implications for the sensitivity of your test. You might be achieving this at the expense of Negative Predictive Value (although not necessarily).

How do you know what the PPV of your tests are? What is your gold standard?

6. #6 Kevin
June 22, 2006

I would think that prevalence would need to be modified by the selection criteria applied to choose those tested.
Prevalence in general population might be low, but test may be applied to subset of population with higher prevalence. In your cancer example, suppose the test isn’t given to 1M, it’s given to 20K who have risk factors or symptoms. Prevalence of cancer in this population is much higher than 1 in 10k.
This would have implications for current H5N1 testing since selection of test subjects is limited to those with symptoms.

7. #7 revere
June 22, 2006

Kevin: You are absolutely right. This is the reason that screening is often aimed at “high risk” populations. It isn’t only because of the cost per pickup but the PPV. Now take Indonesia. It has, in West Java, 186,000 pneumonias a year, roughly. Very few are H5N1. The same is true in this country for respiratory infections. In the summer time, few are influenza A but during an outbreak the prevalence is high. Hence the PPV of exactly the same test, identical accuracy, is low in the summer and high during an outbreak. In fact it is so high, it is frequently uninformative so we don’t test.

8. #8 slovenia
June 22, 2006

Brilliant and exciting post. I want to know not just the ‘what’ but also the ‘why’ and the ‘how’ and you give ’em to me. I share your technical posts with my Slovenian pediatrician of a wife and she is delighted as well. Thank you very much.

9. #9 Panic
June 22, 2006

No test (in ANY field) has 100% sensitivity and specificity, so the PPV always varies with the prevalence of the phenomenon you are testing for. The only question is regarding the magnitude of this variation. In general good test characteristics (high sensitivity & specificity) and high disease prevalence result in less variation in PPV with changes in prevalence. Likewise, poor test characteristics and lower prevalence typically result in more variation in PPV with prevalence changes.

10. #10 revere
June 22, 2006

Panic: If you are talking in terms of absolute certainty, you cannot be absolutely certain of anything. But some tests have effectively 100% specificity and sensitivity, for example body temperature as a test of being alive. You can concoct an example of hibernation or cryopreservation or whether being alive and braindead is being alive, but this and many other things we do we treat as though they have 100% sensitivity and specificity with impunity for certain purposes.

The PPV varies because of variation in prevalence if the test stays the same. We could argue about this for a long time (what does it mean for a test to stay the same, for example) but I think the point is clear. Usually you cannot choose your sensitivity and specificity independently, however. When one goes up, the other (usually) goes down. You cannot make them both go up at once unless you change the test. It is also not true that as you trade one off for the other that PPV always goes in a fixed direction (i.e., that by raising specificity the PPV must also go up). It depends functionally how they trade off against each other. I have shown this mathematically elsewhere.

11. #11 Marissa
June 22, 2006

Revere, yes, I should have said little instead of no variance–I need to be more awake! On the gold standard… this extract for example on C. trachomatis, one of the bugs I work with, from Solomon et al, CLINICAL MICROBIOLOGY REVIEWS, Oct. 2004, p. 982–1011:

Although cell culture is considered the gold standard for laboratory diagnosis, it is now accepted that isolation of C. trachomatis in cell culture is less than 100% sensitive (14, 43, 186, 189, 193).

[snip]

Cell culture has long been regarded as the gold standard of Chlamydia diagnosis because its specificity is thought to be nearly perfect. Its sensitivity is known to be imperfect. True infection status has therefore been impossible to determine, and numerical values assigned to the specificity or sensitivity of diagnostic tests vary significantly from one estimate to the next.

Despite this, such values are routinely quoted. Newer tests, such as the nucleic acid amplification tests, are believed (for biological reasons) to be more sensitive than cell culture. Investigators comparing the performance of these tests against that of culture assume that at least some of the apparent false positives (by the new test) are actually true positives that have been missed by culture. To estimate the performance characteristics of the new test, these apparent false positives are often evaluated with discrepant analysis.

In this procedure, gold-standard-negative, new-test-positive samples are further tested by one or more other appeal assays; the return of one or more positive results from these assays labels the sample a true positive. In evaluating the nucleic acid amplification tests for C. trachomatis, the appeal tests used have typically been other nucleic acid amplification tests, often with a different method of amplification or a different target nucleic acid sequence (83, 87, 144).

12. #12 tano6
June 22, 2006

IMO using tests depends on the goal you have in what you want to detect, and not on getting a high PPV.

http://www.news.gov.hk/en/category/healthandcommunity/060617/txt/060617en05003.htm

“June 21, 2006; Health
9 more pneumonia cases reported
The Hospital Authority has received 9 pneumonia cases of unidentified cause, involving 7 men and 2 women who visited Guangdong, Hunan, Hubei, Zhejiang and Fujian provinces before the onset of symptoms. The figure brings the total number of reports to 90 since an enhanced surveillance programme began june 15.
Public hospitals are providing rapid tests for these patients, and the cases have been reported to the Centre for Health Protection.
The Hospital Authority boosted its surveillance programme, following notification from the Guangdong Health Department of a suspected case of avian flu in Shenzhen.
Under the programme, public hospitals report to the authority’s electronic flu system all patients who have pneumonia of an unidentified cause and who had travelled to affected areas of countries with confirmed human cases of bird flu in the seven days before the onset of symptoms.”

Am I right when I say HongKong here has a higher yield subpopulation (pneumonias within one week after visiting certain provinces in China) where they use tests to EXCLUDE H5N1? Can it be these patients have already been tested on bacterial pneumonias, before they came in this surveillance program?

How about treating patients with antibiotics, that maybe help in the first 3 days, as a screening on having a viral versus a bacterial infection?

I suspect these ‘quick tests’ are not used in the first place to know how to treat these patients, but to explore whether an epidemic has been caused by that truck driver in Shenzhen.

I’d like to know some more about the tests used in the public hospitals.

June 22, 2006

Well done!

Epi was always interesting, biostats sometimes, put together with real life stuff and thoughtful prose – very fine!

14. #14 Joseph Brombach M.D.
June 22, 2006

Test properties are absolutely essential to understand whenever you order ANY test — not just a flu test — from a fungus test to a heart attack test.

These are principles that I spend a lot of time explaining, in very concrete terms, to my patients when they come in asking for ‘simple’ tests.

If you don’t understand this stuff, you really can’t be an active participant in any of your own medical decisions. Surprised?

15. #15 Lenn
June 23, 2006

This is an excellent post. Thank you very much!

16. #16 Panic
June 24, 2006

REVERE: My point is that EVERY test requires the user to exercise at least a modicum of Bayesian inference. Beware the investigator (or the clinician) that does not consider the results of a test in light of the pretest probability of a positive (which relies on knowledge of prevalence). The perfect tests you’re suggesting are few and far between, and generally only behave “perfectly” within a defined subset of possible cases. It’s also important to be aware that test characteristics can vary with a given test from population to population. For example, cardiac stress tests which employ treadmills to test for reversible ischemia are known to have better sensitivity for disease among men than women. These effects are beyond differences in disease prevalence between the genders.

As far as having to trade sensitivity and specificity off against each other when using a given test, that’s probably best illustrated with a receiver operating characteristic curve.

17. #17 revere
June 24, 2006

Panic: We don’t disagree. Your point that the accuracy of a test may vary with what (who) you are measuring is a good one. It is also true (at least I think so) that scientists are also spontaneous Bayesians (as are most people). Most statisticians are frequentists, however.

Regarding trading sensitivity and specificity off against each other, the ROC curve is the traditional means for depicting this. What it doesn’t show, however, is that the effect of the trade-off on PPV or NPV depends on the shape of that dependency. You can try it for yourself using linear, supralinear and sublinear dependencies and you will see that the effects on PPV and NPV are qualitatively different.

18. #18 Amico
June 25, 2006

Minor quibble and a flip: To quote you,
“Note that sensitivity and specificity are related to, but not identical with “false positives” and “false negatives.” A false positive is (1 – sensitivity) while a false negative is (1 – specificity).”
First, the quibble: sensitivity and specificity are probabilities–so,it would be more proper to say, “The probability of a false positive is…etc”.
Second, you have them the wrong way around. To wit, “false positives” are those falsely identified as positives. So the probability of this is (1-specificity). And “false negatives” are those who are falsely thus identified, and so the probability of this is (1-sensitivity).

19. #19 revere
June 25, 2006

Grazie mille, Amico: You are quite correct on both scores. I’ll append a correction.

20. #20 Marissa
June 25, 2006

Note Henry’s recent posting in which he thinks a recombination event recently occurred in the father of the large Karo cluster. I also remember the problems in confirmation of H5 at Peiris’s (?) HK lab from some recent Indo samples. I’m wondering if there’s another flu strain in play here somewhere.

21. #21 revere
June 25, 2006

Marissa: Henry’s source is the same as mine. He is heavily interpreting the data, which is from Peiris’s lab. Whether he will be right about it or not remains to be seen. I am taking a look at it with some bioinformatics folks. Since we don’t have the sequences, only a summary of the changes he has noted, I’m not sure what we will be able to tell.