We’re bombarded with numbers every day. But seeing a number and understanding it are two different things. Far too often, the true “significance” of a figure is hidden, unknown, or misjudged. I will be returning to that theme often in these blog posts in the context of water, climate change, energy, and more. In particular, there is an important distinction between accuracy and precision.
Here is one example – reported cases of cholera worldwide. Cholera is perhaps the most widespread and serious water-related disease, directly associated with the failure to provide safe drinking water and adequate sanitation. Billions of people lack this basic human right and suffer from illness as a result. Millions die unnecessary deaths.
The World Health Organization has reported that in 2011 (the last year for which comprehensive data are available) 58 countries reported 589,854 cases of cholera.
OK, I see that number, but what does it mean? Is it accurate? Is it precise?
Accuracy and precision are not the same things. In the field of science and data, “accuracy” is typically considered to be a measure of how close a number is to that quantity’s true value.
“Precision” is a term with two relevant meanings. The first describes the degree to which repeated efforts to do, or measure, something will produce the same results. The second meaning is a measure of the relative accuracy with which any given number can be represented, and is typically expressed through the use of “significant figures.”
Take, for example, the number 123. This has three significant figures. The implication is that the actual number is not 122 or 124, but 123 precisely, with a margin of error of a half of the last place (in this case 0.5). If the actual precision of measurement is not this small, then perhaps this number should be represented as 120 (with two significant figures), or even 100 (with only one significant figure).
[A minor aside: the number 100 could have 1, 2, or 3 significant figures – we don’t know unless it is stated explicitly. One way to do this is to use decimal notation. The number 100. (with the decimal point) has three significant figures, and can also be expressed as 1.00 x 10^2.]
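For readers who like to see this in code, here is a minimal Python sketch of rounding to a chosen number of significant figures. The helper name `round_sig` is my own invention for illustration, not a standard library function, and it is only one of several reasonable ways to do this.

```python
import math

def round_sig(value: float, figures: int) -> float:
    """Round value to the given number of significant figures (illustrative helper)."""
    if value == 0:
        return 0.0
    # Power of ten of the leading digit: 2 for 123, 5 for 589854.
    exponent = math.floor(math.log10(abs(value)))
    # Decimal places to keep; a negative count rounds to the left of the decimal point.
    decimals = figures - 1 - exponent
    return round(float(value), decimals)

print(round_sig(123, 2))   # 120.0 (two significant figures)
print(round_sig(123, 1))   # 100.0 (one significant figure)
print(f"{100:.2e}")        # 1.00e+02, i.e. 100 written with three significant figures
```

Note that the plain float 100.0 cannot tell you how many of its digits are significant; only an explicit format, such as the scientific notation on the last line, makes that claim visible.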
Any particular data can be accurate, precise, both, or neither.
So, back to cholera. This number of cases -- 589,854 -- seems very precise. It is reported to six significant figures – a very high degree of precision.
In fact, however, this number is an example of “false precision” – it is presented in a way (with six significant figures) that implies, incorrectly, a higher degree of both precision and accuracy than reality warrants.
Why? First, it is entirely possible that this number is exactly the sum (i.e., it is precise) of the number of cases of cholera reported to WHO by the 58 reporting countries. But experts on water-related disease note the following:
- Many countries around the world do not report water-related diseases at all. As noted above, in 2011 only 58 countries reported cholera. We know cholera occurred in countries not reporting.
- Most cholera outbreaks are not detected. Thus, even countries reporting cholera underreport.
- There is no agreed-upon standard definition for determining if a case of extreme or acute watery diarrhea is “cholera” or a different illness that presents the same way.
- Health surveillance systems (i.e., medical systems for tracking, recording, and reporting disease) vary dramatically from country to country in their quality and completeness.
- Some major countries, known to have extensive and severe cholera outbreaks, typically report zero instances of cholera because they either fear the stigma associated with failing to provide adequate water systems or they hide cholera cases by labeling them as something else (such as acute watery diarrhea).
Thus, this highly precise number is neither precise nor accurate. Indeed, it is grossly inaccurate. The WHO acknowledges this, and indeed, believes the officially reported cases could represent only a small fraction of the actual number that occurs. Taking these uncertainties into account, WHO estimates that there are as many as 10 times more cases than are actually reported. A more detailed statistical analysis recently suggested that overall there are around 2.8 million cases of cholera every year (with an uncertainty range of 1.2 to 4.3 million) and about 91,000 deaths (with an uncertainty range of 28,000 to 140,000).
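To put rough numbers on the gap, the short Python sketch below divides the estimated annual cases by the officially reported count. It assumes the annual estimate can be compared directly with the 2011 reported total, which is a simplification, and the variable and function names are mine, chosen only for illustration.

```python
reported_cases = 589_854       # reported to WHO by 58 countries in 2011
estimated_central = 2_800_000  # central estimate of annual cases
estimated_low = 1_200_000      # lower end of the uncertainty range
estimated_high = 4_300_000     # upper end of the uncertainty range

def underreporting_factor(estimated: float, reported: float) -> float:
    """How many times larger the estimate is than the reported count."""
    return estimated / reported

central = underreporting_factor(estimated_central, reported_cases)
low = underreporting_factor(estimated_low, reported_cases)
high = underreporting_factor(estimated_high, reported_cases)

# One decimal place is already generous given how uncertain the inputs are.
print(f"central: about {central:.1f} times the reported count")
print(f"range:   about {low:.1f} to {high:.1f} times")
```

Under these assumptions the estimate is roughly 5 times the reported figure, with a range of about 2 to 7 times. Quoting the ratio to more digits than that would repeat the false-precision mistake.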
So, beware misleading numbers. The officially reported estimates of cholera cases are neither precise (despite six significant figures), nor accurate.
Finally, there is another aspect to “significance.” That is the importance of the figure in some context. In this sense, the cholera numbers may be neither accurate nor precise, but they are significant. They tell the story of a horrible and unnecessary situation – a deadly, crippling, and preventable disease that is the result of our failure to provide safe water and sanitation to all the population on the planet. Cholera is completely preventable – we've effectively eliminated it in the United States and other industrialized countries by putting in place wastewater treatment and water purification systems. Let’s improve our data collection and reporting system, so we know, accurately, the extent of the problem, and then let’s move quickly to do what is necessary to reduce and eliminate cholera.
I disagree. WHO said "In 2011, a total of 58 countries from all continents reported 589 854 cases of cholera to WHO". The number of cases REPORTED is a direct count, and use of 6 significant figures is appropriate.
It's true that this number probably greatly underestimates the number of cholera cases that occurred, but WHO never claimed to be reporting that. The number they reported is both accurate and appropriately precise.
Yes, as I noted, the WHO number is certainly "precise" based only on the "reported" data, but it is neither precise nor accurate in the context of the overall cholera situation. I don't agree, however, that the number is "accurate" in any definition of the word!
Similarly, it's annoying when the press converts a rounded number (pounds, kilometers) to another unit (dollars, miles) and states every digit of "accuracy".
Great discussion; and absolutely important at both the conceptual and factual levels.
The key message is "Let’s improve our data collection and reporting system......to reduce and eliminate cholera." There is a need to think about changing the whole game (not just the rules of the game) in developing countries when it comes to data collection and reporting systems. Rural areas of developing countries have a significant population with marginal health infrastructure. Many institutions try to play with the data to show "progress" by presenting case counts on the low side. Others (including educated staff in rural health centers) think that reporting is just a cosmetic requirement and has nothing to do with any improvement. I could mention many factors that I noted during field work in such areas; however, two things are important for improving data collection and reporting systems (I am sharing these for discussion here):
1- Awareness at all levels (from causes and identification of such diseases to reporting)
2- Involvement of neutral institutions to collect such data (such as colleges, universities, and other research-based institutions, with some sort of mechanism depending upon local conditions), because many health units prefer to report figures showing "better progress" for their administration and the government.
My question is: what mechanism do you suggest to achieve the target of "improvement in data collection and reporting system" in both rural and urban areas of developing countries?
Rosie Redfield's objection is entirely valid — as written, the claim is most likely both very accurate and very precise. This being the case, then the issue becomes whether or not the reporting of that number, or its reporting in the context of civil discourse, is presented (implicitly or explicitly) as a claim about all outbreaks of cholera or the WHO reporting of cholera. And in that context, Gleick's criticisms are relevant.
First, surely Paulos's 1988 book, "Innumeracy" is extremely relevant to this blog within the wider context of education and civil discourse.
Second, "significance" is a term of art in mathematics and science beyond "significant figures", with nearly as much importance and more timeliness: statistical significance. Statistical significance is frequently misunderstood and misused within science itself and even more egregiously in civil discourse (science journalism, mostly). I suspect, though perhaps wrongly, that this isn't what Gleick had in mind for his blog, but it's just as vital and relevant. What statistician Andrew Gelman has written on the topic is a good start.
Great comment, thanks. And good hint about Paulos's book. I may at times talk about statistical significance -- it is certainly often misunderstood and misused. I've often thought that a required statistics course in high school would be far more useful in life than some of the standard high school math requirements...
I'm ambivalent about stats in high school. On the one hand, I'm strongly in favor of including some probability and intro statistics into the high school math curriculum for liberal arts purposes; i.e., as an essential part of a basic education and with regard to numeracy in the context of citizenship.
On the other hand, in terms of practical utility (particularly vocational), I've read quite a few complaints from working scientists about their undergraduate statistics coursework. (In that it was much less useful than it could have been, and misleading in important respects.) But the same can be said about, well, everything in the high school curriculum and, indeed, much of the undergraduate curriculum in most subjects.
At that point we're grappling with the conflict between vocational and liberal arts education at the secondary level — we've pretty much agreed in the US to strongly favor a liberal arts approach to secondary education (and as an end to itself), but particularly in the context of certain vocations (including science) there's always some discussion favoring a more practical, technical approach.
Personally, I'd favor an even stronger universal liberal arts approach at the secondary level with an additional year or even two, with a decreased liberal arts emphasis and increased technical emphasis at the undergraduate level. But I know that's never going to happen and we'll continue to muddle through with compromises that don't do any particular thing very well.
That said, my preferred approach would then include probability and stats in high school taught as part of a broad, liberal arts math education; and then much, much better (and more) stats education for all the undergraduates who will actually be doing statistics.