A bunch of people I follow on social media were buzzing about this blog post yesterday, taking Jonah Lehrer to task for “getting spun” in researching and writing this column in the Wall Street Journal about this paper on the “wisdom of crowds” effect. The effect in question is a staple of pop psychology these days, and claims that an aggregate of many guesses by people with little or no information will often turn out to be a very reasonable estimate of the true value. The new paper aims to show the influence of social effects, and in particular, that providing people with information about the guesses of others leads them to revise their guesses in a way that can undo the “wisdom of crowds” effect completely– you can get tight clusters of guesses around wrong values for purely social reasons.
The blog post is very long and takes quite a while to come around to the point, which is really two related claims: first, that Lehrer cherry-picked the example he used in his column, and second, that Lehrer was deceived by unjustifiable claims made about the research. The whole argument centers on this table:
The cherry-picking charge is, I think, on the mark– Lehrer used the median value for the third question as his illustration, because that particular aggregate value of the guesses falls within 1% of the true value. However, it’s the only one of the 18 aggregate values to get that close– in fact, it’s the only one to come within 10% of the true value. Using it as the only example is a little sleazy (though not really outside the bounds of normal journalistic practice, which says more about journalistic practice than anything else).
As for the other claim… Well, as I said on Facebook last night, the author of the post, Peter Freed, lost me completely when he wrote “The geometric mean (circled in blue, below) – whatever in God’s name that is.”
This puts me in a slightly odd position, I realize, as I have often expressed skepticism about statistical hair-splitting in psychology experiments. In fact, another much-derided piece by Lehrer on the “decline effect”– which he describes as an apparent loss of validity of scientific studies over time– struck me as probably the result of taking some fairly arbitrary statistical measures a little too seriously.
In this case, though, I think the criticism of the research, and thus of Lehrer for taking it too seriously, is off base. And the manner in which it is expressed really rubs me the wrong way.
For one thing, the geometric mean is a fairly standard technique in statistical analysis for dealing with certain kinds of data. It’s something to be used cautiously– as Bill Phillips noted once in a group meeting, one second is the geometric mean of a nanosecond and thirty years (or whatever choice of absurdly large and small values you prefer– an attosecond and the age of the universe, or whatever)– but it’s a perfectly legitimate characterization of certain types of distributions.
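Bill’s quip is easy to check. A minimal sketch (the thirty-years figure is just the back-of-the-envelope value from the anecdote):

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

nanosecond = 1e-9                        # seconds
thirty_years = 30 * 365.25 * 24 * 3600   # seconds, roughly 9.5e8

# The geometric mean of these two wildly different timescales
# really does come out close to one second.
print(geometric_mean([nanosecond, thirty_years]))
```

That the answer lands near one second no matter how absurdly far apart the inputs are is exactly why the geometric mean needs to be used with caution– and also why it tames data spanning many orders of magnitude.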
If you’re going to take people to task for abusing statistics, you really ought to know what the geometric mean is and when it’s appropriate. Brushing it off with a “whatever in God’s name that is” comes off as just a nicely dressed version of the “just plain folks” anti-intellectualism of groups like the Tea Party, rejecting overly eggheaded science.
It’s particularly bad because the geometric mean is defined in the paper. Twice, once in the Results section, and once in the Methods and Materials section. Both of these mentions also include detailed justifications of why it’s an appropriate choice. Freed does airily dismiss the Introduction and Conclusion sections as too spin-laden to be worth reading, but evidently didn’t look all that closely at the sections he claimed to focus on, either.
Since this is the point on which the whole argument turns, Freed’s proud ignorance of the underlying statistics completely undermines everything else. His core argument is that the “wisdom of crowds” effect is bunk because the arithmetic mean of the guesses is a lousy estimate of the real value. Which is not surprising, given the nature of the distribution– that’s why the authors prefer the geometric mean. He blasts Lehrer for using a median value as his example, without noting that the median values are generally pretty close to the geometric means– all but one are within 20% of the geometric mean– making the median a not-too-bad (and much easier to explain) characterization of the distribution. He derides the median as the guess of a “single person,” which completely misrepresents the nature of that measure– the median is the central value of a group of numbers, and thus while the digits of the median value come from a single individual guess, the concept would be meaningless without all the other guesses. Median values are often the more appropriate choice for data that are unbounded on the high end, and thus tend to be skewed by outliers– as noted in the comments, the original “wisdom of crowds” paper a century ago used the median value as its aggregate guess. And, as the authors note at the end of the Methods and Materials section, the median should be equal to the geometric mean for a log-normal distribution, so it is a perfectly reasonable choice to characterize their data.
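The relationship between these three measures for log-normal data is easy to demonstrate with a toy simulation (the parameters here are made up for illustration, not taken from the paper):

```python
import math
import random

random.seed(42)
# Hypothetical stand-in for a guessing distribution: log-normal with a
# true median of e^mu = 1000 (mu and sigma are invented for this sketch).
mu, sigma = math.log(1000), 1.5
guesses = sorted(random.lognormvariate(mu, sigma) for _ in range(10_000))

median = guesses[len(guesses) // 2]
geo_mean = math.exp(sum(math.log(g) for g in guesses) / len(guesses))
arith_mean = sum(guesses) / len(guesses)

# Median and geometric mean land close together (near 1000);
# the arithmetic mean is dragged far higher by the long upper tail.
print(f"median:          {median:,.0f}")
print(f"geometric mean:  {geo_mean:,.0f}")
print(f"arithmetic mean: {arith_mean:,.0f}")
```

The arithmetic mean comes out roughly three times the other two here, for no reason other than the skew of the distribution– which is precisely why judging the crowd by its arithmetic mean is the wrong test.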
Even the claim that the data fail to show a “wisdom” effect (“Every single question, the arithmetic mean, and really even the geometric mean, was from a human standpoint wrong, wrong, wrong, wrong, wrong and wrong. The end.”) is off the mark for statistical reasons explained in the article. Though I prefer the description offered by one of Freed’s commenters, who makes an analogy to a “Fermi problem.” Given the open-ended nature of the questions, and the fact that they are by design questions that people in the sample wouldn’t have expert knowledge of, the guesses span many orders of magnitude, so you’re doing pretty well if you can get the answer to within a factor of 10. For this sort of process, the geometric mean is an appropriate choice of aggregate measure, and the fact that all the aggregated guesses come within a factor of 3 (and most within a factor of 2) is reasonably impressive.
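To see the Fermi-problem point concretely, here’s a sketch with an invented question whose true answer is 10,000 and invented guesses spanning four-plus orders of magnitude (none of these numbers come from the paper):

```python
import math

# Made-up Fermi-style data: a true value and guesses from people
# with no expert knowledge, spread over many orders of magnitude.
true_value = 10_000
guesses = [50, 300, 1_000, 2_000, 8_000, 15_000, 40_000, 200_000, 1_000_000]

geo_mean = math.exp(sum(math.log(g) for g in guesses) / len(guesses))
arith_mean = sum(guesses) / len(guesses)

def factor_off(estimate, truth):
    """How many times too high or too low an estimate is (always >= 1)."""
    ratio = estimate / truth
    return max(ratio, 1 / ratio)

print(f"arithmetic mean off by a factor of {factor_off(arith_mean, true_value):.1f}")
print(f"geometric mean off by a factor of {factor_off(geo_mean, true_value):.1f}")
```

With guesses this scattered, the arithmetic mean is off by more than a factor of ten– it’s basically just the biggest guess divided by the number of guessers– while the geometric mean lands within a factor of two of the truth. By the “wrong, wrong, wrong” standard both fail; by the Fermi standard, the geometric mean does remarkably well.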
So, essentially, Freed’s post is a long discussion of the dangers of being “spun” through not carefully considering scientific data that is based largely on not reading the underlying paper very carefully. The authors actually address most of the issues Freed raises, and have solid arguments for why they did what they did with their data. He doesn’t engage with their arguments at all, preferring to pontificate grandly about how it’s all just spin.
Moreover, this table isn’t even the point of the research– it’s just a summary of the starting conditions. The actual research looks at how the initial measures of “wisdom” change as the information available to the guessers changes, and when they look at that, they see a fairly clear narrowing of the range toward results that aren’t necessarily better than the initial guesses– in fact, the final answers are generally worse. At the same time, the self-reported confidence in the answers increases, as the guesses all converge toward the same (wrong) value. That’s the core finding, which is exactly as Lehrer describes it in the article. You can argue about how significant this effect really is– but again, if you’re going to do that, you need to engage with what they actually wrote.
So, despite all the social-media buzz about this (which I suspect has more to do with the way the conclusion plays into people’s pre-existing negative opinions of the mainstream press), I find myself deeply unimpressed. It’s an argument about the misuse of statistics by someone who proudly proclaims ignorance of some of the key statistical techniques used in the research. And it’s an argument about the dangers of not reading research carefully enough by someone who apparently hasn’t read the relevant paper very carefully.