NOTE: Orac was actually out rather late last night. It turns out that the more administrative responsibility he somehow seems to find the more he has to go out to dinner as a part of various cancer center-related functions. As a result, he is recycling a bit of recent material from elsewhere that he in his extreme arrogance considers just too good not to post up on this blog too. In any case, it’s always interesting to see how a different audience reacts to his stuff, and he did make some alterations to this post.
‘Tis the season, it would seem, for questioning science. Not that there’s necessarily anything wrong with questioning science and how it is done. Certainly, right here on this very blog I’ve not infrequently pointed out problems with how science, particularly medical science, is done.
This time around, though, the challenge to science comes from an unexpected source in the form of an article in The New Yorker by Jonah Lehrer entitled The Truth Wears Off: Is There Something Wrong With the Scientific Method? Unfortunately, the full article is restricted only to subscribers. Fortunately, a reader sent me a PDF of the article; otherwise, I wouldn’t have bothered to discuss it. Also, Lehrer himself has elaborated a bit on questions asked of him since the article’s publication and published fairly sizable excerpts from his article here and here. In any case, I’ll try to quote as much of the article as I think I can get away with without violating fair use, and those of you who don’t have a subscription to The New Yorker might just have to trust my characterization of the rest. It’s not an ideal situation, but it’s what I have to work with.
The decline effect
I’m going to go about this in a slightly different manner than one might normally expect. First, I’m going to quote a few sentences near the end of the article right now at the beginning, because you’ll rapidly see why Orac might find them provocative, perhaps even a gauntlet thrown down. Before I do that, I should define the topic of the article, namely something that has been dubbed “the decline effect.” Basically, this is a term for a phenomenon in which initial results from experiments or studies of a scientific question are highly impressive, but, over time, become less so as the same investigators and other investigators try to replicate the results, usually as a means of building on them. In fact, Googling “the decline effect” brought up an entry from The Skeptic’s Dictionary, in which the decline effect is described thusly:
The decline effect is the notion that psychics lose their powers under continued investigation. This idea is based on the observation that subjects who do significantly better than chance in early trials tend to do worse in later trials.
In his article, Lehrer actually does cite paranormal research by Joseph Banks Rhine in the 1930s, whose testing of a self-proclaimed psychic demonstrated lots of “hits” early on, far more than were likely to be due to random chance. Rhine’s early results appeared to support the existence of extrasensory perception (ESP). However, as further testing progressed, the number of hits fell towards what would be expected by random chance alone, hence Rhine’s coining of the term “decline effect” to describe it. Lehrer spends the bulk of his article describing examples of the decline effect, discussing potential explanations for this observation, and, the part that rated a bit of Insolence–the Respectful kind, this time!–trying to argue that the effect can be generalized to nearly all of science. Longtime readers would probably not find all that much that is particularly irksome or objectionable in his article (well, for the most part, anyway); that is, until we get to the final paragraph:
Such anomalies demonstrate the slipperiness of empiricism. Although many scientific ideas generate conflicting results and suffer from falling effect sizes, they continue to get cited in the textbooks and drive standard medical practice. Why? Because these ideas seem true. Because they make sense. Because we can’t bear to let them go. And this is why the decline effect is so troubling. Not because it reveals the human fallibility of science, in which data are tweaked and beliefs shape perceptions. (Such shortcomings aren’t surprising, at least for scientists.) And not because it reveals that many of our most exciting theories are fleeting fads and will soon be rejected. (That idea has been around since Thomas Kuhn.) The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that’s often not the case. Just because an idea is true doesn’t mean it can be proved. And just because an idea can be proved doesn’t mean it’s true. When the experiments are done, we still have to choose what to believe.
As you might imagine, this passage rather irritated me with what appears on the surface to border on a postmodernist rejection of the scientific method as “just another way of knowing” in which we as scientists have to “choose what to believe.” Moreover, many of the examples provided by Lehrer certainly do seem to be curious examples of effect sizes declining in a variety of scientific areas over time as more work is done. What’s not quite as compelling is the way Lehrer, whether intentionally or inadvertently, gives the impression (to me, at least) of painting the decline effect as some sort of mysterious and unexplained phenomenon that isn’t adequately accounted for by the various explanations he describes in his article. He seems to paint the decline effect as a phenomenon that casts serious doubt on the whole enterprise of science in general and science-based medicine in particular, given that many of his examples come from medicine. In all fairness, Lehrer did later try to justify the way he concluded his article. To boil it all down, Lehrer equivocated by saying that all he meant by the above passage was that science is “a lot messier” than experiments, clinical trials, and peer review and that “no single test can define the truth.” Well, duh. (The snark in me might also say that science itself can’t actually define “The Truth.”) But if that’s all that Lehrer really meant, then why didn’t he just say so in the first place instead of going all postmodernist-y on us, as though science can’t ever make any conclusions that are any more valid than “other ways of knowing”?
So which examples does Lehrer choose to bolster his case that the decline effect is a serious and underrecognized problem in science? He uses quite a few, several from medical sciences (in particular psychiatry), starting the article out with the example of second generation antipsychotics, such as Zyprexa, which appeared to be so much more effective than older antipsychotics in earlier studies but whose efficacy has recently been called into question, as more recent studies have shown lower levels of efficacy, levels that are no better than the older drugs. Of course, Lehrer seems never to have heard of the “dilution effect,” whereby new drugs, once approved, are tried in larger and broader ranges of conditions and patients, in particular, in patients with milder cases of the diseases for which the drugs were designed. Over time, this frequently results in the appearance of declining efficacy, when in reality all that is happening is that physicians and scientists are pushing the envelope by testing the drugs in patients who are less carefully selected than patients in the early trials. No real mystery here.
Another example came from evolutionary biology, specifically observations on fluctuating asymmetry. This passage is taken from a blog post quoting Lehrer’s article:
In 1991, the Danish zoologist Anders Møller, at Uppsala University, in Sweden, made a remarkable discovery about sex, barn swallows, and symmetry. It had long been known that the asymmetrical appearance of a creature was directly linked to the amount of mutation in its genome, so that more mutations led to more “fluctuating asymmetry.” (An easy way to measure asymmetry in humans is to compare the length of the fingers on each hand.) What Møller discovered is that female barn swallows were far more likely to mate with male birds that had long, symmetrical feathers. This suggested that the picky females were using symmetry as a proxy for the quality of male genes. Møller’s paper, which was published in Nature, set off a frenzy of research. Here was an easily measured, widely applicable indicator of genetic quality, and females could be shown to gravitate toward it. Aesthetics was really about genetics.
In the three years following, there were ten independent tests of the role of fluctuating asymmetry in sexual selection, and nine of them found a relationship between symmetry and male reproductive success. It didn’t matter if scientists were looking at the hairs on fruit flies or replicating the swallow studies–females seemed to prefer males with mirrored halves. Before long, the theory was applied to humans. Researchers found, for instance, that women preferred the smell of symmetrical men, but only during the fertile phase of the menstrual cycle. Other studies claimed that females had more orgasms when their partners were symmetrical, while a paper by anthropologists at Rutgers analyzed forty Jamaican dance routines and discovered that symmetrical men were consistently rated as better dancers.
Then the theory started to fall apart. In 1994, there were fourteen published tests of symmetry and sexual selection, and only eight found a correlation. In 1995, there were eight papers on the subject, and only four got a positive result. By 1998, when there were twelve additional investigations of fluctuating asymmetry, only a third of them confirmed the theory. Worse still, even the studies that yielded some positive result showed a steadily declining effect size. Between 1992 and 1997, the average effect size shrank by eighty per cent.
And it’s not just fluctuating asymmetry. In 2001, Michael Jennions, a biologist at the Australian National University, set out to analyze “temporal trends” across a wide range of subjects in ecology and evolutionary biology. He looked at hundreds of papers and forty-four meta-analyses (that is, statistical syntheses of related studies), and discovered a consistent decline effect over time, as many of the theories seemed to fade into irrelevance. In fact, even when numerous variables were controlled for — Jennions knew, for instance, that the same author might publish several critical papers, which could distort his analysis–there was still a significant decrease in the validity of the hypothesis, often within a year of publication. Jennions admits that his findings are troubling, but expresses a reluctance to talk about them publicly. “This is a very sensitive issue for scientists,” he says. “You know, we’re supposed to be dealing with hard facts, the stuff that’s supposed to stand the test of time. But when you see these trends you become a little more skeptical of things.”
Jennions’ article was entitled Relationships fade with time: a meta-analysis of temporal trends in publication in ecology and evolution. Reading the article, I was actually struck by how small, at least compared to the impression that Lehrer gave in his article, the decline effect in evolutionary biology was estimated to be in Jennions’ study. Basically, Jennions examined 44 peer-reviewed meta-analyses and analyzed the relationship between effect size and year of publication; the relationship between effect size and sample size; and the relationship between standardized effect size and sample size. To boil it all down, Jennions et al concluded, “On average, there was a small but significant decline in effect size with year of publication. For the original empirical studies there was also a significant decrease in effect size as sample size increased. However, the effect of year of publication remained even after we controlled for sampling effort.” They concluded that publication bias was the “most parsimonious” explanation for this declining effect.
Personally, I’m not sure why Jennions was so reluctant to talk about such things publicly. You’d think from the quotes chosen by Lehrer for his article that scientists were all ready to come after him with pitchforks, hot tar, and feathers if he dared to point out that effect sizes reported by investigators in his scientific discipline exhibit apparent declines over the years due to publication bias and the bandwagon effect. Perhaps it’s because Jennions is not in medicine; after all, we’ve been speaking of such things publicly for a long time. Indeed, physicians generally expect that most initially promising results, even in randomized trials, will probably fail to ultimately pan out. In any case, those of us in medicine who might not have been willing to talk about such phenomena became more than willing after John Ioannidis published his provocatively titled article Why Most Published Research Findings Are False around the time of his study Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. Physicians and scientists are generally aware of the shortcomings of the biomedical literature. Most, but sadly not all of us, know that early findings that haven’t been replicated yet should be viewed with extreme skepticism and that we can become more confident in results the more they are replicated and built upon, particularly if multiple lines of evidence (basic science, clinical trials, epidemiology) all converge on the same answer. The public, on the other hand, tends not to understand this, while cranks tend to jump all over Ioannidis’ work as though it is somehow a lethal indictment of science-based medicine.
John Ioannidis, Jonah Lehrer, and the decline effect
The work of John Ioannidis, as discussed here numerous times before, provides an excellent framework to understand why effect sizes appear to decline over time. Although Ioannidis has been criticized for exaggerating the extent of the problem and even using circular reasoning, for the most part I find his analysis compelling. In medicine, in particular, early reports tend to be smaller trials and experiments that, because of their size, tend to be more prone to false positive results. Such false positive results (or, perhaps, exaggerated results that appear more positive than they really are) generate enthusiasm, and more investigators pile on. There’s often a tendency to want to publish confirmatory papers early on (the “bandwagon effect”), which might further skew the literature too far towards the positive. Ultimately, larger, more rigorous studies are done, and these studies result in a “regression to the mean” of sorts, in which the newer studies fail to replicate the large effects seen in earlier results. This is nothing more than what I’ve been saying time and time again, namely that the normal course of clinical research is to start out with observations from smaller studies, which are inherently less reliable because they are small and thus more prone to false positives or exaggerated effect sizes.
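For readers who like to see the arithmetic play out, here is a minimal simulation sketch of that reasoning. It is not taken from Ioannidis or Lehrer; the true effect size, the sample sizes, the number of studies, and the crude "publish only if positive and statistically significant" filter are all illustrative assumptions, and the function name is hypothetical. It simply compares the average "published" effect from many small studies with that from many large studies of the same modest true effect.

```python
import random
import statistics


def simulate_published_effects(true_effect, n_per_arm, n_studies, seed):
    """Simulate two-arm studies and return the effect estimates that pass
    a crude 'positive and significant' filter (an assumed stand-in for
    publication bias, not a model of any real journal)."""
    rng = random.Random(seed)
    # Standard error of a difference in means when each arm has unit variance.
    se = (2.0 / n_per_arm) ** 0.5
    published = []
    for _ in range(n_studies):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        diff = statistics.mean(treated) - statistics.mean(control)
        if diff / se > 1.96:  # only "significant" positive results get published
            published.append(diff)
    return published


TRUE_EFFECT = 0.2  # assumed modest true effect, in standard deviation units

small = simulate_published_effects(TRUE_EFFECT, n_per_arm=20, n_studies=1000, seed=1)
large = simulate_published_effects(TRUE_EFFECT, n_per_arm=500, n_studies=1000, seed=1)

mean_small = statistics.mean(small)
mean_large = statistics.mean(large)

print(f"true effect:                      {TRUE_EFFECT}")
print(f"small studies published: {len(small):4d}, mean effect {mean_small:.2f}")
print(f"large studies published: {len(large):4d}, mean effect {mean_large:.2f}")
```

Under these assumptions, the small studies that clear the significance bar have to overestimate the true effect in order to clear it at all, while the large studies land much closer to the truth. A literature that starts with the former and ends with the latter will look as though the effect is "wearing off," even though nothing about the underlying effect has changed.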
In his article, Lehrer blames in essence three things for the decline effect: publication bias, selective reporting, and the culture of science, which contributes to the proliferation of the first two problems. Publication bias has been discussed here on multiple occasions and in various contexts. Basically, it’s the phenomenon in which there is a marked bias towards the publication of “positive” data; in other words, negative studies tend not to be reported as often or tend to end up being published in lower tier, lower “impact” journals. To Lehrer, however, publication bias is not adequate to explain the decline effect because, according to him:
While publication bias almost certainly plays a role in the decline effect, it remains an incomplete explanation. For one thing, it fails to account for the prevalence of positive results among studies that never even get submitted to journals. It also fails to explain the experience of people like Schooler, who have been unable to replicate their initial data despite their best efforts.
This is what is known as being (probably) right for the wrong reasons. I would certainly agree that publication bias is probably an incomplete explanation for the decline effect, although I would be very curious about the prevalence of positive results among studies that never get submitted to journals; it’s pretty darned rare, in my experience, for positive results not to be submitted for publication unless there are serious flaws in the studies with positive results or some other mitigating circumstance takes hold, such as the death of the principal investigator, a conflict over the results between collaborating laboratories, or a loss of funding that prevents the completion of necessary controls or additional experiments. If Lehrer has evidence that shows that my impression that failure to publish positive results is rare is mistaken, he does not present it.
I would also argue that Lehrer is probably only partially right (and makes a huge assumption to boot) when he argues that publication bias fails to explain why individual investigators can’t replicate their own results. Such investigators, it needs to be remembered, initially published highly positive results. When they have trouble showing effect sizes as large and seemingly robust as their initial results, doubt creeps in. Were they wrong the first time? Will reviewers give them a hard time because their current results do not show the same effect sizes as their original results? They hold back. True, this is not the same thing as publication bias, but publication bias contributes to it. A journal’s peer reviewers are probably going to give an investigator a much harder time for a result showing a smaller effect size if there is published data from before that shows a much larger effect size; better journals will be less likely to publish such a result, and investigators know it. Consequently, publication bias and selective reporting (the investigator holding back the newer, less compelling results, knowing the lower likelihood of getting them published in a top tier journal) are intertwined. Other investigators, not invested in the original investigator’s initial highly positive results, are less likely to hold back, and, indeed, there may even be an incentive to try to disprove a rival’s results.
Lehrer makes a good point when he points out that there is such a thing as selective reporting, wherein investigators tend to be less likely to report findings that do not fit into their current world view and might even go so far as to try to shoehorn findings into the paradigm they currently favor. He even goes so far as to give a good example of cultural effects on selective reporting, specifically the well-known tendency of studies of acupuncture from China to be far more likely to report positive results than studies of acupuncture done in “Western” nations. He points out that this discrepancy “suggests that scientists find ways to confirm their preferred hypothesis, disregarding what they don’t want to see.” Or, as Simon and Garfunkel once sang in The Boxer, “a man hears what he wants to hear and disregards the rest.” It is not surprising that scientists would share this quality with their fellow human beings, but it is devilishly difficult to identify and quantify such biases. That, of course, doesn’t stop proponents of pseudoscience from crying “bias!” whenever their results are rejected by mainstream science.
There is one other potential explanation that Lehrer seems not to consider at all: Popularity. About a year and a half ago, I discussed a fascinating study that examined the effect of popularity on the reliability of the medical literature about a topic. The study, by Pfeiffer and Hoffmann, was published in PLoS ONE and entitled Large-Scale Assessment of the Effect of Popularity on the Reliability of Research, and its introduction lays out the problem:
In this context, a high popularity of research topics has been argued to have a detrimental effect on the reliability of published research findings. Two distinctive mechanisms have been suggested: First, in highly competitive fields there might be stronger incentives to “manufacture” positive results by, for example, modifying data or statistical tests until formal statistical significance is obtained. This leads to inflated error rates for individual findings: actual error probabilities are larger than those given in the publications. We refer to this mechanism as “inflated error effect”. The second effect results from multiple independent testing of the same hypotheses by competing research groups. The more often a hypothesis is tested, the more likely a positive result is obtained and published even if the hypothesis is false. Multiple independent testing increases the fraction of false hypotheses among those hypotheses that are supported by at least one positive result. Thereby it distorts the overall picture of evidence. We refer to this mechanism as “multiple testing effect”. Putting it simple, this effect means that in hot research fields one can expect to find some positive finding for almost any claim, while this is not the case in research fields with little competition.
I discussed the implications of this paper in my usual nauseating level of detail here. Suffice it to say, the more scientists there are working on a problem, the more false positives there are likely to be, but, as the field matures, there is a regression to the mean. Also, don’t forget that initial exciting results are often published in the “highest” impact journals, publication in which can really make a scientist’s career take off. However, because these results are the most provocative and might even challenge the scientific consensus strongly, they also have a tendency to turn out later to be wrong. Leaving out this aspect is a major weakness in Lehrer’s analysis, particularly given that each of the examples he provided could easily have a major component of the “popularity effect” going on.
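The “multiple testing effect” part of this is easy to put numbers on. The following is a minimal sketch, not anything from the Pfeiffer paper itself: it assumes each competing group tests the same false hypothesis independently at the conventional significance threshold of 0.05, in which case the chance that at least one group gets a “positive” result is 1 - (1 - alpha)^n. The function name and the group counts are hypothetical choices for illustration.

```python
def prob_at_least_one_false_positive(n_groups, alpha=0.05):
    """Probability that at least one of n_groups independent tests of a
    false hypothesis comes out 'significant' at level alpha.
    Independence and alpha=0.05 are illustrative assumptions."""
    return 1.0 - (1.0 - alpha) ** n_groups


# Compare a quiet field with increasingly "hot" ones.
for n_groups in (1, 5, 14, 50):
    p = prob_at_least_one_false_positive(n_groups)
    print(f"{n_groups:3d} independent groups: P(at least one false positive) = {p:.3f}")
```

With a single group the chance of a spurious positive is just the nominal 5%, but under these assumptions it passes a coin flip by the time fourteen groups are chasing the same idea. Real research groups are not fully independent and real thresholds vary, so treat this as a back-of-the-envelope illustration rather than an estimate for any actual field.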
The bottom line: Is science unreliable?
As I read Lehrer’s article, I was troubled. No, I wasn’t troubled because the implications of his article were somehow shaking my view of the reliability of science. I certainly wasn’t troubled by his discussing known problems with how science is practiced by fallible human beings, how it almost always isn’t done completely according to the idealized version of the scientific method taught to us in high school. After all, I’ve discussed the problems of publication bias and deficiencies in the peer review system seemingly ad nauseam. Rather, I was troubled by the final paragraph, quoted above, in which Lehrer seems to be implying, if not outright arguing, that science is nothing more than competing narratives between which scientists must choose, each of them not particularly well supported by data. Jerry Coyne nails it when he comments:
But let’s not throw out the baby with the bathwater. In many fields, especially physics, chemistry, and molecular biology, workers regularly repeat the results of others, since progress in their own work demands it. The material basis of heredity, for example, is DNA, a double helix whose sequence of nucleotide bases codes (in a triplet code) for proteins. We’re beginning to learn the intricate ways that genes are regulated in organisms. The material basis of heredity and development is not something we “choose” to believe: it’s something that’s been forced on us by repeated findings of many scientists. This is true for physics and chemistry as well, despite Lehrer’s suggestion that “the law of gravity hasn’t always been perfect at predicting real-world phenomena.”
Lehrer, like Gould in his book The Mismeasure of Man, has done a service by pointing out that scientists are humans after all, and that their drive for reputation–and other nonscientific issues–can affect what they produce or perceive as “truth.” But it’s a mistake to imply that all scientific truth is simply a choice among explanations that aren’t very well supported. We must remember that scientific “truth” means “the best provisional explanation, but one so compelling that you’d have to be a fool not to accept it.” Truth, then, while always provisional, is not necessarily evanescent. To the degree that Lehrer implies otherwise, his article is deeply damaging to science.
Indeed. I would argue that there is really no such thing as scientific “truth.” In fact, one thing I noticed right away in Lehrer’s article is that the examples he chose were, by and large, taken from either psychology, parapsychology, or ecology, rather than physics and chemistry. True, he did point out an anomalous experiment that was off by 2% in estimating the gravitational constant. Given how difficult it is to measure the gravitational constant and how many scientists have done it over the years, I was actually surprised that Lehrer could only find one example of an anomalous measurement. In addition, Lehrer did point out how most gene association studies with diseases thus far have not been confirmed and how different groups find different results, but finding such associations is something that is currently popular but not a mature field. According to the “popularity effect,” it is not surprising that there is currently a lot of “noise” out in the scientific and medical literature in terms of what gene expression patterns and SNPs correlate with what disease. Over the next decade, it is very likely that many of these questions and disagreements will be sorted out scientifically.
Finally, Lehrer’s view also seems not entirely consistent in some ways. I’ll show you what I mean. On his blog, as I mentioned before, Lehrer answers reader questions and expands upon his ideas a bit. A reader asks Lehrer, “Does this mean I don’t have to believe in climate change?” Lehrer’s response is, basically, that “these are theories that have been verified in thousands of different ways by thousands of different scientists working in many different fields,” which is, of course, true, but almost irrelevant given Lehrer’s previous arguments. After all, even though I accept the scientific consensus regarding anthropogenic global warming, if publication bias and selective reporting can so distort science for so long in other fields, I have to ask how Lehrer would say he accepts the science of global warming. One way is that he quite correctly points out that the “truths” of science (I really hate using that word with respect to science) depend upon the strength of the “web” supporting them, namely the number of interconnections. We say that here ourselves time and time again as arguments against pseudoscience such as, for example, homeopathy. However, if, as Lehrer seems to be arguing, scientists already put their results into the context of what is known before, isn’t he just basically arguing for doing what we are already doing, even though he has just criticized science for being biased by selective reporting driven by scientists’ existing preconceptions?
Although Lehrer makes some good points, where he stumbles, from my perspective, is when he appears to conflate “truth” with science or, more properly, accept the idea that there are scientific “truths,” even going so far as to use the word in the title of his article. That is a profound misrepresentation of the nature of science, in which all “truths” are provisional and all “truths” are subject to revision based on evidence and experimentation. The decline effect–or, as Lehrer describes it in the title of his article, the “truth wearing off”–is nothing more than science doing what science does so well: Correcting itself in its usual messy and glorious way.