Download Counts Predict Future Impact of Scientific Papers

The gold standard for measuring the impact of a scientific paper is counting the number of other papers that cite that paper. However, due to the drawn-out nature of the scientific publication process, there is a lag of at least a year or so after a paper is published before citations to it even begin to appear in the literature, and at least a few years are generally needed to get an accurate measure of how heavily cited an article will actually be. It's reasonable to ask, then, if there exists a mechanism to judge the impact of a paper much earlier in its lifetime.

Several analyses now indicate that how frequently a paper is downloaded soon after publication predicts--to an extent--how highly it will be cited later on. The most recent analysis (Watson, 2009) compared download counts with citations for the Journal of Vision. There is a lot of extraneous, or at least uninteresting, information in this paper (e.g. that the total number of citations correlates with total downloads over the lifetime of a paper and that both numbers increase year after year), but the key data are in Figure 6. This figure shows that the number of downloads per day over the first 1,000 days after publication correlates well with the number of citations per year five years down the line (r = 0.62).
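For readers who want to try this sort of comparison on their own journal's numbers, here is a minimal sketch of the calculation behind a figure like Watson's Figure 6. The download and citation figures below are invented for illustration (they are not Watson's data); the point is just that the comparison boils down to a Pearson correlation between per-article download rates and per-article citation rates.

```python
# Rough sketch (not Watson's code or data): correlate each article's early
# download rate with its later citation rate.
import numpy as np

# Invented per-article numbers, purely for illustration.
downloads_per_day = np.array([3.1, 0.8, 5.6, 1.2, 2.4, 0.5, 4.0, 1.9])   # first ~1,000 days
citations_per_year = np.array([4.0, 1.0, 7.5, 2.0, 3.0, 0.5, 6.0, 2.5])  # ~5 years later

r = np.corrcoef(downloads_per_day, citations_per_year)[0, 1]
print(f"Pearson r = {r:.2f}")   # Watson reports r = 0.62 for the Journal of Vision
```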

The oldest of these sorts of analyses (Perneger, 2004) examined papers published in BMJ in the first half of 1999. This study found a correlation of r = 0.54 between the total number of citations after five years and the number of views of the full-length HTML version of an article during the first week after publication. The result is particularly notable given how small the window for counting page views was compared to the long window over which citations were recorded.

A more recent analysis (Brody et al., 2006) looked at papers in arXiv, a preprint archive for papers in math, physics, and related fields. Presumably, arXiv was used because of the availability of data and its much more rapid citation turnover compared to a traditional journal. Looking specifically at papers in high energy physics, the authors found a correlation of r = 0.440 between the number of citations and total downloads over the two years after publication. More to the point, a similar correlation (r = 0.397) could be achieved by looking only at downloads during the first six months after publication. Thus, the number of times an article is downloaded during the first six months after being posted on arXiv gives an idea of how many times it will be cited over the next two years.

In 2008, Nature Neuroscience published an analysis of some of its own citation/download numbers. Specifically, they counted the total citations (through March 2008) of papers published in 2005 and compared these values to numbers of downloads of PDF and HTML versions of articles recorded over various time windows. PDF downloads were consistently a slightly better predictor of citations than HTML downloads, and the predictive power was greatest if PDF downloads were measured for six months after publication (r = 0.724). The journal's blog, Action Potential, has a much more in-depth discussion of these numbers, but this passage from the published analysis caught my eye:

These results suggest that immediate online readership, with the assumption that everyone downloading the paper is reading it, and eventual citation counts are highly correlated, casting some doubt on the potential view that manuscript citation numbers are often a product of 'reference mining' rather than a reflection of the influential science shaping an author's work. By using a relatively recent cohort of papers for gathering citation counts, and with readership in this analysis measured well before citation totals for individual articles become more influential, we have hopefully minimized the impact of citation for historical reasons, which can be more of a risk for older papers. Although citation feedback loops may still artificially raise the cited totals for particular articles, increased readership could be just as plausible an explanation.

...

Despite the danger of using citation numbers to gauge scientific influence, it is still reassuring to note that these values do seem to reflect the overall readership of and community interest in a particular manuscript. Our results also recapitulate those from earlier attempts at comparing download statistics and citation numbers, calculated for physics or mathematics preprints and for a small sector of the medical literature. Thus, this correlation may hold up across disciplines.

Interestingly, as other 'Web 2.0' technologies such as blogs and paper commenting become more entrenched in the community as a way of providing feedback on papers and tracking their popularity, following reader traffic from these additional venues may provide other variables with which to calibrate citation numbers, providing a more complete picture of manuscript influence on a particular field. Soon, citation numbers may only be a small factor within a much larger and more complete overall metric used to gauge scientific influence, impact and importance.

I thought that this was an interesting take, since the major theme through most of these analyses is trying to use download counts to predict a presumably "better" measure of impact. Nature Neuroscience turns this logic on its head, though, implying that download counts might be just as good a measure of impact (or even a better one), since they more directly quantify how many people a paper actually reaches. It's an interesting idea, and the suggestion that download counts be considered side by side with citation counts as a complementary measure is certainly a good one.

Still, I think it's worth considering some caveats to this whole discussion. First, although all of these analyses report statistically significant correlations between early downloads and later citations, the correlations are not always particularly large. Depending on the analysis, early download numbers explain only about 20-50% of the variance in final citation counts (that is, the square of the reported r values). Thus, download counts should be used as a predictor of citation counts with some caution.
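To see where that 20-50% range comes from, just square the r values quoted above; r-squared is the fraction of variance in citation counts that download counts account for. A quick back-of-the-envelope calculation (nothing here beyond the published correlations):

```python
# Squaring the reported correlations gives the fraction of variance in citation
# counts that early downloads account for (r squared).
reported_r = {
    "Watson 2009, Journal of Vision (first 1,000 days)": 0.62,
    "Perneger 2004, BMJ (first week)":                   0.54,
    "Brody 2006, arXiv (two years)":                     0.440,
    "Brody 2006, arXiv (first six months)":              0.397,
    "Nature Neuroscience 2008 (PDF, six months)":        0.724,
}
for study, r in reported_r.items():
    print(f"{study}: r = {r:.3f}, variance explained = {r**2:.0%}")
# The squared values run from roughly 16% to 52%, hence the ~20-50% figure above.
```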

Also, all of these studies have looked at a single journal (or, in the arXiv case, a single repository), and I imagine that the relationship between citations and downloads will vary substantially between journals. I would guess that journals that interact more broadly with the blogosphere (such as the PLoS journals) might have a higher ratio of downloads to citations (but that's just a guess, and we'd need some data before we could say it's actually true). And, of course, this will depend heavily on scientific discipline, since publication and citation norms can vary quite a bit.

Finally, download counts can probably be swayed more easily than citation numbers. For example, back in 2006 I blogged about a paper in Molecular Cancer that I was an author on. The article was subsequently designated a "highly accessed" article by the journal, but the number of citations it has garnered has not been particularly high for that journal. This is just an anecdote, of course, and I don't really know whether blogging about the article is what made it "highly accessed", but it seems likely to me.

But if, at the end of the day, by "impact" we really mean "reach", then these caveats are largely immaterial: the number of downloads is clearly a better measure of how many people read a paper than the number of citations is. It's remarkable that we now have access to such a metric, which would have been out of reach before the scientific literature went online. However, I think that by "impact" we usually mean "influence", and citation counts remain the more logical measure of that. Still, it's exciting to see that we might be able to predict, at least to some degree, the future impact of a paper within the first few years, months, weeks, or even days of publication.


Hat tip to Mo of Neurophilosophy.


Brody, T., Harnad, S., & Carr, L. (2006). Earlier Web usage statistics as predictors of later citation impact. Journal of the American Society for Information Science and Technology, 57(8), 1060-1072. DOI: 10.1002/asi.20373

(2008). Deciphering citation statistics. Nature Neuroscience, 11(6), 619. DOI: 10.1038/nn0608-619

Perneger, T. (2004). Relation between online "hit counts" and subsequent citations: prospective study of research papers in the BMJ. BMJ, 329(7465), 546-547. DOI: 10.1136/bmj.329.7465.546

Watson, A. (2009). Comparing citations and downloads for individual articles. Journal of Vision, 9(4), 1-4. DOI: 10.1167/9.4.i


Perhaps "influence" could be roughly defined by dividing the number of citations of an article by number of reads (or downloads).
For example, if a paper is influential, it will lead authors who read it to engage with it in their own work by citing it, raising the coefficient.

Conversely, if the paper is not influential, it will be read, but will not spark further discussion. This is where the coefficient would be lower.

Maybe.

Well, I think that all of these metrics measure different things. Citation counts seem to be the best measure of how much a paper has actually impacted its field. Download counts, though, are a better measure of what I would call reach. Citations/download ratios, then, might be a measure of something like "influence density". However, given how highly correlated citations and downloads appear to be, and given what I know about citation practices, I'm not sure how useful such a metric would be.

The best measure of influence, though, is probably the Y-factor, which takes into account the impact of the citing papers also. However, it takes even longer to get an accurate Y-factor than a citation count.
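For what it's worth, here is a rough sketch of what such a citations-per-download "influence density" number would look like in practice. Every figure below is invented, just to show how the ratio separates widely read papers from widely cited ones.

```python
# Toy illustration of the proposed "influence density" ratio: citations per
# download. All numbers are invented; real values would come from a journal's
# usage logs and a citation database.
articles = {
    "paper A": {"citations": 40, "downloads": 2000},  # widely read and widely cited
    "paper B": {"citations": 5,  "downloads": 2500},  # widely read, rarely cited
    "paper C": {"citations": 12, "downloads": 300},   # narrow audience, often cited
}
for name, counts in articles.items():
    ratio = counts["citations"] / counts["downloads"]
    print(f"{name}: {ratio:.3f} citations per download")
```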

Indeed, your post is right on point, but don't you think a distinction should be made between open access (or even just free access) and subscription journals?

although the two models are generally hard to compare, there are more downloads of open/free access articles than there are for subscription ones. (89% more downloads, claims http://www.bmj.com/cgi/content/abstract/337/jul31_1/a568, which suggests that "in the first year after the articles were published, open-access articles were downloaded more but were no more likely to be cited than subscription-based articles.")

in the case of open access, quality doesn't seem to make a difference; and I'd conclude that, all other factors being as equal as possible, the free article will always win out in number of downloads.

which, of course, is not a bad thing for a journal. more reads means more exposure, ad revenue, prestige, etc etc. but what we really need to be concerned with in scientific publishing is quality.