Download Counts Predict Future Impact of Scientific Papers

The gold standard for measuring the impact of a scientific paper is counting the number of other papers that cite that paper. However, due to the drawn-out nature of the scientific publication process, there is a lag of at least a year or so after a paper is published before citations to it even begin to appear in the literature, and at least a few years are generally needed to get an accurate measure of how heavily cited an article will actually be. It's reasonable to ask, then, if there exists a mechanism to judge the impact of a paper much earlier in its lifetime.

Several analyses now indicate that how frequently a paper is downloaded soon after publication predicts--to an extent--how highly it will be cited later on. The most recent analysis (Watson, 2009) compared download counts with citations for the Journal of Vision. There is a lot of extraneous, or at least uninteresting, information in this paper (e.g. that the total number of citations correlates with total downloads over the lifetime of a paper and that both numbers increase year after year), but the key data are in Figure 6. This figure shows that the number of downloads per day over the first 1,000 days after publication correlates well with the number of citations per year five years down the line (r = 0.62).
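For readers who want to try this sort of comparison on their own journal's numbers, here is a minimal sketch of the calculation behind a figure like Watson's Figure 6. The download and citation figures below are invented for illustration (they are not Watson's data); the point is just that the comparison boils down to a Pearson correlation between per-article download rates and per-article citation rates.

```python
# Rough sketch (not Watson's code or data): correlate each article's early
# download rate with its later citation rate.
import numpy as np

# Invented per-article numbers, purely for illustration.
downloads_per_day = np.array([3.1, 0.8, 5.6, 1.2, 2.4, 0.5, 4.0, 1.9])   # first ~1,000 days
citations_per_year = np.array([4.0, 1.0, 7.5, 2.0, 3.0, 0.5, 6.0, 2.5])  # ~5 years later

r = np.corrcoef(downloads_per_day, citations_per_year)[0, 1]
print(f"Pearson r = {r:.2f}")   # Watson reports r = 0.62 for the Journal of Vision
```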

The oldest of these sorts of analyses (Perneger, 2004) examined papers published in BMJ in the first half of 1999. This study found a correlation of r = 0.54 between the total number of citations after five years and the number of views of the full-length HTML version of an article during the first week after publication. The result is particularly notable given how small the window for counting page views was compared to the long window over which citations were recorded.

A more recent analysis (Brody et al., 2006) looked at papers in arXiv, a preprint archive for papers in math, physics, and related fields. Presumably, arXiv was used because of the availability of data and its much more rapid citation turnover compared to a traditional journal. Looking specifically at papers in high energy physics, the authors found a correlation of r = 0.440 between the number of citations and total downloads over the two years after publication. More to the point, a similar correlation (r = 0.397) could be achieved by looking only at downloads during the first six months after publication. Thus, the number of times an article is downloaded during the first six months after being posted on arXiv gives an idea of how many times it will be cited over the next two years.

In 2008, Nature Neuroscience published an analysis of some of its own citation/download numbers. Specifically, they counted the total citations (through March 2008) of papers published in 2005 and compared these values to numbers of downloads of PDF and HTML versions of articles recorded over various time windows. PDF downloads were consistently a slightly better predictor of citations than HTML downloads, and the predictive power was greatest if PDF downloads were measured for six months after publication (r = 0.724). The journal's blog, Action Potential, has a much more in-depth discussion of these numbers, but this passage from the published analysis caught my eye:

These results suggest that immediate online readership, with the assumption that everyone downloading the paper is reading it, and eventual citation counts are highly correlated, casting some doubt on the potential view that manuscript citation numbers are often a product of 'reference mining' rather than a reflection of the influential science shaping an author's work. By using a relatively recent cohort of papers for gathering citation counts, and with readership in this analysis measured well before citation totals for individual articles become more influential, we have hopefully minimized the impact of citation for historical reasons, which can be more of a risk for older papers. Although citation feedback loops may still artificially raise the cited totals for particular articles, increased readership could be just as plausible an explanation.

...

Despite the danger of using citation numbers to gauge scientific influence, it is still reassuring to note that these values do seem to reflect the overall readership of and community interest in a particular manuscript. Our results also recapitulate those from earlier attempts at comparing download statistics and citation numbers, calculated for physics or mathematics preprints and for a small sector of the medical literature. Thus, this correlation may hold up across disciplines.

Interestingly, as other 'Web 2.0' technologies such as blogs and paper commenting become more entrenched in the community as a way of providing feedback on papers and tracking their popularity, following reader traffic from these additional venues may provide other variables with which to calibrate citation numbers, providing a more complete picture of manuscript influence on a particular field. Soon, citation numbers may only be a small factor within a much larger and more complete overall metric used to gauge scientific influence, impact and importance.

I thought that this was an interesting take, since the major theme through most of these analyses is trying to use download counts to predict a presumably "better" measure of impact. Nature Neuroscience turns this logic on its head, though, implying that download counts might be just as good a measure of impact (or even a better one), since they more directly quantify how many people a paper actually reaches. It's an interesting idea, and the suggestion that download counts be considered side by side with citation counts as a complementary measure is certainly a good one.

Still, I think it's worth considering some caveats to this whole discussion. First, although all of these analyses report statistically significant correlations between early downloads and later citations, the correlations are not always particularly large. Depending on the analysis, early download numbers explain only about 20-50% of the variance in final citation counts (that is, the square of the reported r values). Thus, download counts should be used as a predictor of citation counts with some caution.
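To see where that 20-50% range comes from, just square the r values quoted above; r-squared is the fraction of variance in citation counts that download counts account for. A quick back-of-the-envelope calculation (nothing here beyond the published correlations):

```python
# Squaring the reported correlations gives the fraction of variance in citation
# counts that early downloads account for (r squared).
reported_r = {
    "Watson 2009, Journal of Vision (first 1,000 days)": 0.62,
    "Perneger 2004, BMJ (first week)":                   0.54,
    "Brody 2006, arXiv (two years)":                     0.440,
    "Brody 2006, arXiv (first six months)":              0.397,
    "Nature Neuroscience 2008 (PDF, six months)":        0.724,
}
for study, r in reported_r.items():
    print(f"{study}: r = {r:.3f}, variance explained = {r**2:.0%}")
# The squared values run from roughly 16% to 52%, hence the ~20-50% figure above.
```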

Also, all of these studies have looked at a single journal (or, in the arXiv case, a single repository), and I imagine that the relationship between citations and downloads will vary substantially between journals. I would guess that journals that interact more broadly with the blogosphere (such as the PLoS journals) might have a higher ratio of downloads to citations (but that's just a guess, and we'd need some data before we could say it's actually true). And, of course, this will depend heavily on scientific discipline, since publication and citation norms can vary quite a bit.

Finally, download counts can probably be swayed more easily than citation numbers. For example, back in 2006 I blogged about a paper in Molecular Cancer that I was an author on. The article was subsequently designated a "highly accessed" article by the journal, but the number of citations it has garnered has not been particularly high for that journal. This is just an anecdote, of course, and I don't really know whether blogging about the article is what made it "highly accessed", but it seems likely to me.

But if, at the end of the day, by "impact" we really mean "reach", then these caveats are largely immaterial: the number of downloads is clearly a better measure of how many people read a paper than the number of citations is. It's remarkable that we now have access to such a metric, which would have been out of reach before the scientific literature went online. However, I think that by "impact" we usually mean "influence", and citation counts remain the more logical measure of that. Still, it's exciting to see that we might be able to predict, at least to some degree, the future impact of a paper within the first few years, months, weeks, or even days of publication.


Hat tip to Mo of Neurophilosophy.


Brody, T., Harnad, S., & Carr, L. (2006). Earlier Web usage statistics as predictors of later citation impact. Journal of the American Society for Information Science and Technology, 57(8), 1060-1072. DOI: 10.1002/asi.20373

(2008). Deciphering citation statistics. Nature Neuroscience, 11(6), 619. DOI: 10.1038/nn0608-619

Perneger, T. (2004). Relation between online "hit counts" and subsequent citations: prospective study of research papers in the BMJ. BMJ, 329(7465), 546-547. DOI: 10.1136/bmj.329.7465.546

Watson, A. (2009). Comparing citations and downloads for individual articles. Journal of Vision, 9(4), 1-4. DOI: 10.1167/9.4.i


Perhaps "influence" could be roughly defined by dividing the number of citations of an article by number of reads (or downloads).
For example, if a paper is influential, it will lead authors who read it to engage with it in their own work by citing it, raising the coefficient.

Conversely, if the paper is not influential, it will be read, but will not spark further discussion. This is where the coefficient would be lower.

Maybe.

Well, I think that all of these metrics measure different things. Citation counts seem to be the best measure of how much a paper has actually impacted its field. Download counts, though, are a better measure of what I would call reach. Citations/download ratios, then, might be a measure of something like "influence density". However, given how highly correlated citations and downloads appear to be, and given what I know about citation practices, I'm not sure how useful such a metric would be.

The best measure of influence, though, is probably the Y-factor, which takes into account the impact of the citing papers also. However, it takes even longer to get an accurate Y-factor than a citation count.
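For what it's worth, here is a rough sketch of what such a citations-per-download "influence density" number would look like in practice. Every figure below is invented, just to show how the ratio separates widely read papers from widely cited ones.

```python
# Toy illustration of the proposed "influence density" ratio: citations per
# download. All numbers are invented; real values would come from a journal's
# usage logs and a citation database.
articles = {
    "paper A": {"citations": 40, "downloads": 2000},  # widely read and widely cited
    "paper B": {"citations": 5,  "downloads": 2500},  # widely read, rarely cited
    "paper C": {"citations": 12, "downloads": 300},   # narrow audience, often cited
}
for name, counts in articles.items():
    ratio = counts["citations"] / counts["downloads"]
    print(f"{name}: {ratio:.3f} citations per download")
```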

Indeed, your post is right on point, but don't you think a distinction should be made between open access (or even just free access) and subscription journals?

although the two models are generally hard to compare, there are more downloads of open/free access articles than there are for subscription ones. (89% more downloads, claims http://www.bmj.com/cgi/content/abstract/337/jul31_1/a568, which suggests that "in the first year after the articles were published, open-access articles were downloaded more but were no more likely to be cited than subscription-based articles.")

in the case of open access, quality doesn't seem to make a difference; and I'd conclude that, all other factors being as equal as possible, the free article will always win out in number of downloads.

which, of course, is not a bad thing for a journal. more reads means more exposure, ad revenue, prestige, etc etc. but what we really need to be concerned with in scientific publishing is quality.