You've all heard about how you can predict all sorts of things, from movie grosses to flu trends, using search results. I blogged earlier about the research of Yahoo's Sharad Goel, Jake Hofman, Sebastien Lahaie, David Pennock, and Duncan Watts in this area. Since then, they've written a research article.
Here's a picture:
And here's their story:
We [Goel et al.] investigate the degree to which search behavior predicts the commercial success of cultural products, namely movies, video games, and songs. In contrast with previous work that has focused on realtime reporting of current trends, we emphasize that here our objective is to predict future activity, typically days to weeks in advance. Specifically, we use query volume to forecast opening weekend box-office revenue for feature films, first month sales of video games, and the rank of songs on the Billboard Hot 100 chart. In all cases that we consider, we find that search counts are indicative of future outcomes, but when compared with baseline models trained on publicly available data, the performance boost associated with search counts is generally modest--a pattern that, as we show, also applies to previous work on tracking flu trends.
We [Goel et al.] conclude that in the absence of other data sources, or where small improvements in predictive performance are material, search queries may provide a useful guide to the near future.
I like how they put this. My first reaction upon seeing the paper (having flipped through the graphs and not read the abstract in detail) was that it was somewhat of a debunking exercise: Search volume has been hyped as the greatest thing since sliced bread, but really it's no big whoop, it adds almost no information beyond a simple forecast. But then my thought was that, no, this is a big whoop, because, in an automatic computing environment, it could be a lot easier to gather/analyze search volume than to build those baseline models.
Sharad's paper is cool. My only suggestion is that, in addition to fitting the separate models and comparing them, they do the comparison on a case-by-case basis. That is, what percentage of the individual cases are predicted better by model 1, model 2, or model 3, and what is the distribution of the differences in performance? I think they're losing something by only doing the comparisons in aggregate.
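To make the case-by-case idea concrete, here is a minimal sketch in Python. It is my own illustration, not anything from the paper: the function names, the use of absolute error as the performance measure, and the toy numbers are all assumptions, and "model_1", "model_2", "model_3" are just placeholder labels for whatever three models are being compared.

```python
from statistics import median

def case_by_case(actual, preds):
    """Compare models on each individual case.

    actual: list of observed outcomes.
    preds:  dict of model name -> list of predictions aligned with `actual`.
    Returns per-case absolute errors and each model's share of "wins".
    """
    names = list(preds)
    n = len(actual)
    # Absolute error of each model on each case (an assumed error measure).
    errors = {m: [abs(p - a) for p, a in zip(preds[m], actual)] for m in names}
    # Model with the smallest error on each case; ties go to the first listed.
    winners = [min(names, key=lambda m: errors[m][i]) for i in range(n)]
    win_share = {m: winners.count(m) / n for m in names}
    return errors, win_share

def error_differences(errors, model_a, model_b):
    """Per-case difference in error; positive means model_b did better."""
    return [ea - eb for ea, eb in zip(errors[model_a], errors[model_b])]

# Toy data, made up purely for illustration.
actual = [10, 20, 30, 40]
preds = {
    "model_1": [12, 19, 33, 40],
    "model_2": [11, 23, 31, 44],
    "model_3": [10, 21, 32, 41],
}

errors, win_share = case_by_case(actual, preds)
diffs = error_differences(errors, "model_1", "model_3")

print(win_share)      # share of cases each model predicts best
print(sorted(diffs))  # the distribution of differences, case by case
print(median(diffs))  # one summary of that distribution
```

The point of looking at `diffs` rather than a single aggregate score is that two models can have similar average error while one of them wins on most cases and loses badly on a few; a histogram of these per-case differences would show that directly.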
It also might be good if they could set up some sort of dynamic tracker that could perform the analysis in this paper automatically, for thousands of outcomes. Then in a year or so they'd have tons and tons of data. That would take this from an interesting project to something really cool.