What can search predict?

You've all heard about how you can predict all sorts of things, from movie grosses to flu trends, using search queries. I blogged earlier about the research of Yahoo's Sharad Goel, Jake Hofman, Sebastien Lahaie, David Pennock, and Duncan Watts in this area. Since then, they've written a research article.

Here's a picture:

[Figure: sharadsearch.png]

And here's their story:

We [Goel et al.] investigate the degree to which search behavior predicts the commercial success of cultural products, namely movies, video games, and songs. In contrast with previous work that has focused on realtime reporting of current trends, we emphasize that here our objective is to predict future activity, typically days to weeks in advance. Specifically, we use query volume to forecast opening weekend box-office revenue for feature films, first month sales of video games, and the rank of songs on the Billboard Hot 100 chart. In all cases that we consider, we find that search counts are indicative of future outcomes, but when compared with baseline models trained on publicly available data, the performance boost associated with search counts is generally modest--a pattern that, as we show, also applies to previous work on tracking flu trends.

The punchline:

We [Goel et al.] conclude that in the absence of other data sources, or where small improvements in predictive performance are material, search queries may provide a useful guide to the near future.

I like how they put this. My first reaction upon seeing the paper (having flipped through the graphs and not read the abstract in detail) was that it was somewhat of a debunking exercise: Search volume has been hyped as the greatest thing since sliced bread, but really it's no big whoop, it adds almost no information beyond a simple forecast. But then my thought was that, no, this is a big whoop, because, in an automatic computing environment, it could be a lot easier to gather/analyze search volume than to build those baseline models.

Sharad's paper is cool. My only suggestion is that, in addition to fitting the separate models and comparing them, they do the comparison on a case-by-case basis. That is, what percentage of the individual cases is predicted best by model 1, model 2, or model 3, and what is the distribution of the differences in performance? I think they're losing something by only doing the comparisons in aggregate.
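To make the case-by-case idea concrete, here is a minimal sketch in Python. It is not the authors' analysis. The model names (`baseline`, `search`, `combined`), the function names, and all the numbers are hypothetical, made up purely for illustration. For each case it computes the absolute error of each model, counts how often each model comes out best, and summarizes the distribution of error differences between two models.

```python
from statistics import median


def case_by_case(actual, predictions):
    """Per-case absolute errors and the share of cases each model predicts best.

    actual: list of observed outcomes.
    predictions: dict mapping a model name to its list of predictions.
    Ties go to the model listed first in `predictions`.
    """
    names = list(predictions)
    errors = {
        m: [abs(p - a) for p, a in zip(predictions[m], actual)] for m in names
    }
    wins = {m: 0 for m in names}
    for i in range(len(actual)):
        best = min(names, key=lambda m: errors[m][i])
        wins[best] += 1
    win_share = {m: wins[m] / len(actual) for m in names}
    return errors, win_share


def error_differences(errors, model_a, model_b):
    """Summarize per-case (error of model_a minus error of model_b).

    Positive values mean model_b was closer on that case.
    """
    diffs = sorted(ea - eb for ea, eb in zip(errors[model_a], errors[model_b]))
    return {"min": diffs[0], "median": median(diffs), "max": diffs[-1]}


# Hypothetical outcomes and predictions, invented for illustration only.
actual = [10, 20, 30, 40]
predictions = {
    "baseline": [12, 18, 33, 40],
    "search": [11, 25, 30, 43],
    "combined": [10, 19, 31, 42],
}

errors, win_share = case_by_case(actual, predictions)
summary = error_differences(errors, "baseline", "combined")
print(win_share)
print(summary)
```

With real data you'd want a histogram of the per-case differences rather than three summary numbers, but even the win shares can show things an aggregate error measure hides, such as one model winning most cases narrowly while losing a few badly.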

It also might be good if they could set up some sort of dynamic tracker that could perform the analysis in this paper automatically, for thousands of outcomes. Then in a year or so they'd have tons and tons of data. That would take this from an interesting project to something really cool.
