Tracking flu through online search queries.

This morning, I was made aware (by my better half) of the existence of Google Flu Trends. This is a project by Google to use search terms to create a model of flu activity across the United States. Indeed, the results have been good enough that they were reported in a Letter in Nature [1] back in November 2008 (but with a correction published online 19 February 2009). From that letter:

Seasonal influenza epidemics are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year. In addition to seasonal influenza, a new strain of influenza virus against which no previous immunity exists and that demonstrates human-to-human transmission could result in a pandemic with millions of fatalities. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza. One way to improve early detection is to monitor health-seeking behaviour in the form of queries to online search engines, which are submitted by millions of users around the world each day. Here we present a method of analysing large numbers of Google search queries to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.

Traditional surveillance systems, including those used by the US Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS), rely on both virological and clinical data, including influenza-like illness (ILI) physician visits. The CDC publishes national and regional data from these surveillance systems on a weekly basis, typically with a 1-2-week reporting lag. ...

We sought to develop a simple model that estimates the probability that a random physician visit in a particular region is related to an ILI; this is equivalent to the percentage of ILI-related physician visits. A single explanatory variable was used: the probability that a random search query submitted from the same region is ILI-related, as determined by an automated method described below. ...

We designed an automated method of selecting ILI-related search queries, requiring no previous knowledge about influenza. We measured how effectively our model would fit the CDC ILI data in each region if we used only a single query as the explanatory variable, Q(t). Each of the 50 million candidate queries in our database was separately tested in this manner, to identify the search queries which could most accurately model the CDC ILI visit percentage in each region. Our approach rewarded queries that showed regional variations similar to the regional variations in CDC ILI data: the chance that a random search query can fit the ILI percentage in all nine regions is considerably less than the chance that a random search query can fit a single location.

So, the goal was to comb through Google queries over a five-year period and to find which search terms showed the same pattern of seasonal variation as influenza-like illness (ILI), as tracked by the CDC during that same five-year period. No assumptions were made at the outset about what kind of queries would show these patterns, but after the data were analyzed, the search terms that displayed the best correlations with the ILI reports turned out to be focused on flu and related topics. (As described in Table 1, these queries focus on influenza symptoms and complications, cold and flu remedies, antibiotic medications, antiviral medications, and related diseases. The letter points out that there are other common queries in the U.S., like "high school basketball," whose popularity roughly coincides with the U.S. flu season but which didn't show a good fit to the CDC data.)
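To make the selection step concrete, here is a minimal sketch of ranking candidate queries by how well each one tracks ILI across several regions at once. It is an illustration under stated assumptions, not the researchers' code: the function names (`pearson`, `rank_queries`), the two regions, and all the numbers are made up, and a plain Pearson correlation averaged over regions stands in for the per-region model fit described in the letter.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_queries(query_series, ili_series):
    """Order queries by their mean correlation with ILI across all regions.

    query_series: {query: {region: [weekly query share, ...]}}
    ili_series:   {region: [weekly ILI visit percentage, ...]}
    """
    scores = {}
    for query, by_region in query_series.items():
        per_region = [pearson(by_region[r], ili_series[r]) for r in ili_series]
        scores[query] = sum(per_region) / len(per_region)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (invented): region B peaks a week or so later than region A.
ili = {
    "A": [1, 2, 4, 3, 1, 1],
    "B": [1, 1, 2, 4, 3, 1],
}
queries = {
    # Follows each region's own curve.
    "flu symptoms": {
        "A": [10, 21, 39, 30, 11, 9],
        "B": [9, 11, 20, 41, 29, 10],
    },
    # Seasonal, but identical everywhere, so it cannot match both regions.
    "high school basketball": {
        "A": [5, 6, 7, 7, 6, 5],
        "B": [5, 6, 7, 7, 6, 5],
    },
}

ranking = rank_queries(queries, ili)
print(ranking)
```

In this toy setup the merely seasonal query still correlates with ILI in each region, but averaging over regions with different timing pushes it below the query that follows each region's own curve, which is the point the letter makes about rewarding regional variation.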

The researchers used the 45 best-correlated search queries to build their model, then compared the predictions of the model against data that hadn't been used in the development of the model. They found a very good fit (mean correlation of 0.97).
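The fit-then-validate procedure can be sketched in a few lines. This is a hedged illustration only: the weekly values are invented, the helper names are hypothetical, and the choice to fit a straight line on the log-odds scale is an assumption about the model's form rather than something spelled out in this post. The single explanatory variable is the share of searches that are ILI-related, and the check is made against weeks that were not used for fitting.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented weekly values: share of searches that are ILI-related (q)
# and share of physician visits that are ILI-related (ili).
train_q = [0.010, 0.015, 0.030, 0.045, 0.020]
train_ili = [0.012, 0.018, 0.035, 0.050, 0.024]
test_q = [0.012, 0.025, 0.040]      # held-out weeks, not used for fitting
test_ili = [0.015, 0.030, 0.046]

# Fit on the training weeks only.
slope, intercept = fit_line([logit(q) for q in train_q],
                            [logit(p) for p in train_ili])

# Predict the held-out weeks and compare with what was observed.
predicted = [inv_logit(intercept + slope * logit(q)) for q in test_q]
held_out_r = pearson(predicted, test_ili)
print(predicted, held_out_r)
```

The number to watch is `held_out_r`: a model can always be tuned to fit the weeks it was built on, so the correlation on unseen weeks is what says whether it generalizes.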

The Google researchers aren't arguing that their model ought to replace actual public health surveillance, although they do note that changes in flu prevalence seem to show up more quickly in Google Flu Trends than in CDC tracking (where the reporting structure seems to build in an unavoidable lag of a week or two). And, they note that it's impossible to predict how pandemic flu, or even the threat of it, might affect people's search habits.

I think this is a really interesting piece of research. And, I have some further questions about it.

What are the implications of the "digital divide" for these results? Is it the case that the same people who don't have access to the internet to make Google searches are likely to be missed by disease surveillance mechanisms? In other words, is the digital divide mirrored by the health care divide? (Possibly this has as much to do with the ability to miss work, and to get transportation to a doctor's office, medical clinic, or hospital.) Or, are public libraries and other internet access points "capturing" a significant portion of the population susceptible to influenza outbreaks?

Which particular search terms are best correlated with ILI? (Table 1 lists the search query topics, but not actual search terms, and I can't help but wonder if particular words used in those searches signal that this is really a flu-related search, rather than a search related to some other bug going around.) How do ILI-related queries compare to broader queries about symptoms that lay people typically associate (wrongly, we are told by doctors) with "the flu"? Could it be that lay people actually do a better job discerning when they have flu-like symptoms than they are given credit for? Or is it instead the case that people don't do a good job separating flu from bad-cold, but their accurate recognition of particular kinds of symptoms that guide their Google searches marks the ILI-sufferers?

To the extent that the Google Flu Trends algorithm may actually turn out to be helpful to public health officials in tracking the emergence of ILI, will health care providers continue to regard "the university of Google" in such a negative light? Will they encourage their patients to turn to the internet as a preliminary step in their engagement with their own health care (since now there's something like reason to believe that their search queries can feed into good early detection of flu epidemics)?

If I were a health care provider, I think the fact that my patients were availing themselves of information through internet searches might make me try to give them some guidance ahead of time for assessing the reliability of the sources turned up in their searches. (Being a critical consumer of information, after all, might help keep you healthier.) I wonder if such guidance would affect not only how patients regarded their search results but also how they performed those searches in the first place -- in which case, it seems possible that a shift in search strategies within a patient population might result in queries that made less accurate predictions of the incidence of ILI (at least given the optimized algorithm Google is using now). This is to say, the algorithm might well need to be revised in response to the ways that health care providers' recognition of its usefulness leads to a change in their interactions with their patients.

One reason flu spreads is that people are out interacting with each other while they're contagious rather than home in bed getting better and minimizing transmission. I wonder how many of the ILI-related queries were made from computers in the workplace (and how many of those were performed by a person with ILI symptoms, rather than by the well family member or friend of a person with ILI symptoms). In the event that people stopped coming to work sick and started self-quarantining at the first sign of flu, what kind of effect would that have on the queries from which the Google algorithm makes ILI predictions?

I hope the Google researchers keep an eye on some of these questions as they continue to track the performance of their model.

[1] Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, Larry Brilliant (2008). Detecting influenza epidemics using search engine query data. Nature, 457 (7232), 1012-1014. DOI: 10.1038/nature07634



The TED talk… gives information about more intensive mining of the internet for early detection and early response to pandemics.

There are other, more data-intensive uses for the internet than simply analyzing search engine results. By searching blogs, news, etc., a lot of information can be identified for purposes ranging from pandemic detection and prevention to the analysis and thwarting of terrorism threats.

- Bill