Enough monkeys banging on keyboards over enough time should produce, through random chance alone, sensible prose now and then.

But if the monkeys are bloggers, reporters, and other people, the noise they generate is not merely pseudo-sensible by (highly unlikely) chance; it should actually contain some information. With a little tweaking and a lot of filtering and analysis, it is possible to monitor the chatter for signs of emerging infectious diseases and quite possibly get on top of some of these events faster than would otherwise be possible.

In one of the most frequently cited examples, early indications of the severe acute respiratory syndrome (SARS) outbreak in Guangdong Province, China, came in November 2002 from a Chinese article that alluded to an unusual increase in emergency department visits with acute respiratory illness. This was followed by media reports of a respiratory disease among health care workers in February 2003, all captured by the Public Health Agency of Canada’s Global Public Health Intelligence Network (GPHIN). In parallel, online discussions on the ProMED-mail system referred to an outbreak in Guangzhou, well before official government reports were issued…

This diagram shows roughly how this might work:

Stages of HealthMap Surveillance
(1) Web-based data are acquired from a variety of Web sites every hour, 7 days a week (ranging from rumors on discussion sites to news media to validated official reports). (2) The extracted articles are then categorized by pathogen and location of the outbreak in question. (3) Articles are then analyzed for duplication and content. Duplicate articles are removed, while those that discuss new information about an ongoing situation are integrated with other related articles and added to the interactive map. (4) Once classified, articles are filtered by their relevance into five categories. Only “breaking news” articles are added as markers to the map.

According to the article in PLoS outlining this approach:

The system characterizes disease outbreak reports by means of a series of text mining algorithms. [The system works by] (a) identifying disease and location; (b) determining relevance…; and (c) grouping similar reports together while removing exact duplicates. Once the reports are automatically processed, curators correct the misclassifications of the system where necessary …

HealthMap draws from a continually expanding dictionary of pathogens (human, plant, and animal diseases) and geographic names (country, province, state, and city) to classify outbreak alert information. However, disease and place names are often ambiguous, colloquial, and subject to change, and may have multiple spellings (e.g., diarrhea, common in the US, and diarrhoea, common in the UK). Thus, the expansion and editing of the database requires extensive manual data entry.
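
To make the dictionary idea concrete, here is a minimal sketch of how variant spellings might be mapped onto a single canonical name before an article is tagged. This is not HealthMap's code: the tiny dictionaries, the function names `find_terms` and `tag_article`, and the whole-word matching rule are all assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionaries (assumption). Each variant spelling maps to
# one canonical name; a real dictionary would be far larger and hand-curated.
DISEASE_VARIANTS = {
    "diarrhea": "diarrhea",
    "diarrhoea": "diarrhea",
    "sars": "SARS",
    "severe acute respiratory syndrome": "SARS",
}
PLACE_VARIANTS = {
    "guangdong": "Guangdong",
    "guangzhou": "Guangzhou",
}


def find_terms(text, variants):
    """Return the sorted canonical names whose variants occur as whole words."""
    lowered = text.lower()
    found = set()
    for variant, canonical in variants.items():
        # \b keeps "sars" from matching inside a longer, unrelated word.
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


def tag_article(text):
    """Tag one article with the diseases and locations it mentions."""
    return {
        "diseases": find_terms(text, DISEASE_VARIANTS),
        "locations": find_terms(text, PLACE_VARIANTS),
    }
```

With this sketch, an article that says "diarrhoea" and one that says "diarrhea" end up with the same disease tag, which is the point of keeping variants in the dictionary. Every new spelling or colloquial name still has to be added by hand, which matches the manual effort the authors describe.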

Once location and disease have been identified, articles are automatically tagged according to their relevance. Specifically, we identify whether a given report refers to a current outbreak (“breaking news”), as opposed to reporting on other infectious disease-related news…. Finally, duplicate reports are filtered, identified, and grouped based on the similarity of the article’s headline, body text, and disease and location categories. Using a similarity score threshold, the system groups related articles into clusters that provide the collective information on a given outbreak.
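
The grouping step can be sketched roughly as follows. The quoted passage does not say which similarity measure or threshold the system uses, so this example assumes Jaccard similarity over word sets, a threshold of 0.5, and a simple greedy pass that compares each report with the first report of every existing cluster. The report fields and the function names are hypothetical as well.

```python
import re


def tokens(text):
    """Lowercase word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_reports(reports, threshold=0.5):
    """Drop exact duplicates, then greedily cluster similar reports.

    Each report is a dict with 'headline', 'body', 'disease' and 'location'
    keys (an assumed shape, not the system's real schema).
    """
    clusters = []
    seen = set()
    for report in reports:
        key = (report["headline"], report["body"])
        if key in seen:  # exact duplicate: discard it
            continue
        seen.add(key)
        words = tokens(report["headline"] + " " + report["body"])
        for cluster in clusters:
            first = cluster[0]
            same_tags = (first["disease"] == report["disease"]
                         and first["location"] == report["location"])
            first_words = tokens(first["headline"] + " " + first["body"])
            if same_tags and jaccard(words, first_words) >= threshold:
                cluster.append(report)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([report])
    return clusters
```

Raising the threshold splits one outbreak across several clusters, while lowering it risks merging unrelated stories, so any real value would have to be tuned against curated data.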

Brownstein, J.S., Freifeld, C.C., Reis, B.Y., Mandl, K.D. (2008). Surveillance Sans Frontières: Internet-Based Emerging Infectious Disease Intelligence and the HealthMap Project. PLoS Medicine, 5(7), e151. DOI: 10.1371/journal.pmed.0050151


  1. #1 Stephanie Z
    July 25, 2008

    Yay! If they can use data mining like this for “intelligence” applications, it should certainly be used for this sort of work. And I’m guessing that this, with a better defined mission and scope, will produce more useful results much more cheaply. Now, as long as it doesn’t get suppressed because it tells us too much about what other agencies are looking for and how….

  2. #2 Luna_the_cat
    July 26, 2008


    The first bit of dumb luck came disguised as a public embarrassment for the European Center for Defense Against Disease. On July 23, schoolchildren in Algiers claimed that a respiratory epidemic was spreading across the Mediterranean. The claim was based on a clever analysis of antibody data from the mass-transit system of Algiers and Naples.
    CDD had no immediate comment, but in less than three hours, public-health hobbyists reported similar results in other cities, complete with contagion maps. The epidemic was at least one week old, probably originating in Central Africa, beyond the scope of hobbyist surveillance.
    By the time the CDD got its public relations act together, the disease had been detected in India and North America. Worse yet, a journalist in Seattle had isolated and identified the infectious agent, which turned out to be a Pseudomimivirus. …

    –The beginning of Vernor Vinge’s Rainbows End.

  3. #3 Diane S.
    July 26, 2008

    Thanks for the link to that totally cool web site.