Enough monkeys banging on keyboards over enough time should produce, through random chance alone, sensible prose now and then.
But if the monkeys are bloggers, reporters, and other people, the noise they generate is not merely pseudo-sensible through (highly unlikely) chance events; it should actually contain some information. With a little tweaking and a lot of filtering and analysis, it is possible to monitor the chatter for signs of emerging infectious diseases and quite possibly get on top of some of these events faster than would otherwise be possible.
In one of the most frequently cited examples, early indications of the severe acute respiratory syndrome (SARS) outbreak in Guangdong Province, China, came in November 2002 from a Chinese article that alluded to an unusual increase in emergency department visits with acute respiratory illness. This was followed by media reports of a respiratory disease among health care workers in February 2003, all captured by the Public Health Agency of Canada’s Global Public Health Intelligence Network (GPHIN). In parallel, online discussions on the ProMED-mail system referred to an outbreak in Guangzhou, well before official government reports were issued…
This diagram shows roughly how this might work:
Stages of HealthMap Surveillance
(1) Web-based data are acquired from a variety of Web sites every hour, 7 days a week (ranging from rumors on discussion sites to news media to validated official reports). (2) The extracted articles are then categorized by pathogen and location of the outbreak in question. (3) Articles are then analyzed for duplication and content. Duplicate articles are removed, while those that discuss new information about an ongoing situation are integrated with other related articles and added to the interactive map. (4) Once classified, articles are filtered by their relevance into five categories. Only “breaking news” articles are added as markers to the map.
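To make stages (3) and (4) a little more concrete, here is a minimal sketch of removing exact duplicates and keeping only "breaking news" articles as map markers. It is not HealthMap's code: the `Article` fields, the function names, and the whitespace/case normalization used to decide what counts as an exact duplicate are all assumptions made for illustration. The caption mentions five relevance categories but names only "breaking news", so the sketch treats every other label as "not a marker".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one collected article (field names are assumed)."""
    headline: str
    body: str
    disease: str
    location: str
    relevance: str  # "breaking news" or any other label


def _normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not hide an otherwise identical article.
    return " ".join(text.lower().split())


def drop_exact_duplicates(articles):
    """Keep the first copy of each article, comparing normalized headline and body."""
    seen = set()
    unique = []
    for article in articles:
        key = (_normalize(article.headline), _normalize(article.body))
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


def map_markers(articles):
    """Return the de-duplicated articles that would become markers on the map."""
    return [a for a in drop_exact_duplicates(articles) if a.relevance == "breaking news"]
```

The real system also merges articles that add new information about an ongoing situation; that step is left out here.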
According to the article in PLoS outlining this approach:
The system characterizes disease outbreak reports by means of a series of text mining algorithms. [The system works by] (a) identifying disease and location; (b) determining relevance…; and (c) grouping similar reports together while removing exact duplicates. Once the reports are automatically processed, curators correct the misclassifications of the system where necessary …
HealthMap draws from a continually expanding dictionary of pathogens (human, plant, and animal diseases) and geographic names (country, province, state, and city) to classify outbreak alert information. However, disease and place names are often ambiguous, colloquial, and subject to change, and may have multiple spellings (e.g., diarrhea, common in the US, and diarrhoea, common in the UK). Thus, the expansion and editing of the database requires extensive manual data entry.
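A dictionary lookup of this kind might look roughly like the sketch below, which maps spelling variants to one canonical name. The tiny dictionaries, the `tag` function, and the whole-word matching rule are assumptions for illustration; the paper does not publish its dictionary or matching logic in this form.

```python
import re

# Hypothetical miniature dictionaries: variant spelling -> canonical name.
DISEASE_SYNONYMS = {
    "diarrhea": "diarrhea",
    "diarrhoea": "diarrhea",
    "sars": "SARS",
    "severe acute respiratory syndrome": "SARS",
}

PLACE_SYNONYMS = {
    "guangdong": "Guangdong",
    "guangzhou": "Guangzhou",
}


def tag(text, synonyms):
    """Return the sorted canonical names whose variants appear as whole words in text."""
    lowered = text.lower()
    found = set()
    for variant, canonical in synonyms.items():
        # \b keeps "sars" from matching inside a longer word.
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)
```

With these dictionaries, `tag("Outbreak of diarrhoea reported in Guangzhou", DISEASE_SYNONYMS)` yields the US spelling as the canonical label. Every new variant has to be added by hand, which is exactly the manual data entry burden the authors describe.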
Once location and disease have been identified, articles are automatically tagged according to their relevance. Specifically, we identify whether a given report refers to a current outbreak (“breaking news”), as opposed to reporting on other infectious disease-related news…. Finally, duplicate reports are filtered, identified, and grouped based on the similarity of the article’s headline, body text, and disease and location categories. Using a similarity score threshold, the system groups related articles into clusters that provide the collective information on a given outbreak.
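The grouping step can be illustrated with a small sketch. The paper does not say which similarity measure or threshold HealthMap uses, so everything here is an assumption: Jaccard overlap of headline words stands in for the similarity score, 0.5 is an arbitrary threshold, reports are plain dictionaries, and each report is compared only against the first member of an existing cluster. Body text, which the real system also compares, is ignored.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings (1.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster(reports, threshold=0.5):
    """Greedily group reports sharing disease, location, and a similar headline."""
    clusters = []
    for report in reports:
        for group in clusters:
            first = group[0]
            if (report["disease"] == first["disease"]
                    and report["location"] == first["location"]
                    and jaccard(report["headline"], first["headline"]) >= threshold):
                group.append(report)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([report])
    return clusters
```

Given two made-up headlines such as "Cholera outbreak kills 10 in Harare" and "Cholera outbreak kills 12 in Harare", both tagged with the same disease and location, the sketch puts them in one cluster, while a report about a different disease starts its own.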
Brownstein, J.S., Freifeld, C.C., Reis, B.Y., Mandl, K.D. (2008). Surveillance Sans Frontières: Internet-Based Emerging Infectious Disease Intelligence and the HealthMap Project. PLoS Medicine, 5(7), e151. DOI: 10.1371/journal.pmed.0050151