In case you didn’t know, reality is science fiction.
If you doubt me, read the news. Read, for example, this recent article in the New York Times about Carnegie Mellon’s “Read the Web” program, in which a computer system called NELL (Never Ending Language Learner) is systematically reading the internet and analyzing sentences for semantic categories and facts, essentially teaching itself idiomatic English as well as educating itself in human affairs. Paging Vernor Vinge, right?
NELL reads the Web 24 hours a day, seven days a week, learning language as a human would: cumulatively, over a long period of time. It parses text on the Internet for ontological categories, like “plants,” “music” and “sports teams,” then uses contextual clues to sort out what things belong in which categories, like “Nirvana is a grunge band” and “Peyton Manning plays for the Indianapolis Colts.”
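For the programmers in the audience, here is a rough idea of what "contextual clues" can mean in practice. This is a toy sketch of my own, not NELL's code: the two patterns and the `categorize` function are hypothetical and hand-written, whereas NELL learns its patterns from the text itself.

```python
import re
from collections import defaultdict

# Hypothetical, hand-written contextual patterns (an assumption for
# illustration; NELL learns its own). Each pattern captures a run of
# capitalized words that appears in a telltale context.
PATTERNS = {
    "band": re.compile(r"(?P<x>[A-Z]\w*(?: [A-Z]\w*)*) is an? \w+ band"),
    "athlete": re.compile(r"(?P<x>[A-Z]\w*(?: [A-Z]\w*)*) plays for"),
}

def categorize(sentences):
    """Return {category: set of phrases} found via the patterns above."""
    found = defaultdict(set)
    for sentence in sentences:
        for category, pattern in PATTERNS.items():
            for match in pattern.finditer(sentence):
                found[category].add(match.group("x"))
    return dict(found)

sentences = [
    "Nirvana is a grunge band.",
    "Peyton Manning plays for the Indianapolis Colts.",
]
result = categorize(sentences)
print(result)
```

With these two sentences, "Nirvana" lands in the band category and "Peyton Manning" in the athlete category. A real system would need far more patterns, plus some way of scoring how trustworthy each one is.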
In its self-taught exploration of Internet English, NELL is 87 percent correct. And the more it learns, the more accurate it will become. According to a paper called “Toward an Architecture for Never-Ending Language Learning,” NELL has two tasks: to read, and to learn from that reading — to “learn to read better each day than the day before…go[ing] back to yesterday’s text sources and extract[ing] more information more accurately.”
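The "go back to yesterday's text" idea can be illustrated with a tiny bootstrapping loop. Again, this is a simplified sketch under my own assumptions, not the algorithm from the paper: the `bootstrap` function, the five-sentence corpus, and the rule that a "context" is simply the rest of the sentence are all made up for illustration.

```python
def bootstrap(corpus, seeds, rounds=2):
    """Alternate between learning contexts from known instances and
    re-reading the same corpus with those contexts to find new instances."""
    instances = set(seeds)
    contexts = set()
    for _ in range(rounds):
        # Step 1: learn the text that follows each known instance.
        for sentence in corpus:
            for name in instances:
                prefix = name + " "
                if sentence.startswith(prefix):
                    contexts.add(sentence[len(prefix):])
        # Step 2: re-read the corpus using every context learned so far.
        for sentence in corpus:
            for context in contexts:
                suffix = " " + context
                if sentence.endswith(suffix):
                    instances.add(sentence[: -len(suffix)])
    return instances, contexts

# Toy corpus (invented for this example).
corpus = [
    "Nirvana is a grunge band",
    "Pearl Jam is a grunge band",
    "Pearl Jam released an album in 1991",
    "Soundgarden released an album in 1991",
    "Seattle is a city",
]

first_pass, _ = bootstrap(corpus, {"Nirvana"}, rounds=1)
second_pass, _ = bootstrap(corpus, {"Nirvana"}, rounds=2)
print(sorted(first_pass))   # ['Nirvana', 'Pearl Jam']
print(sorted(second_pass))  # ['Nirvana', 'Pearl Jam', 'Soundgarden']
```

The second pass over the same five sentences finds "Soundgarden" only because the first pass learned about "Pearl Jam." The sketch also shows the danger: a loose context like "released an album in 1991" would happily sweep in acts that are not grunge bands at all, which is why a real system has to score and filter what it learns.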
Like the premise of a dystopian sci-fi story, Read the Web is wonderful-terrifying. Wonderful, because we’ve designed a computer to teach itself, because it’s a case study in life-long learning, and because the results will certainly be useful. Terrifying, because it’s difficult to look at a massive computer coming up with accurate pronouncements like “bliss is an emotion” without feeling a shudder of horrible gravitas.
That said, I am shuttering my fearmongering sci-fi mind and embracing NELL’s mission, just one in a fascinating new field of research aimed at helping computers understand human language, using the Web as a key linguistic resource. The idea of a “Semantic Web,” an Internet as comprehensible to computers as it is to humans, has been in the computer science and AI discourse for years, with good old Sir Tim Berners-Lee carrying the torch. In a 2001 article for Scientific American, Berners-Lee wrote that “this structure will open up the knowledge and workings of humankind to meaningful analysis by software agents, providing a new class of tools by which we can live, work and learn together.”
Upon discovering this project, I had tons of questions about NELL: Could it read other languages? Who gets the data in the end? Does it have parental controls on? So I did what I always do in such cases, which is immediately write to the people in charge and flash my ScienceBlogs credentials in the hopes of gleaning some information from them. The result is this brief interview with the very gracious Professor Tom Mitchell, chair of the Machine Learning Department of the School of Computer Science at Carnegie Mellon University, and Burr Settles, a Carnegie Mellon postdoctoral fellow working on the project.
UNIVERSE Q&A WITH TOM MITCHELL AND BURR SETTLES OF CARNEGIE MELLON UNIVERSITY
Universe: At the moment, NELL is learning language and semantic categories in English, which would mean that its learning is limited to the output of the English-speaking world. Are there any plans to expand the program to different languages?
Professor Tom Mitchell: Interestingly, NELL’s learning methods can apply equally well to other western languages as they do to English (as long as the language uses the same character set as English). We started with English because, well, we speak English. And also because that is the most-used language on the web, and we wanted NELL to have access to lots of text.
Burr Settles: In principle, the technology driving NELL is language-independent, so there is reason to believe that, given a corpus of Spanish or Chinese, it could learn equally well. In fact, I suspect there are some languages it would perform even better with; for example, syntax and orthography are generally more consistent in Spanish than in English, so the Spanish NELL might learn much more quickly and accurately.
Universe: Could an advanced NELL-like computer teach itself another language?
Burr Settles: Quite possibly. For example, imagine that NELL learns a lot about the French Revolution from English-language documents, and also knows (because we say so, or maybe because it read so!) that Wikipedia pages have corresponding translations in other languages. If NELL assumes the facts available on the English- and French-language Wikipedia pages for the French Revolution are roughly equivalent, then it could use its knowledge to start to infer patterns, rules, word morphologies, etc. in French, and then start reading other French-language documents.
This isn’t unlike the way humans can easily pick up certain words (concrete nouns, prepositions) when traveling in foreign-language countries. I know, because I just got back from two weeks in Spain, which is why I’m absent from that fabulous New York Times photo!
Universe: When will NELL stop running?
Professor Tom Mitchell: We have absolutely no intention of stopping it from running. NELL stands for “Never Ending Language Learner.” We mean it, though of course we need to make research progress if we want to give it the ability to continue learning in useful ways.
Universe: Is NELL reading the web indiscriminately, or have you set it loose on particular corners of the Internet that are more conducive to language-learning (say, Wikipedia)?
Professor Tom Mitchell: NELL primarily uses a collection of 500,000,000 web pages that represent the most broadly popular, highly referenced pages on the web. But it also uses Google’s search engine to search for additional pages when it is looking for targeted information (e.g., for pages that will teach it more about sports teams). So it’s not in some corner of the web, but all over it.
Burr Settles: Currently, NELL reads indiscriminately. Of course, it tends to learn about proteins and cell lines mostly from biomedical documents, celebrities from news sites and gossip forums, and so on. In future versions of NELL, we hope it can decide its own learning agenda, e.g., “I’ve not read much about musical acts from the 1940s… maybe I’ll focus on those kinds of documents today!” Or, alternatively, we could say we need it to focus on a particular document. Previous successes in “machine reading” research have in fact relied on a narrow scope of knowledge (e.g., only articles about sports, or terrorism, or biomedical research) in order to learn anything. The fact that NELL learns to read reasonably well across all of these domains is actually a big step forward.
It has been interesting to hear the public’s response to NELL. There are many jokes about what will happen when it comes across 4chan or LOLcats, for example. But the reality is, those texts are already available to NELL, and it is largely ignoring them because they are so ill-formed and inconsistent.
Universe: Say NELL learns the English language well enough to be a Shakespearean scholar. What happens to the data then — do Google and Yahoo and DARPA get access to it?
Professor Tom Mitchell: Yes, and so will everybody. Already we have put NELL’s growing knowledge base up on the web. You can browse it, and also download the whole thing if you like. Furthermore, I am committed to sticking to this policy of making NELL’s extracted knowledge base available for free to anybody who wants to use it for any commercial or non-commercial purpose, for the life of this research project.
Universe: Lastly, the name NELL is a joke about the Jodie Foster movie, right?
Professor Tom Mitchell: Well, no. I didn’t really know about that movie…but I just took a look at NELL’s knowledge base, and it appears to know about it. Take a look. There, the light grey items are low confidence hypotheses that NELL is considering but not yet committing to. The dark black items are higher confidence beliefs. So it is considering that NELL might be a movie, a disease, and/or a writer, but it’s pretty confident that Jodie Foster starred in the movie…
Nell (the movie)