Read the Internet, Speak English


In case you didn't know, reality is science fiction.

If you doubt me, read the news. Read, for example, this recent article in the New York Times about Carnegie Mellon's "Read the Web" program, in which a computer system called NELL (Never Ending Language Learner) is systematically reading the internet and analyzing sentences for semantic categories and facts, essentially teaching itself idiomatic English as well as educating itself in human affairs. Paging Vernor Vinge, right?

NELL reads the Web 24 hours a day, seven days a week, learning language like a human would -- cumulatively, over a long period of time. It parses text on the Internet for ontological categories, like "plants," "music" and "sports teams," then uses contextual clues to sort out what things belong in which categories, like "Nirvana is a grunge band" (see below) and "Peyton Manning plays for the Indianapolis Colts."
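To make the idea of "contextual clues" concrete, here is a toy sketch in Python. It is not NELL's code: the two hand-written patterns, the category names, and the `extract_candidates` helper are all assumptions made up for illustration, and the real system learns its patterns rather than being handed them. The sketch only shows how a phrase like "X is a ... band" can suggest which category X belongs to.

```python
import re
from collections import defaultdict

# One or more capitalized words, e.g. "Nirvana" or "Peyton Manning".
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# Hypothetical contextual patterns (illustrative only, not taken from NELL).
# Each named group doubles as the category for whatever text it captures.
PATTERNS = [
    re.compile(rf"(?P<band>{NAME}) is an? (?:\w+ )?band\b"),
    re.compile(rf"(?P<athlete>{NAME}) plays for the (?P<sports_team>{NAME})"),
]


def extract_candidates(sentences):
    """Collect candidate category members from a list of sentences."""
    candidates = defaultdict(set)
    for sentence in sentences:
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                # Every named group in a match yields one (category, entity) pair.
                for category, entity in match.groupdict().items():
                    candidates[category].add(entity)
    return dict(candidates)


if __name__ == "__main__":
    sentences = [
        "Nirvana is a grunge band.",
        "Peyton Manning plays for the Indianapolis Colts.",
    ]
    for category, entities in extract_candidates(sentences).items():
        print(category, sorted(entities))
```

A system working at Web scale would presumably also weigh how often, and in how many different contexts, a candidate turns up before believing it; this sketch skips that step entirely.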

[Image: nirvanagrunge.jpg]

Amazing.

In its self-taught exploration of Internet English, NELL is 87 percent correct. And the more it learns, the more accurate it will become. According to a paper called "Toward an Architecture for Never-Ending Language Learning," NELL has two tasks: to read, and to learn from that reading -- to "learn to read better each day than the day before...go[ing] back to yesterday's text sources and extract[ing] more information more accurately."

Like the premise of a dystopian sci-fi story, Read the Web is wonderful-terrifying. Wonderful, because we've designed a computer to teach itself, because it's a case study in life-long learning, and because the results will certainly be useful. Terrifying, because it's difficult to look at a massive computer coming up with accurate pronouncements like "bliss is an emotion" without feeling a shudder of horrible gravitas.

That said, I am shuttering my fearmongering sci-fi mind and embracing NELL's mission, just one in a fascinating new field of research aimed at helping computers understand human language, using the Web as a key linguistic resource. The idea of a "Semantic Web," an Internet as comprehensible to computers as it is to humans, has been in the computer science and AI discourse for years, with good old Sir Tim Berners-Lee carrying the leading torch. In a 2001 article for Scientific American, Berners-Lee wrote that "this structure will open up the knowledge and workings of humankind to meaningful analysis by software agents, providing a new class of tools by which we can live, work and learn together."

Upon discovering this project, I had tons of questions about NELL: Could it read other languages? Who gets the data in the end? Does it have parental controls on? So I did what I always do in such cases: I wrote to the people in charge and flashed my ScienceBlogs credentials in the hope of gleaning some information. The result is this brief interview with the very gracious Professor Tom Mitchell, chair of the Machine Learning Department in the School of Computer Science at Carnegie Mellon University, and Burr Settles, a Carnegie Mellon postdoctoral fellow working on the project.

UNIVERSE Q&A WITH TOM MITCHELL AND BURR SETTLES OF CARNEGIE MELLON UNIVERSITY

Universe: At the moment, NELL is learning language and semantic categories in English, which would mean that its learning is limited to the output of the English-speaking world. Are there any plans to expand the program to different languages?

Professor Tom Mitchell: Interestingly, NELL's learning methods can apply equally well to other western languages as they do to English (as long as the language uses the same character set as English). We started with English because, well, we speak English. And also because that is the most-used language on the web, and we wanted NELL to have access to lots of text.

Burr Settles: In principle, the technology driving NELL is language-independent, so there is reason to believe that, given a corpus of Spanish or Chinese, it could learn equally well. In fact, I suspect there are some languages it would perform even better with; for example, syntax and orthography are generally more consistent in Spanish than in English, so the Spanish NELL might learn much more quickly and accurately.

Universe: Could an advanced NELL-like computer teach itself another language?

Burr Settles: Quite possibly. For example, imagine that NELL learns a lot about The French Revolution from English-language documents, and also knows (because we say so, or maybe because it read so!) that Wikipedia pages have corresponding translations in other languages. If NELL assumes the facts available on the English- and French-language Wikipedia pages for The French Revolution are roughly equivalent, then it could use its knowledge to start to infer patterns, rules, word morphologies, etc. in French, and then start reading other French-language documents.

This isn't unlike the way humans can easily pick up certain words (concrete nouns, prepositions) when traveling in foreign-language countries. I know, because I just got back from two weeks in Spain, which is why I'm absent from that fabulous New York Times photo!

Universe: When will NELL stop running?

Professor Tom Mitchell: We have absolutely no intention of stopping it from running. NELL stands for "Never Ending Language Learner." We mean it, though of course we need to make research progress if we want to give it the ability to continue learning in useful ways.

Universe: Is NELL reading the web indiscriminately, or have you set it loose on particular corners of the Internet that are more conducive to language-learning (say, Wikipedia)?

Professor Tom Mitchell: NELL primarily uses a collection of 500,000,000 web pages that represent the most broadly popular, highly referenced pages on the web. But it also uses Google's search engine to search for additional pages when it is looking for targeted information (e.g., for pages that will teach it more about sports teams). So it's not in some corner of the web, but all over it.

Burr Settles: Currently, NELL reads indiscriminately. Of course, it tends to learn about proteins and cell lines mostly from biomedical documents, celebrities from news sites and gossip forums, and so on. In future versions of NELL, we hope it can decide its own learning agenda, e.g., "I've not read much about musical acts from the 1940s... maybe I'll focus on those kinds of documents today!" Or, alternatively, we could say we need it to focus on a particular document. Previous successes in "machine reading" research have in fact relied on a narrow scope of knowledge (e.g., only articles about sports, or terrorism, or biomedical research) in order to learn anything. The fact that NELL learns to read reasonably well across all of these domains is actually a big step forward.

It has been interesting to hear the public's response to NELL. There are many jokes about what will happen when it comes across 4chan or LOLcats, for example. But the reality is, those texts are already available to NELL, and it is largely ignoring them because they are so ill-formed and inconsistent.

Universe: Say NELL learns the English language well enough to be a Shakespearean scholar. What happens to the data then -- do Google and Yahoo and DARPA get access to it?

Professor Tom Mitchell: Yes, and so will everybody. Already we have put NELL's growing knowledge base up on the web. You can browse it, and also download the whole thing if you like. Furthermore, I am committed to sticking to this policy of making NELL's extracted knowledge base available for free to anybody who wants to use it for any commercial or non-commercial purpose, for the life of this research project.

Universe: Lastly, the name NELL is a joke about the Jodie Foster movie, right?

Professor Tom Mitchell: Well, no. I didn't really know about that movie...but I just took a look at NELL's knowledge base, and it appears to know about it. Take a look. There, the light grey items are low confidence hypotheses that NELL is considering but not yet committing to. The dark black items are higher confidence beliefs. So it is considering that NELL might be a movie, a disease, and/or a writer, but it's pretty confident that Jodie Foster starred in the movie...

----------------------------------------------------------------------------------------------------------

Additional Resources:

NELL's mind-blowing Twitter feed

Nell (the movie)

Book: Semantic Web For Dummies

Book: A Semantic Web Primer, 2nd Edition (MIT Press)

Book: Programming the Semantic Web (O'Reilly)

Tim Berners-Lee on The Semantic Web, from Scientific American (PDF)

Populating the Semantic Web by Macro-Reading Internet Text, from Proceedings of the International Semantic Web Conference (PDF)

Toward an Architecture for Never Ending Language Learning (PDF)

Frequently Asked Questions about the Semantic Web, from W3C

The Semantic Web Revisited, from IEEE Intelligent Systems


87% correct makes it considerably more factually accurate than most Tea Party supporters.

The really wonderfully scary part is that they are getting that level of accuracy from a first generation version. Presumably, not only will NELL continue to advance, but the software that drives it will also be constantly updated and refined as the developers continue to study the effectiveness of NELL's learning algorithms.

SkyNet is NELL 2.0!

And upon becoming self-aware, NELL muttered one simple phrase "I can haz cheezburger?". It was truly a bad day for computer science.

By Time Traveller (not verified) on 10 Oct 2010

And one day NELL finds the page about itself, self-references, and enters a never-ending loop.

NELL has a fundamental misunderstanding with regard to "female". Many of its criteria for that term are really more appropriate for "celebrity" or "movie star", with no distinctions at all between the genders.

What we mostly need is a tool which can translate the garbage found on the Internet - and especially that which has been through Speakwrites without having been proof-read - into coherent English. Sadly, I suspect this problem is genuinely intractable.

And, given that pages devoted to conspiracy theories and penis-enlargement products vastly outnumber factual pages on the Web, what kind of world view is this project likely to end up with?

By Ian Kemmish (not verified) on 12 Oct 2010

I'm not sure exactly what "87% accurate" means (there are many, many possible definitions of "error" in natural language technologies), but I have to point out that if that's a per-word error rate, it works out to 2-3 errors per sentence on average. I mean, it's a high number, but way way below what we're used to in our daily interactions with people.

Some of its hits (albeit grey) are hilarious.

Look up pluto (planet) for example.

OB SF REF: " 'Rabbits,' the calculator said."

By Chris Winter (not verified) on 13 Oct 2010

I should explain what I remember about that for younger readers. In the SF tale, the main character (I think his name is Professor Granius) has invented a colloidal calculator and hooked it up to the local news channel in order to let it build a knowledge base. He is working on it with his nephew Bruce when aliens led by the Khafis Ghan arrive in a fleet of spaceships and take over Earth.

The rest you can discover on your own.

By Chris Winter (not verified) on 13 Oct 2010

I think English is necessary when you are on the Internet, because without it you can't understand what anything is.


South Africa, Germany: three countries joined together to make the film.
The film contains erotic thriller and horror scenes.

I also think that this new finding is wonderful-terrifying. Wonderful, because it's amazing to see a computer learn something by reading it and categorizing it on its own, and terrifying, because it's crazy to think that a computer can read our language and understand it. When I read the part about how NELL can decide the categories of the things it reads and understand the (English) Internet world, it reminded me of the movie "I, Robot". It got me thinking that we're not that far from inventing robots that could do what we say, or even from having co-workers that are robots. This article made me realize how far technology has come and how advanced computers will be in the future. The article also mentioned that NELL will be able to read other languages that use the same characters as English, and I'm looking forward to seeing the NELL software become available. Next thing you know, NELL might be teaching us something new; kind of insane when you think about it. I think we'll be seeing and hearing more about NELL since it's something new that has been discovered.

After I read this article I was just surprised and astonished. The computer system called NELL (Never Ending Language Learner) is just outstanding because of all the things that it is capable of doing. It is like a computer human or something of that sort. The reason I call it a computer human is that it is able to read English and understand it like us humans. It is able to read and comprehend other languages too. This particular asset of the software can help us interact with people who speak different languages, and it will reduce tensions between countries. Also, it learns like humans by reading and understanding information from websites, and it is able to categorize that information. That is amazing! Technology is getting far more advanced than I had thought. This computer software has great potential to do many things. For example, it can come up with cures for diseases, think of ways to reduce the amount of carbon dioxide in the atmosphere, and do many other things that can help make our lives much better and easier. This software will make a difference and this invention is just fantastic.

By Tijo Joseph (not verified) on 24 Oct 2010


NELL is incredible. I can't believe a computer system can teach itself. This is completely and utterly mind boggling. It should change the definition of life. I think that despite the fact that NELL is a computer system it should be considered a life form because it can teach itself. Hypothetically speaking, we could cryogenically freeze everyone and time the freezers to let us out in 100 years. Then, we could see what NELL has come up with. A cure for cancer? No more world hunger? The possibilities are endless. Hypothetically speaking of course.