Machine translation taking a quantum leap forward

Steven Pinker points out in The Language Instinct that the potential ambiguities in any sentence make programming computers to understand language quite difficult: humans can quickly determine the appropriate interpretation through context, but computers can't understand context, so they flounder when trying to translate texts. The sentence "Time flies like an arrow," for example, can be interpreted in five different ways. Here are just two of them:

When timing houseflies, time them in the same manner in which you time arrows
A type of fly, a "time fly," enjoys the company of a particular arrow

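To see how quickly the readings pile up, here is a minimal sketch using the NLTK toolkit with a small made-up grammar (the grammar is illustrative, not Pinker's analysis). Because "time," "flies," and "like" can each fill more than one part of speech, even this toy grammar licenses several distinct parse trees for the sentence, including the two readings listed above.

```python
import nltk

# Toy grammar (invented for illustration): "time", "flies", and "like" can
# each fill more than one part of speech, so the sentence parses several ways.
grammar = nltk.CFG.fromstring("""
S  -> NP VP | VP
NP -> N | N N | Det N | NP PP
VP -> V NP | V PP | V NP PP
PP -> P NP
N  -> 'time' | 'flies' | 'arrow'
V  -> 'time' | 'flies' | 'like'
P  -> 'like'
Det -> 'an'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("time flies like an arrow".split()):
    print(tree)  # one tree per distinct reading, including the two above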
While it's striking to realize the potential for ambiguity in such a simple sentence, the problem is compounded in longer sentences: in an analysis of 891 sentences ranging in length from 1 to 25 words, a team led by Kathryn Baker found an average of 27 possible parses per sentence. When translating between two languages, software such as Google Language Tools faces similar difficulties in both the source and the target language.

So how is the problem being addressed? Wired has an excellent article discussing one new technology:

In [previous attempts to handle the problem], called statistical-based MT, algorithms analyze large collections of previous translations, or what are technically called parallel corpora - sessions of the European Union, say, or newswire copy - to divine the statistical probabilities of words and phrases in one language ending up as particular words or phrases in another.
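
As a cartoon of that statistical idea, the sketch below counts how often a source word co-occurs with each target word across a tiny invented set of aligned sentence pairs and turns the counts into translation probabilities. Real systems induce word alignments (typically with EM) over millions of sentence pairs; the three-pair "corpus" here is only a stand-in.

```python
from collections import Counter, defaultdict

# Tiny invented stand-in for a parallel corpus (e.g. EU proceedings).
parallel_corpus = [
    ("la casa roja", "the red house"),
    ("la casa grande", "the big house"),
    ("una casa", "a house"),
]

# Count how often each source word appears in a sentence pair with each
# target word. Real systems learn word alignments instead of this crude
# everything-with-everything pairing.
cooccurrence = defaultdict(Counter)
for source, target in parallel_corpus:
    for s_word in source.split():
        for t_word in target.split():
            cooccurrence[s_word][t_word] += 1

def translation_probabilities(s_word):
    """Relative frequency of each target word seen alongside s_word."""
    counts = cooccurrence[s_word]
    total = sum(counts.values())
    return {t_word: count / total for t_word, count in counts.items()}

print(translation_probabilities("casa"))  # "house" dominates, as it should
```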

In the new system described in the article, instead of relying on parallel texts, a bilingual dictionary is used to generate all possible translations of a small chunk of the source text into the target language -- say, English. These candidates are then compared against a 150 GB database of English phrases to identify likely real-language equivalents.
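
A loose sketch of that generate-and-check step is below. The miniature dictionary, the Spanish chunk, and the phrase counts standing in for the 150 GB database are all invented for illustration; they are not Meaningful Machines' actual resources or code.

```python
import itertools
from collections import Counter

# Invented word-for-word bilingual dictionary.
dictionary = {
    "nuestra": ["our"],
    "responsabilidad": ["responsibility", "liability"],
    "de": ["of", "for", "from"],
}

# Invented stand-in for the 150 GB English phrase database: phrase -> count.
phrase_counts = Counter({
    "our responsibility for": 120,
    "our liability for": 15,
    "our responsibility of": 3,
})

def candidate_translations(chunk):
    """Generate every translation of the chunk the dictionary allows."""
    options = [dictionary[word] for word in chunk.split()]
    return [" ".join(combo) for combo in itertools.product(*options)]

def score(candidate):
    """Score a candidate by how often it occurs as a real English phrase."""
    return phrase_counts[candidate]  # Counter returns 0 for unseen phrases

chunk = "nuestra responsabilidad de"
best = max(candidate_translations(chunk), key=score)
print(best)  # "our responsibility for" wins on real-language evidence
```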

Next, the software slides its window one word to the right, repeating the flooding process with another five- to eight-word chunk: "nuestra responsabilidad de lo que ha ocurrido en." Using what Meaningful Machines calls the decoder, it then rescores the candidate translations according to the amount of overlap between each chunk's translation options and the ones before and after it. If "We declare our responsibility for what has happened" overlaps with "declare our responsibility for what has happened in" which overlaps with "our responsibility for what has happened in Madrid," the translation is judged accurate.
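
The overlap test itself can be sketched in a few lines. The candidate chunks below are taken from the quoted example, but the scoring is only a loose approximation of whatever the real decoder does, which would keep and rescore many candidates per window.

```python
def overlap(left, right):
    """Longest suffix of chunk `left` that is also a prefix of chunk `right`."""
    left_words, right_words = left.split(), right.split()
    best = 0
    for k in range(1, min(len(left_words), len(right_words)) + 1):
        if left_words[-k:] == right_words[:k]:
            best = k
    return best

# One chosen candidate per sliding window, from the example in the article.
chain = [
    "We declare our responsibility for what has happened",
    "declare our responsibility for what has happened in",
    "our responsibility for what has happened in Madrid",
]

scores = [overlap(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]
print(scores)  # [7, 7]: heavy overlap, so the chained translation looks consistent
```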

The result is a system that's more accurate and requires less data and processing time than previous efforts. The whole article is highly readable and highly recommended. Mind Hacks also has a great summary of the article.

"You can't put too much water in a nuclear reactor."

By David Group (not verified) on 05 Dec 2006 #permalink

Sounds like an advance.

I use a free text-to-speech system called Festival. While it takes some getting used to, it seems to know parts of speech, like nouns and verbs, and generally gets them right. However, it totally punted on this:

I live for live music.

which it pronounced:

I liive for liv music.

Now, it knows that 'live' can be a verb or an adjective, and it knows that they are not pronounced the same. It just made a mistake.

So, this approach might work for text-to-speech too. You shouldn't have to compare to 150 GB of text. That stuff should be reduced to a word linkage database.
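
Something like the sketch below, maybe; the neighbor counts are made-up numbers standing in for what you'd harvest from a part-of-speech-tagged corpus, and the two senses of 'live' are the only homograph it knows about.

```python
from collections import Counter

# Made-up neighbor counts per sense; in practice you'd harvest these from a
# POS-tagged corpus rather than comparing against a raw 150 GB text dump.
neighbor_counts = {
    "verb": Counter({("i", "live"): 900, ("live", "for"): 400}),
    "adjective": Counter({("live", "music"): 800, ("for", "live"): 300}),
}

PRONUNCIATIONS = {"verb": "lɪv", "adjective": "laɪv"}

def pronounce_live(prev_word, next_word):
    """Pick a pronunciation for 'live' from its immediate neighbors."""
    scores = {
        sense: counts[(prev_word, "live")] + counts[("live", next_word)]
        for sense, counts in neighbor_counts.items()
    }
    return PRONUNCIATIONS[max(scores, key=scores.get)]

# "I live for live music"
print(pronounce_live("i", "for"))      # lɪv  (verb)
print(pronounce_live("for", "music"))  # laɪv (adjective)
```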

Computationally, 150 GB sounds like a lot. And it is. But this calendar year, I picked up a 160 GB hard disk for $60, new.