Seed Media Group

Cognitive Daily

A new cognitive psychology article nearly every day

Profile

Dave and Greta Munger

Cognitive Daily reports nearly every day on fascinating peer-reviewed developments in cognition from the most respected scientists in the field.

Greta Munger is Professor of Psychology at Davidson College whose works include The History of Psychology: Fundamental Questions. Dave Munger is co-founder and editor of ResearchBlogging.org and a columnist on SEEDMAGAZINE.COM. And yes, he is married to Greta.

Machine translation taking a quantum leap forward

Category: News
Posted on: December 5, 2006 7:41 AM, by Dave Munger

Steven Pinker points out in The Language Instinct that the potential ambiguities in any sentence make programming computers to understand language quite difficult. Humans can quickly determine the appropriate interpretation through context; computers cannot, so they flounder and have difficulty translating texts. The sentence "Time flies like an arrow," for example, can be interpreted in five different ways. Here are just a couple of them:

When timing houseflies, time them in the same manner in which you time arrows
A type of fly, a "time fly," enjoys the company of a particular arrow

While it's striking to realize the potential for ambiguity in such a simple sentence, the problem is compounded in longer sentences: in an analysis of a set of 891 sentences ranging in length from 1 to 25 words, a team led by Kathryn Baker found an average of 27 possible ways to parse each sentence. When attempting to translate between two languages, software such as the Google Language Tools faces similar difficulties in both the original and target language.
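To get a rough feel for why ambiguity compounds with sentence length, here is a minimal sketch that multiplies the number of parts of speech each word could take. The toy lexicon and the function name `tag_assignments` are assumptions made up for illustration. The result counts part-of-speech assignments, not grammatical parses, so it is not the method Baker's team used and it will not reproduce the five readings Pinker describes.

```python
from math import prod

# Hypothetical toy lexicon: the parts of speech each word might take.
# These entries are illustrative assumptions, not a real dictionary.
LEXICON = {
    "time": {"noun", "verb", "adjective"},
    "flies": {"noun", "verb"},
    "like": {"verb", "preposition"},
    "an": {"determiner"},
    "arrow": {"noun"},
}

def tag_assignments(sentence):
    """Count the ways to assign one part of speech to every word."""
    words = sentence.lower().split()
    return prod(len(LEXICON[word]) for word in words)

print(tag_assignments("Time flies like an arrow"))  # 3 * 2 * 2 * 1 * 1 = 12
```

Most of those assignments are not grammatical, but each extra ambiguous word multiplies the possibilities a program has to rule out.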

So how is the problem being addressed? Wired has an excellent article discussing one new technology:

In [previous attempts to handle the problem], called statistical-based MT, algorithms analyze large collections of previous translations, or what are technically called parallel corpora - sessions of the European Union, say, or newswire copy - to divine the statistical probabilities of words and phrases in one language ending up as particular words or phrases in another.

In the new system described in the article, instead of using parallel texts, a dictionary is used to generate all possible translations of a small chunk of the text in the target language -- say, English. These candidates are then compared against a 150 GB database of English phrases to identify the ones most likely to be real-language equivalents.
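The generate-then-compare idea can be sketched in a few lines. Everything here is a toy assumption: the two-word dictionary, the three-sentence `CORPUS` standing in for the 150 GB database, and the function names. The article does not describe how the real system stores or scores its phrases, so treat this as an illustration of the general approach only.

```python
from itertools import product

# Hypothetical Spanish-to-English dictionary with several options per word.
DICTIONARY = {
    "nuestra": ["our", "ours"],
    "responsabilidad": ["responsibility", "liability"],
}

# Tiny stand-in for the large database of target-language text.
CORPUS = [
    "we accept our responsibility",
    "our responsibility is clear",
    "limited liability company",
]

def candidates(words):
    """Generate every word-by-word translation of a chunk."""
    options = [DICTIONARY.get(word, [word]) for word in words]
    return [" ".join(choice) for choice in product(*options)]

def corpus_count(phrase):
    """Count how often a candidate phrase appears in the corpus."""
    return sum(text.count(phrase) for text in CORPUS)

def rank(words):
    """Order candidates by corpus frequency, most frequent first."""
    scored = [(corpus_count(c), c) for c in candidates(words)]
    return sorted(scored, key=lambda pair: -pair[0])

print(rank(["nuestra", "responsabilidad"])[0])
```

Only "our responsibility" occurs in the toy corpus, so it rises to the top while the other three candidates score zero.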

Next, the software slides its window one word to the right, repeating the flooding process with another five- to eight-word chunk: "nuestra responsabilidad de lo que ha ocurrido en." Using what Meaningful Machines calls the decoder, it then rescores the candidate translations according to the amount of overlap between each chunk's translation options and the ones before and after it. If "We declare our responsibility for what has happened" overlaps with "declare our responsibility for what has happened in" which overlaps with "our responsibility for what has happened in Madrid," the translation is judged accurate.
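The overlap check in the quoted passage can be illustrated with a small sketch. It assumes each window moves exactly one word to the right and every candidate has the same length, so two neighboring candidates "overlap" when the first, minus its leading word, matches the second, minus its trailing word. The names `overlaps` and `rescore` and the competing candidate strings are hypothetical; the actual Meaningful Machines decoder is not specified in the article.

```python
def overlaps(left, right):
    """True if left without its first word equals right without its last."""
    return left.split()[1:] == right.split()[:-1]

def rescore(windows):
    """For each window, keep the candidate best supported by its neighbors."""
    best = []
    for i, options in enumerate(windows):
        def support(candidate):
            score = 0
            if i > 0:
                # Does any candidate in the previous window lead into this one?
                score += any(overlaps(prev, candidate) for prev in windows[i - 1])
            if i + 1 < len(windows):
                # Does this candidate lead into any candidate in the next window?
                score += any(overlaps(candidate, nxt) for nxt in windows[i + 1])
            return score
        best.append(max(options, key=support))
    return best

windows = [
    ["We declare our liability for what has occurred",
     "We declare our responsibility for what has happened"],
    ["state our duty of what has passed in",
     "declare our responsibility for what has happened in"],
    ["our duty of it that has passed in Madrid",
     "our responsibility for what has happened in Madrid"],
]
for choice in rescore(windows):
    print(choice)
```

In this toy input the three candidates from the article chain together, so each one outscores its competitor in the same window.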

The result is a system that is more accurate than previous efforts and requires less data and processing time. The whole article is highly readable, and highly recommended. Mind Hacks also has a great summary of the article.

Comments

1

"You can't put too much water in a nuclear reactor."

Posted by: David Group | December 5, 2006 9:43 AM

2

Sounds like an advance.

I use a free text to speech system called festival. While it takes some getting used to, it seems to know parts of speech, like nouns and verbs, and generally gets them right. However, it totally punted on this:

I live for live music.

which it pronounced:

I liive for liv music.

Now, it knows that 'live' can be a verb or an adjective. And, it knows that they are not pronounced the same. It just made a mistake.

So, this approach might work for text to speech too. You shouldn't have to compare to 150 GB of text. That stuff should be reduced to a word linkage database.

Computationally, 150 GB sounds like a lot. And it is. But, this calendar year, I picked up a 160 GB hard disk for $60, new.

Posted by: Stephen | December 7, 2006 2:05 PM

© 2006-2009 Seed Media Group LLC. ScienceBlogs is a registered trademark of Seed Media Group. All rights reserved.

Sites by Seed Media Group: Seed Media Group | ScienceBlogs | SEEDMAGAZINE.COM