Steven Pinker points out in The Language Instinct that the potential ambiguities in any sentence make programming computers to understand language quite difficult: humans can quickly determine the appropriate interpretation from context; computers cannot, so they flounder and have particular difficulty translating texts. The sentence "Time flies like an arrow," for example, can be interpreted in five different ways. Here are just two of them:
When timing houseflies, time them in the same manner in which you time arrows
A type of fly, a "time fly," enjoys the company of a particular arrow
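To get a feel for how the parses multiply, here is a minimal sketch using NLTK's chart parser with a toy grammar of my own; the rules are invented for illustration and are not taken from Pinker or from the parsing study mentioned below.

```python
import nltk

# Toy grammar in which "time", "flies", and "like" are each ambiguous
# between parts of speech, mirroring the classic example.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP | VP
    NP -> N | N N | Det N
    VP -> V NP | V PP | V NP PP
    PP -> P NP
    Det -> 'an'
    N  -> 'time' | 'flies' | 'arrow'
    V  -> 'time' | 'flies' | 'like'
    P  -> 'like'
""")

parser = nltk.ChartParser(grammar)
sentence = "time flies like an arrow".split()

# Enumerate every parse the grammar licenses for the sentence.
for tree in parser.parse(sentence):
    print(tree)
```

Even this stripped-down grammar produces three distinct parse trees for the five-word sentence; a broad-coverage grammar licenses many more, which is exactly what makes the computer's job so hard.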
While it's striking to realize the potential for ambiguity in such a simple sentence, the problem is compounded in longer sentences: in an analysis of 891 sentences ranging in length from 1 to 25 words, a team led by Kathryn Baker found an average of 27 possible ways to parse each sentence. When translating between two languages, software such as the Google Language Tools faces similar difficulties in both the source and the target language.
So how is the problem being addressed? Wired has an excellent article discussing one new technology:
In [previous attempts to handle the problem], called statistical-based MT, algorithms analyze large collections of previous translations, or what are technically called parallel corpora - sessions of the European Union, say, or newswire copy - to divine the statistical probabilities of words and phrases in one language ending up as particular words or phrases in another.
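To make the statistical idea concrete, here is a toy sketch of estimating translation probabilities from co-occurrence counts over aligned sentence pairs. The miniature corpus and the raw counting are my own simplification, not the actual models any of these systems use.

```python
from collections import Counter, defaultdict

# Tiny stand-in for a parallel corpus: (Spanish, English) sentence pairs.
# Real systems use millions of pairs (EU proceedings, newswire, etc.).
parallel_corpus = [
    ("declaramos nuestra responsabilidad", "we declare our responsibility"),
    ("nuestra responsabilidad", "our responsibility"),
    ("la responsabilidad", "the responsibility"),
    ("declaramos nuestra intencion", "we declare our intention"),
]

# Count how often each source word co-occurs with each target word.
cooccur = defaultdict(Counter)
for src, tgt in parallel_corpus:
    for s in src.split():
        for t in tgt.split():
            cooccur[s][t] += 1

def translation_probs(source_word):
    """Rough estimate of P(target word | source word) from raw counts."""
    counts = cooccur[source_word]
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

print(translation_probs("responsabilidad"))
# "responsibility" gets the largest share because it appears in every pair
# that contains "responsabilidad".
```

Real systems work with millions of sentence pairs and use alignment models rather than raw co-occurrence, but the principle is the same: the corpus itself tells you which renderings are likely.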
In the new system described in the article, instead of relying on parallel texts, a single bilingual dictionary is used to generate every possible translation of a small chunk of the source text into the target language -- say, English. These candidates are then compared against a 150 GB database of English phrases to identify the ones most likely to be real-language equivalents.
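Here is a rough sketch of that generate-then-filter idea, with a three-entry dictionary and a tiny phrase table standing in for the single dictionary and the 150 GB database; all of the entries and counts are invented for illustration.

```python
from itertools import product

# Toy bilingual dictionary: each Spanish word maps to its possible
# English renderings.
dictionary = {
    "nuestra": ["our"],
    "responsabilidad": ["responsibility", "liability"],
    "de": ["of", "for", "from"],
}

# Tiny stand-in for the 150 GB database of English phrases: phrase -> how
# often it was observed in real English text.
english_phrase_counts = {
    "our responsibility for": 9120,
    "our responsibility of": 310,
    "our liability for": 540,
}

def candidate_translations(spanish_chunk):
    """Generate every word-for-word rendering the dictionary allows."""
    options = [dictionary[w] for w in spanish_chunk.split()]
    return [" ".join(words) for words in product(*options)]

def rank_by_real_usage(candidates):
    """Order candidates by how often they occur in the English phrase table."""
    scored = [(english_phrase_counts.get(c, 0), c) for c in candidates]
    return sorted(scored, reverse=True)

chunk = "nuestra responsabilidad de"
for count, phrase in rank_by_real_usage(candidate_translations(chunk)):
    print(count, phrase)
```

The implausible word-for-word combinations score zero because they never occur in real English text, while "our responsibility for" floats to the top.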
Next, the software slides its window one word to the right, repeating the flooding process with another five- to eight-word chunk: "nuestra responsabilidad de lo que ha ocurrido en." Using what Meaningful Machines calls the decoder, it then rescores the candidate translations according to the amount of overlap between each chunk's translation options and the ones before and after it. If "We declare our responsibility for what has happened" overlaps with "declare our responsibility for what has happened in" which overlaps with "our responsibility for what has happened in Madrid," the translation is judged accurate.
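Below is a minimal sketch of the overlap rescoring idea, with invented candidate lists and a simple suffix/prefix match standing in for whatever Meaningful Machines' decoder actually does.

```python
def overlap(left, right):
    """Length of the longest word suffix of `left` that is also a prefix of `right`."""
    lw, rw = left.split(), right.split()
    best = 0
    for k in range(1, min(len(lw), len(rw)) + 1):
        if lw[-k:] == rw[:k]:
            best = k
    return best

# Candidate translations for three successive overlapping chunks.
chunk_candidates = [
    ["we declare our responsibility for what has happened",
     "we state our liability for that which occurred"],
    ["declare our responsibility for what has happened in"],
    ["our responsibility for what has happened in Madrid"],
]

# Rescore each candidate for the first chunk by how much it overlaps with
# the best-matching candidates for the chunks that follow it.
for cand in chunk_candidates[0]:
    score = sum(max(overlap(cand, nxt) for nxt in later)
                for later in chunk_candidates[1:])
    print(score, "|", cand)
```

The candidate that chains smoothly into the neighbouring chunks scores 13, while the incoherent one scores 0, which is the effect the quoted example describes.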
The result is a system that is more accurate and requires less data and processing time than previous efforts. The whole article is highly readable, and highly recommended. Mind Hacks also has a great summary of the article.
"You can't put too much water in a nuclear reactor."
Sounds like an advance.
I use a free text-to-speech system called festival. While it takes some getting used to, it seems to know parts of speech, like nouns and verbs, and generally gets them right. However, it totally punted on this:
I live for live music.
which it pronounced:
I liive for liv music.
Now, it knows that 'live' can be a verb or an adjective. And, it knows that they are not pronounced the same. It just made a mistake.
So, this approach might work for text-to-speech too. You shouldn't have to compare against 150 GB of text. That stuff should be reduced to a word linkage database.
Computationally, 150 GB sounds like a lot. And it is. But this calendar year, I picked up a 160 GB hard disk for $60, new.
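To make the commenter's suggestion concrete, here is a toy sketch of picking a pronunciation of "live" from a small table of neighbouring-word counts, a miniature stand-in for the proposed word linkage database. The counts and category names are invented, and this is not how festival actually works.

```python
# Toy "word linkage" table: how often each pronunciation of "live" is seen
# next to a given neighbouring word (counts are invented for illustration).
context_counts = {
    "rhymes_with_hive": {"i": 0, "music": 950, "band": 400, "for": 5},
    "rhymes_with_give": {"i": 800, "music": 3, "band": 1, "for": 600},
}

def pronounce_live(prev_word, next_word):
    """Pick the pronunciation whose recorded neighbours best match the context."""
    def score(pron):
        counts = context_counts[pron]
        return counts.get(prev_word, 0) + counts.get(next_word, 0)
    return max(context_counts, key=score)

words = "i live for live music".split()
for i, w in enumerate(words):
    if w == "live":
        prev_w = words[i - 1] if i > 0 else ""
        next_w = words[i + 1] if i + 1 < len(words) else ""
        print(i, pronounce_live(prev_w, next_w))
```

The context counts pick the verb pronunciation in "I live for" and the adjective pronunciation in "live music", which is the behaviour the commenter wanted.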