New system may open up the world’s roughly 7,000 spoken languages to computer-based translation.
MIT researchers have developed a novel "unsupervised" language translation model -- meaning it runs without the need for human annotations and guidance -- that could lead to faster, more efficient computer-based translations of far more languages.
Translation systems from Google, Facebook, and Amazon require training models to look for patterns in millions of documents -- such as legal and political documents, or news articles -- that have been translated into various languages by humans. Given new words in one language, they can then find the matching words and phrases in the other language.
But this translational data is time-consuming and difficult to gather, and simply may not exist for many of the 7,000 languages spoken worldwide. Recently, researchers have been developing "monolingual" models that translate between texts in two languages, but without direct translational information between the two.
In a paper being presented this week at the Conference on Empirical Methods in Natural Language Processing, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) describe a model that runs faster and more efficiently than these monolingual models.
The model leverages a statistical metric, called Gromov-Wasserstein distance, that essentially measures distances between points in one computational space and matches them to similarly distanced points in another space. The researchers apply that technique to "word embeddings" of two languages, which are words represented as vectors -- basically, arrays of numbers -- with words of similar meanings clustered closer together. In doing so, the model quickly aligns the words, or vectors, in both embeddings that are most closely correlated by relative distances, meaning they’re likely to be direct translations.
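To make the idea of matching by relative distances concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the researchers' implementation: the "embeddings" are random toy vectors, the second "language" is simply a shuffled and rotated copy of the first, and the function names (pairwise_distances, gw_cost) are hypothetical. The sketch only evaluates the Gromov-Wasserstein objective for two candidate matchings; it does not include the optimization that searches for the best matching.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance between every pair of rows in `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def gw_cost(dist_a, dist_b, coupling):
    """Squared-loss Gromov-Wasserstein objective for a given coupling.

    Equals sum over i, j, k, l of
        (dist_a[i, k] - dist_b[j, l])**2 * coupling[i, j] * coupling[k, l],
    computed by expanding the square to avoid a four-way loop.
    """
    p = coupling.sum(axis=1)  # mass on each source word
    q = coupling.sum(axis=0)  # mass on each target word
    term_a = p @ (dist_a ** 2) @ p
    term_b = q @ (dist_b ** 2) @ q
    cross = np.sum((dist_a @ coupling @ dist_b.T) * coupling)
    return term_a + term_b - 2.0 * cross


# Toy data (assumption): 6 "words" in a 3-dimensional embedding space.
rng = np.random.default_rng(0)
n_words, dim = 6, 3
source = rng.normal(size=(n_words, dim))

# The second "language" is the same point cloud, moved by an orthogonal map
# and listed in a different order, so coordinates are not directly comparable.
orthogonal, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
perm = np.array([2, 0, 3, 1, 5, 4])  # target word j corresponds to source word perm[j]
target = (source @ orthogonal)[perm]

dist_source = pairwise_distances(source)
dist_target = pairwise_distances(target)

# Candidate 1: the true correspondence, with uniform mass per word.
true_coupling = np.zeros((n_words, n_words))
true_coupling[perm, np.arange(n_words)] = 1.0 / n_words

# Candidate 2: a wrong guess that pairs word i with word i.
wrong_coupling = np.eye(n_words) / n_words

cost_true = gw_cost(dist_source, dist_target, true_coupling)
cost_wrong = gw_cost(dist_source, dist_target, wrong_coupling)

# Reading off "translations": each source word's highest-mass target word.
matched = true_coupling.argmax(axis=1)

print("cost of true matching: ", cost_true)
print("cost of wrong matching:", cost_wrong)
```

Because the objective compares only distances within each space, the true matching scores essentially zero even though the two point clouds have different coordinates, while the mismatched pairing scores higher. A full system would search over couplings to minimize this cost rather than being handed the answer.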
In experiments, the researchers’ model performed as accurately as state-of-the-art monolingual models -- and sometimes more accurately -- but much more quickly and using only a fraction of the computational power.
"The model sees the words in the two languages as sets of vectors, and maps [those vectors] from one set to the other by essentially preserving relationships," says the paper’s co-author Tommi Jaakkola, a CSAIL researcher and the Thomas Siebel Professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society. "The approach could help translate low-resource languages or dialects, so long as they come with enough monolingual content."
The model represents a step toward one of the major goals of machine translation, which is fully unsupervised word alignment, says first author David Alvarez-Melis, a CSAIL PhD student: "If you don’t have any data that matches two languages... you can map two languages and, using these distance measurements, align them."