Wednesday, May 05, 2010

Which free machine translation works best? The results are in

Some time ago I wrote about the study that Chinese translator Ethan Shen was conducting to compare three different free MT engines (for my earlier articles about this study, see Google, Bing and Babelfish and Google, Bing and Babelfish: some preliminary results).

Ethan has now completed phase 1 of his study, and the results are both interesting and - for me, at least - unexpected. Below you can read a short report on Ethan's study.

If you prefer to have all the details, you can download the full report from Ethan's website.

Real World Comparison of Online Machine Translators

by Ethan Shen
Gabble On Research Project


This paper evaluates the relative quality of three popular online translation tools: Google Translate, Bing (Microsoft) Translator, and Yahoo Babelfish. The results published below are based on a 6-week survey, open to the general internet population, which allowed survey takers to choose any language, enter any free-form text, and vote on the best of all translation results side-by-side. The final data reveals that while Google Translate is widely preferred when translating long passages, Microsoft Bing Translator and Yahoo Babelfish often produce better translations for phrases below 140 characters. Also, in general, Babelfish performs well in East Asian languages such as Chinese and Korean, and Bing Translator performs well in Spanish, German, and Italian.


Most Preferred Engine and Margin of Preference by Language Pair and Text Length Results

The above table describes the relationship between user preferences and translated text character length for 15 single-direction language pairings. The most preferred engine is given at each intersection (Google, Babelfish, or Bing) along with the magnitude of its lead over its closest competitor in that category (colored percentage). The language pairings excluded from this table are those for which preferences were overwhelming (over 100%) or for which insufficient data was available.

From this data, the following conclusions can be drawn:

  1. For long passages of text up to 2000 characters, survey takers generally prefer Google Translate's results across the board.

    a. The extent of Google’s lead varies dramatically from language to language. In some languages such as French, the strength of Google Translate’s engine is overwhelming. However, in several others like German, Italian, and Portuguese, Google holds only a very slim lead when compared to its biggest competitors.

    b. These observations validate our Hypothesis 1 that no single engine can perform equally well across a spectrum of languages or conditions.

  2. The greatest relative strength of the statistics-focused engine (Google Translate) has not clustered around the European Union working languages as expected. German, Italian, and Portuguese, all EU working languages, are the most hotly contested from a performance perspective.

    a. One possible explanation is that large additional bodies of parallel English-French text are available from the government of Canada, whose official documents are translated into both languages. To a lesser extent, this could explain the strength of Google Translate in Spanish, as many Latin American countries offer English translations of official documents.

    b. This data partially refutes Hypothesis 2.

  3. Traditional rules-based translation engines (Babelfish) performed generally well in East Asian languages such as Chinese and Korean.

    a. One possible reason for this outperformance is that language-specific grammar and word-usage rules are more effective than association-based translation in these situations.

    b. These findings are in line with Hypothesis 3, but the data set is not large enough to confirm them in a statistically significant manner.

  4. Across almost every language, Bing Translator and Yahoo Babelfish gain ground on or surpass Google Translate as the text length gets shorter.

    a. In Chinese, the gradual erosion of Google's relative performance as total text length shrinks from 2000 characters to 50 characters is stark, and representative of the comparative strength of rules-based or hybrid translation engines as phrases get shorter and more straightforward.

    b. It appears that the competition between the different translation models is fiercest at 150 characters or less. Similar effects were seen at 200 characters, but to a less significant extent.

    c. Though the data is not shown, a similar effect is seen for passages of only one sentence compared to passages with multiple sentences.

    d. This data strongly validates Hypothesis 4.

  5. The most interesting observation is that translation quality is not a two-way street. The engine that is best for translating in one direction is not necessarily the best tool for translating back the other way.

    a. The two most obvious cases of this are French and German. Though Google Translate dominates when translating both of these languages into English, it faces heavy competition when translating back from English to the foreign language.

These results are taken from a longer full research write-up.
To read the hypotheses, experiment design, extended results, practical applications, and references, the full report is provided here:


  1. Fascinating stuff, thanks for sharing. We will take more time to read this in detail. In the meantime, Babelfish still provides endless amusement to translate back/forth/back/forth/back if we ever need a good laugh.

  2. I would say that we never get a quality translation from machines. Not even 20%. If you translate any technical content, the overall meaning will be wrong!

  3. For a different perspective on this "evaluation" and rating see:

  4. In the end, there is a need for human editing after machine translation. So instead of using auto-translation, I vote for human translation.

  5. Very interesting.

    As the article states, Google Translate uses statistical machine translation, which means it learns by analyzing written text to find patterns. This is why the researchers at Google let the machine analyze billions of words. As a result, the translation quality improved dramatically. But now they're at the point where if they feed Google Translate another 100 billion words, they might only see a small improvement in quality. So it can be safely stated that Google Translate won't be as good as a human for a long long time!

  6. Thanks for your research!
    I found over the years, though, Korean to English translation is far from usable in any machine translation.

  7. Nice analysis! We carried out a similar experience, choosing a different angle - check out our results here:

