Thursday, August 28, 2008

Yet again: Trados fuzzy match woes (Expanded)

Continuing from my post of August 25, some further evidence of just how badly designed the fuzzy matching algorithms are in Trados:

So, according to Trados, "INSTALLING DISPLAY" is a 67% match for "Installing Display", while "Ownership of the Services and Marks." is a 65% match for "Description of the Service and Definitions."

A smarter matching algorithm would give more importance to meaningful words ("Description", "Service", "Definitions") than to grammatical ones ("of" "the" "and"), and would treat a difference between upper and lower case as much less significant than the chance similarity of two sentence structures.
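To make the idea concrete, here is a minimal Python sketch of word-weighted matching. It is not how Trados (or any other CAT tool) computes its scores. The stop-word list, the two weights, and the function names are all assumptions made up for illustration, and case is ignored entirely as a simplification of "much less significant".

```python
import re

# Hypothetical English stop-word list; a real tool would need one per source language.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in"}
STOP_WEIGHT = 0.2     # assumed weight for grammatical words
CONTENT_WEIGHT = 1.0  # assumed weight for every other word


def tokens(segment):
    """Lower-case the segment and split it into word tokens."""
    return re.findall(r"\w+", segment.lower())


def weight(word):
    """Return the assumed weight of a single lower-cased word."""
    return STOP_WEIGHT if word in STOP_WORDS else CONTENT_WEIGHT


def weighted_match(a, b):
    """Weighted overlap between the word sets of two segments (0.0 to 1.0)."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    if not union:
        return 1.0  # two empty segments are treated as identical
    shared = sum(weight(w) for w in set_a & set_b)
    total = sum(weight(w) for w in union)
    return shared / total


if __name__ == "__main__":
    pairs = [
        ("INSTALLING DISPLAY", "Installing Display"),
        ("Installing Display", "Installing the Display"),
        ("Ownership of the Services and Marks.",
         "Description of the Service and Definitions."),
    ]
    for a, b in pairs:
        print(f"{weighted_match(a, b):.0%}  {a!r} vs {b!r}")
```

With these assumed weights, the two "Installing Display" variants score far higher than the "Ownership"/"Description" pair, which only shares "of", "the" and "and". The sketch ignores word order and inflection ("Service" and "Services" count as different words), so it only illustrates the weighting idea and is not a usable matcher.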

By the way, if "Installing Display" is changed to "Installing the Display", it does not come up as a fuzzy match at all (unless the fuzzy threshold is set extremely low), since it becomes a mere 40% match.

The worst thing is that all these problems have been known for years, but Trados (and now SDL/Trados) programmers have done nothing to improve the situation.


  1. Hi Riccardo

    I asked the maker of Metatexis, Mr. Bruns, this question. He tells me that the reason for not implementing this kind of weighted word value in a segment is the same for all CAT tools: not all languages work that way. The proper translation of a long word in Italian can be a very short one in another language. The "long word equals long word" formula only works if both languages are similar.
    Then again, I personally believe that not to be true. Not really knowing how Japanese works, for example, I still assume that the most common expressions are short and that a long name or compound noun would appear longer in Japanese too, but it would be interesting to hear from an expert.

  2. Hi,

    Your examples of fuzzy matches are very interesting.

    Yes, adding linguistic information to the fuzzy matching algorithm would definitely improve the results.

    My only doubt is how much it would cost to add this linguistic information to the 100+ languages supported by TRADOS.

    If the software is going to cost 200 euros more, how much time is it going to save me by giving me better fuzzy matches?

    How much more money am I going to make with the improved fuzzy matches?

    It all boils down again to the question of how statistically relevant these bad fuzzy matches are.

    My personal feeling is that they are not so frequent but I may be wrong.

    Your example with "Installing display" is tricky because, even if the text is identical, a translator will most likely have to change the case to lower case by re-typing it, so 65% is not such a bad indication of how much time it might take to check the match and re-type it, I think.


  3. As regards the different weight to give to certain words as against others, I don't think it's a question of long word/short word: it is true, though, that it implies creating a (short) list of words for each source language. Such lists of words would then be given less weight in the calculation of the fuzzy matches. I'm not a programmer, but this is similar to what is done in concordance programs with the lists of "stop words".

    Would the program cost more because of such changes? Possibly, but I would gladly pay for a better program, whereas, at the moment, I'm seriously considering how to switch to a competitor's program because of these very frustrations.

    As regards my example with "Installing Display" and the difference between upper case and lower case: in MS Word (and Tag Editor) the work involved in changing case would be minimal - just highlight the words to change and use Shift+F3.

  4. Hi

    Very interesting, indeed.

    My feeling is that translators should start to think about implementing more non-proprietary tools.

    If we could count on a good open-source tool that we could collaborate on, then price shouldn't matter anymore, but just how good it gets.

    I think the more effort we put into open-source, freely available, community-supported tools, the more computer-aided translation (CAT) will improve.

    One example of collaboration could be a project to provide the linguistic information Daniel is talking about, in a form that lets the tool using it improve its fuzzy searches.

    These and other web-based, community-driven efforts to create and improve CAT tools are extremely necessary for being able to provide alternatives to the obsolete, proprietary, and expensive tools we are using nowadays.


  5. @ Riccardo: Okay, a stop list of words that should be weighted LESS is probably a good idea, although a combination of both principles ought to be best.

    So, ascribing a higher value to longer words and a lesser value to short ones in the source text AND giving a generally low value to all words in the stop list (filled with the most common prepositions, articles, conjunctions, etc.) should not be that hard to implement at all.

    I can see how this is slightly outside the field of a general CAT program and more in the direction of machine translation, but who cares what the field it belongs to is called?

    We as translators have to get away from translation tools that are designed to primarily serve the purpose of LSP agencies and their clients and find tools that serve the purpose of the translators themselves.

    In my opinion our aim with this should be to get rid of the tedious, repetitious part of translation with the help of computer programs, but to leave the creative part to us, even enhancing it with the possibilities that modern programming can offer.

    If SDL Trados and their copycats get their way, soon we will all be proofreaders of machine translations. I suggest turning the concept around: leave me to do what I like best and am good at, which is creatively finding 'to the point', pleasant-to-read translations of THOUGHTS and CONCEPTS, not just segments, and then give me a machine to help me find words, spellcheck for me and keep a database of my work for future use.

    @ Hector
    You're probably right: unless someone from our own ranks creates a solution, we will always be overrun by the commercial interests of agencies and clients. I disagree that this would have to be open source, but it is a possibility. A good commercial solution can allow for our input just as well, I think.

    - Marinus Vesseur

  6. First of all, your screen shots clearly indicate "INSTALL" vs "installing", so there is some difference there beyond capitalization.

    Out of curiosity, I ran the same sentences in DéjàVuX and got no matches at all, even with the fuzzy setting at 25% (it's normally at 75%).

    Makes me realize how much of DVX's magic works through terminology replacement, not fuzzy matching, which better suits my clientele.

    I don't think I'll change tools anytime soon!

  7. Has anyone got the new Trados 2007 suite edition or an upgrade? Does it have any necessary added features?

  8. Hi all,
    Fuzzy search is much more difficult than that! No known algorithm can solve the problem (see Reinke U.'s article from 1999). What you are expecting is a correspondence of MEANING between segments. That requires moving up to the semantic level and dealing with ambiguity, something like the high-quality MT problem! No programmer can solve this huge problem!

    And, from the point of view of IR (Information Retrieval), when you insist on precision (a high match percentage), you must give up some recall, that is, the coverage of segments hidden further down, and those segments could still give you a good answer.

    And I think that, in business, this difficult function should not be implemented before all the other useful functions that support translators. It's clever of TRADOS to know where best to invest :).

    Besides, fuzzy search can work on other, deeper structures within translation memory segments (like the TELA of Planas E., or the Similis system). One experiment showed that Similis gave more than 20%, while Trados gave 0.5%, in fuzzy search on 20 GB of EU corpus.


  9. Hello Nguyen,

    No, I'm not expecting correspondences of meaning, but I would expect more meaningful words to be given more weight. That would "just" mean an algorithm intelligent enough to recognize that, in English, "a", "the", and so on, are less important for the matching algorithm than nouns and verbs.
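The precision/recall trade-off raised in comment 8 can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny translation memory is invented, and the score comes from Python's standard `difflib.SequenceMatcher` with case folded, which is a generic character-level similarity and not the algorithm used by Trados or any other CAT tool.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory (source segments only), invented for illustration.
MEMORY = [
    "Installing Display",
    "Installing the Display",
    "Removing the Display",
    "Ownership of the Services and Marks.",
]


def score(a, b):
    """Case-insensitive character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(segment, threshold):
    """Return (score, candidate) pairs at or above the threshold, best first."""
    hits = [(score(segment, candidate), candidate) for candidate in MEMORY]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)


if __name__ == "__main__":
    for threshold in (0.95, 0.85, 0.50):
        print(f"threshold {threshold:.2f}:")
        for value, candidate in lookup("INSTALLING DISPLAY", threshold):
            print(f"  {value:.0%}  {candidate}")
```

Raising the threshold leaves only the closest candidates (higher precision), while lowering it brings back more segments, some useful and some not (higher recall). Whatever scoring function is used, the threshold only decides where that cut falls. It cannot make a poorly ranked segment rank well.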

