Monday, October 21, 2013

Concordance blues

CAT tools' concordance features are usually more helpful to translators than fuzzy matching, especially in the long run. I consider my translation memories a growing treasure that I can search to see how I translated something in the past, and for that purpose it doesn't matter whether the segments are similar enough to suggest a fuzzy match.

But sometimes the way concordance works in some CAT tools makes me wonder whether CAT tool programmers realize that when we search for a match in our translation memories, we are not looking for a completely different word.

For example, if I'm trying to find whether or not I had previously translated "rocking strip" (a piece of a thrust bearing), getting a reminder of how I previously translated "backing strip" (the strip of paper that covers an adhesive) is worse than useless:

[Screenshot: Studio 2011 concordance results offering "backing strip" as a match for "rocking strip"]

it just wastes time, without helping me one bit.

The example above is from Studio 2011, but I've seen similar mismatches in other translation tools. (Before anybody comments: no, I've not enabled the "character-based concordance search" for this - or any other - memory.)

I'd really like to know why CAT tool programmers think providing this kind of result could be helpful: what's the rationale behind it? Has any translator asked for this type of matching? Would it be helpful for certain languages? (If so, shouldn't it be enabled only for those languages, rather than for all of them?)

13 comments:

  1. 85% fuzzy for "backing strip" versus "rocking strip"??!! That's heavy discount territory for a lot of the losers who believe that Trados leverage discounts are a reasonable reflection of real effort. MemoQ 6.5 calls this 65%, which is still excessive in my opinion.

    This is just one more reason why the time is long overdue to drop the bogus nonsense that Trados GmbH started years ago, pretending that their similarity algorithms should translate into a specific schedule of discounts. The dirty little not-such-a-secret is that there is little or no consistency in the algorithms used by any tool, and rather than focus on screwing over the people expected to do a reasonable job of linguistic rendering, maybe we should look more closely at different models of match calculation and how useful these really can be for planning work. The unit cost/commodity mentality is distracting us from the more important task of understanding how we can better predict workloads.

    I've found that these fuzzy calculations can also provide important clues for one's work. For example, the fact that the slightest change in actual text content will knock a fuzzy match down to 90% or lower in memoQ has actually helped me avoid fatal mistakes like overlooking a missing "not" in a very long sentence. I know that one can set various kinds of penalties in different tools, but wouldn't it be interesting, and perhaps useful, to consider "penalty profiles" for various kinds of differences, which a translator might apply in appropriate situations for different types of text or tasks? Instead we focus on rather dubious voodoo translation economics.

    1. On the other hand, if my memory had a translation for "ROCKING STRIP" the degree of similarity given would have been a lot less: I've seen that time and again with fuzzy matches.
      Paul Filkin says they have made improvements to the fuzzy-matching algorithms in Studio 2014, but I've not been able to see any yet.
      I believe linguistics-based improvements to both fuzzy matching and concordance are long overdue - and not for SDL products only.

  2. Hi Riccardo,
    I agree, concordance on short selections (one or two words) can be very frustrating. I'd love to be able to use inverted commas to get an exact match. In fact, Boolean search operators in general would produce much more meaningful results.
    On the positive side, I find that concordance on big chunks or whole segments produces very good results, and Studio 2014's automatic concordance now speeds up this process.
    Emma

  3. I can't reproduce this with 2011 or 2014, but if you want to share your translation memory and the source text (or something similar) then I'd be very happy to take a look. The tests I ran, with character-based TMs or not, find "rocking strip" first every time as a 100% match.
    So if something else is causing this it would be good to be able to investigate it a little, and then we can share the result?
    On the fuzzy value... 2 letters changing out of 12... what value would you expect for this? I think 85% is pretty fair and reflects the common distance algorithms.

    1. I was probably not clear in my original post: this is not a case of the concordance not finding something that is present in the TM. "Rocking strip" was not in the memory, so what I expected was for the memory to correctly indicate that there is nothing similar to "rocking strip", rather than saying "would you rather have a serving of backing strip instead?"

    2. Indeed... I hope my second post answered this. I think you just need to be clear about what you want from the menu. So make sure you tell the waiter nothing else will do before you start ;-)

    3. "On the fuzzy value... 2 letters changing out of 12... what value would you expect for this? I think 85% is pretty fair and reflects the common distance algorithms."

      I've long felt that the algorithms used (not only by SDL) should be radically improved, making them more intelligent and incorporating linguistic features, and this is a good example of why. You say that 85% is pretty fair because it is 2 letters out of 12. Maybe so, but it is also completely useless: a better algorithm would recognize this as one word changed out of two (a 50% match).

      Please bear in mind something that I did not mention in my post, but that Kevin did in his comment: fuzzy matches are used by many agencies to calculate discounts from the regular rate. A supposedly "85% match" that is, in fact, no match at all, may cost translators money.
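To make the two ways of counting concrete, here is a minimal sketch (in Python, purely illustrative, and certainly not SDL's or memoQ's actual scoring code) of a character-based edit-distance score versus a word-based one for this very pair:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance; works on strings or word lists."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def char_similarity(a, b):
    # Two substituted letters in "rocking strip" vs "backing strip"
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def word_similarity(a, b):
    # Same edit distance, but counted over whole words
    wa, wb = a.split(), b.split()
    return 1 - levenshtein(wa, wb) / max(len(wa), len(wb))
```

On "rocking strip" versus "backing strip" the character-based score comes out around 85%, while the word-based score is 50%: exactly the gap being argued about above.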

    4. 50%... I can see where you're coming from, of course. But perhaps this isn't the way to tackle it? A 50% match doesn't really reflect the difference in terms of the technical changes to the words. So if changing this requires as much effort as changing, say, "strip" to "material", which is your point, then surely the answer is to renegotiate the rate bands? Treat all fuzzies as 50% for payment, rather than have the tools produce a meaningless measurement. The analysis bands are deliberately flexible in this regard.

    5. For those customers to which I do give a discount for fuzzies I already have a single band (either 85-99% or, for a few older customers, 75-99%), so negotiating different bands would not be a solution.
      I'm still curious why fuzzy matches are calculated on the basis of characters instead of words, even when one explicitly does not select "character-based concordance search".

    6. I think you're probably confusing terms here. The search mechanism for concordance is either word-based or character-based. Character-based search is "fuzzier" than normal concordance search, so it can find misspelled words and inflected forms or variants more robustly. To support it, character-based indices are created and maintained. Many users in the olden days of Trados complained that normal concordance search was not fuzzy enough... so it was introduced. But it can increase the database size considerably, and on large TMs probably prohibitively so. That is why we provide the option.

      The scoring, on the other hand, is based on a sort of edit-distance calculation, and this will be the same whether the search was character-based or not. Don't ask me what the scoring algorithm is exactly, because I haven't got a clue! But I'd say we use an algorithm which is optimized to deliver appropriate scores in most situations. It tries to avoid scores that are too low or too high, but still takes (the amount of) differences in punctuation, tags, and whitespace into account to produce as reasonable a value as possible. I guess we could discuss this, and how different tools represent the scoring for different types of texts in different ways, for a very long time.

      So the match values are just the representation of the search results compared to what you were looking for. The ability to get more or less search results is controlled by your choice of how the TM is indexed in the first place.
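For readers curious how a character-based index can surface misspellings that a plain word index misses, here is a toy illustration (hypothetical Python, in no way SDL's implementation) using character trigrams, one common technique for this kind of fuzzier lookup:

```python
from collections import defaultdict

def trigrams(text):
    """Overlapping 3-character windows over the padded, lowercased text."""
    t = f"  {text.lower()} "
    return {t[i:i + 3] for i in range(len(t) - 2)}

def build_index(segments):
    """Map each trigram to the set of segment ids containing it."""
    index = defaultdict(set)
    for seg_id, text in enumerate(segments):
        for gram in trigrams(text):
            index[gram].add(seg_id)
    return index

def search(index, segments, query, min_overlap=0.5):
    """Return segments sharing at least min_overlap of the query's trigrams."""
    q = trigrams(query)
    hits = defaultdict(int)
    for gram in q:
        for seg_id in index[gram]:
            hits[seg_id] += 1
    return [segments[s] for s, n in hits.items() if n / len(q) >= min_overlap]
```

A misspelled query like "bakcing strip" still shares most of its trigrams with a segment containing "backing strip", so it is found; a word-for-word index would return nothing. The price, as noted above, is a much larger index.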

  4. One more thing I forgot to mention... the minimum fuzzy match value is entirely down to you. The default is probably 70% (I think), but if you set it to 100% then you'll never ever see "backing strip" for "rocking strip" again! So if fuzzy values give you the blues, simply don't ask for them in the first place.

    1. But is it possible to set different matching thresholds for fuzzy matches and for concordance searches?

    2. Of course. When you go to the search settings the pane on the right holds the Translation settings and then under that the Concordance settings.
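The thresholding described in this exchange amounts to a simple filter over scored results. A hypothetical sketch (the tuples stand in for what a TM lookup might return; this is not Studio's real settings API):

```python
def filter_matches(matches, minimum_match):
    """Keep only results at or above the minimum match value."""
    return [(text, score) for text, score in matches if score >= minimum_match]

# Scores like those in the screenshot discussed above
results = [("backing strip", 0.85), ("rocking strip", 1.00)]
```

With minimum_match=1.0 the spurious 85% "backing strip" hit disappears; at a default of 0.70 it comes back, which is why separate thresholds for translation lookup and concordance matter.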

