Tuesday, November 18, 2014

Studio 2014 SP2: one step forward and one backward

SDL has just released Studio 2014 SP2. This upgrade no longer relies on Java, and should therefore fix all Java-related issues that have plagued the use of MultiTerm in Studio. So, thank you to SDL for finally fixing the Java problem.
If you read through the release notes of SP2, however, in addition to various improvements, there is also a major new issue:
11. Improved word count and search logic for words containing apostrophes and dashes
Studio 2014 SP2 uses an improved algorithm for processing words that contain dashes (-) or apostrophes (‘). This improvement translates into:
Lower word count. Studio no longer treats apostrophes and dashes as word separators, but as punctuation marks that link words together. This means that Studio counts elements like “it’s” or “splash-proof” as one single word.
I can see why certain translation agencies would consider this an “improved” algorithm, and welcome such a misfeature (just another way to pay those pesky translators less). But why should translators consider this an improvement?
I’ve run a test on a short MS Word file I created from a Wikipedia article (I have it available, if anybody wants to repeat my test).
The results are as follows:
  • Baseline: manual word count: 195 words
  • Trados 2007: 198 words (+1.5%)
  • Studio 2011: 195 words (=)
  • Studio 2014 SP1: 193 words (-1.0%)
  • memoQ 2014: 190 words (-2.6%)
  • MS Word 2010: 190 words (-2.6%)
  • Studio 2014 SP2: 188 words (-3.6%)
As you can see, a translator who used to be paid based on a Trados 2007 word count would concede to the translation agency a 5.1% discount just by using 2014 SP2 instead.
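For anyone who wants to check the percentages, here is a minimal Python sketch that recomputes them from the counts listed above. The helper name `pct_diff` is mine and purely illustrative; the only data used are the word counts from my test.

```python
# Word counts reported above for the same test file.
BASELINE = 195  # manual word count
COUNTS = {
    "Trados 2007": 198,
    "Studio 2011": 195,
    "Studio 2014 SP1": 193,
    "memoQ 2014": 190,
    "MS Word 2010": 190,
    "Studio 2014 SP2": 188,
}


def pct_diff(count, reference):
    """Percentage difference of `count` relative to `reference`, to one decimal."""
    return round((count - reference) / reference * 100, 1)


if __name__ == "__main__":
    for tool, count in COUNTS.items():
        print(f"{tool}: {count} words ({pct_diff(count, BASELINE):+.1f}%)")

    # Moving from a Trados 2007 count to a Studio 2014 SP2 count:
    loss = pct_diff(COUNTS["Studio 2014 SP2"], COUNTS["Trados 2007"])
    print(f"Trados 2007 -> Studio 2014 SP2: {loss:+.1f}%")  # -5.1%
```

The last figure is computed against the Trados 2007 count (10 words out of 198), since that is the count the translator used to be paid on.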

What seems to be happening with words that may be counted differently

A subset of the file I used for the word count includes the following:
It’s
mid-16th century
Prince-electors
The others who were left in the keep—men, women and children—were killed.
According to my manual word count these are 21 words (I count two words each for “it’s”, “mid-16th”, “Prince-electors”, and of course I count as separate words “keep”, “men”, “children”, and “were”.)
According to MS Word, these are 18 words: it counts as single words “it’s” and the two hyphenated terms “mid-16th” and “Prince-electors”; however, it correctly counts as separate words “keep” and “men”, “children” and “were”.
According to Studio 2014 SP2, however, these are 16 words: Studio 2014 SP2 is not only counting as single words “It’s” and the two hyphenated terms, but it also counts as single words those that are separated by an em dash.
So either SDL’s programmers don’t know the difference between a hyphen and a dash and how they are used, or the way they have implemented the change contains a bug. The former option is suggested by SDL's own release notes, which say:
Studio 2014 SP2 uses an improved algorithm for processing words that contain dashes (-) [...] This means that Studio counts [...] “splash-proof” as a single word.
“Splash-proof”, of course, does not contain a dash: it contains a hyphen, and the distinction is important, especially when not knowing the difference between a dash and a hyphen results in a lowered word count.
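The three counts for the four-line subset above (21, 18 and 16) can be reproduced with a small Python sketch. This is only a model of the behavior I observed, not the actual algorithm used by MS Word or by Studio: the function `count_words` and its `joiners` parameter are hypothetical names of my own, and the rule is simply "a word is a run of letters or digits, and any character listed in `joiners` glues neighboring runs into one word".

```python
import re

# The four-line subset of the test file (curly apostrophe, em dashes).
SAMPLE = (
    "It\u2019s\n"
    "mid-16th century\n"
    "Prince-electors\n"
    "The others who were left in the keep\u2014men, women "
    "and children\u2014were killed."
)

APOSTROPHES = "'\u2019"  # straight and curly apostrophe
HYPHEN = "-"             # hyphen-minus
EM_DASH = "\u2014"


def count_words(text, joiners=""):
    """Count runs of word characters; characters in `joiners` link runs."""
    if joiners:
        pattern = r"\w+(?:[" + re.escape(joiners) + r"]\w+)*"
    else:
        pattern = r"\w+"
    return len(re.findall(pattern, text))


if __name__ == "__main__":
    # Manual count: apostrophes, hyphens and dashes all separate words.
    print(count_words(SAMPLE))                                  # 21
    # MS Word-like: apostrophes and hyphens link, dashes separate.
    print(count_words(SAMPLE, APOSTROPHES + HYPHEN))            # 18
    # What Studio 2014 SP2 appears to do: the em dash links as well.
    print(count_words(SAMPLE, APOSTROPHES + HYPHEN + EM_DASH))  # 16
```

The only difference between the second and third call is whether the em dash is in the list of linking characters, which is exactly the two-word difference between the MS Word count and the Studio 2014 SP2 count.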

UPDATE

According to SDL's release notes, dashes should actually be counted correctly:
Dashes that do not follow the new logic:
  • Figure dash (‒) 
  • En dash (–) 
  • Em dash (—) 
  • Horizontal bar (―) 
  • Small Em dash (﹘)
However, my test shows that this is not the case. Try copying "The others who were left in the keep—men, women and children—were killed" into a Word file and running an analysis in Studio 2014 SP2: you'll see that the two dashes are treated as hyphens, and that the sentence, which contains 14 words, is reported as 12 words.
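If you repeat this test, it is worth confirming that your file really contains em dashes and not hyphens that merely look like them. A minimal sketch using Python's standard `unicodedata` module follows; the function name `find_dashes` is my own, and the check relies on the Unicode "Pd" (dash punctuation) category, which covers both the hyphen-minus and the dashes listed in the release notes.

```python
import unicodedata

# The dashes the release notes list as excluded from the new logic.
EXCLUDED_DASHES = ["\u2012", "\u2013", "\u2014", "\u2015", "\uFE58"]


def find_dashes(text):
    """Return (code point, Unicode name) for every dash-like character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if unicodedata.category(ch) == "Pd"
    ]


if __name__ == "__main__":
    for ch in ["-"] + EXCLUDED_DASHES:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    sentence = (
        "The others who were left in the keep\u2014men, women "
        "and children\u2014were killed"
    )
    print(find_dashes(sentence))        # two EM DASH entries
    print(find_dashes("splash-proof"))  # one HYPHEN-MINUS entry
```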


5 comments:

  1. Hi, Riccardo :)
    Could you give me the link to the Wikipedia article so I can run a test of my own? At my company, I've created a word counter (http://www.kennistranslations.com/UK/KennisCounter.aspx), and I'd like to run a test and update the counter based on your notes about apostrophes and dashes.
    Thank you so much.

  2. Hi José,

    The file is slightly modified from the Wikipedia article (I added a sentence to include the words "It's", since those were not in the article).

    If you give me your address, I can send you the file(s) I used for testing... but you shouldn't consider this a thorough test: it was just something quick to see how different tools counted the words in the same file.

    Replies
    1. Hello Riccardo,

      I wanted to clarify a couple of things in here. First of all I changed the apostrophes in your test file to straight apostrophes and ran the comparison again.

      2011 - 188 words
      2014 - 188 words

      I also reran the test with your file in the latest build of 2011. This actually reports this:

      2011 - 193 words
      2014 - 188 words

      This happens because the only change that has really been made between Studio 2011 and Studio 2014 SP2 is that curly apostrophes are now treated the same as straight apostrophes. This is to ensure that we have consistency between all of our products (and closer to others too) in the way these characters are treated. So in your text these are now one word instead of two:

      It’s
      It’s
      Godesburg’s
      Cologne’s
      region’s

      "It's" will continue to be a hotly debated topic as we know. But what about the last three? Are these two words too?

      Unfortunately the release notes were incorrectly written and these will be changed. Studio is still handling hyphens/dashes as it always did, so there are no changes here at all. I should also mention that there are inconsistencies here too, so we can expect to see further improvements to allow us to provide as clear a wordcount as possible. This will all be done with feedback from our Beta testers, from our users, and from you, Riccardo, to ensure we are as fair as possible.

      In terms of the actual changes to the overall analysis we have of course added alphanumeric handling which will affect the placeable count and I will update my article at some point soon to reflect the changes there. But generally I expect the alphanumerics to make life easier overall.

      So I think the problem here is mostly caused by us getting the release notes wrong, and then this change to ensure handling of apostrophes is consistent has affected the wordcount depending on the content. Fairly or unfairly? I don't think it's quite as clearcut as your article suggests and the title does suggest things are a lot worse than they actually are. This is very unfortunate in my opinion because the benefits to translators in this release are significant.

      Regards

      Paul

  3. This comment has been removed by a blog administrator.

  4. This latest version works like a charm!

