Tuesday, November 18, 2014

Studio 2014 SP2: one step forward and one backward

SDL has just released Studio 2014 SP2. This upgrade no longer relies on Java, and should therefore fix all Java-related issues that have plagued the use of MultiTerm in Studio. So, thank you to SDL for finally fixing the Java problem.
If you read through the release notes of SP2, however, in addition to various improvements, there is also a major new issue:
11. Improved word count and search logic for words containing apostrophes and dashes
Studio 2014 SP2 uses an improved algorithm for processing words that contain dashes (-) or apostrophes (‘). This improvement translates into:
Lower word count. Studio no longer treats apostrophes and dashes as word separators, but as punctuation marks that link words together. This means that Studio counts elements like “it’s” or “splash-proof” as one single word.
I can see why certain translation agencies would consider this an “improved” algorithm, and welcome such a misfeature (just another way to pay those pesky translators less). But why should translators consider this an improvement?
I’ve run a test on a short MS Word file I created from a Wikipedia article (I have it available, if anybody wants to repeat my test).
The results are as follows:
  • Baseline: manual word count: 195 words
  • Trados 2007: 198 words (+1.5%)
  • Studio 2011: 195 words (=)
  • Studio 2014 SP1: 193 words (-1.0%)
  • memoQ 2014: 190 words (-2.6%)
  • MS Word 2010: 190 words (-2.6%)
  • Studio 2014 SP2: 188 words (-3.6%)
As you can see, a translator who used to be paid based on a Trados 2007 word count would concede to the translation agency a 5.1% discount (188 words instead of 198) just by using 2014 SP2 instead.
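For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the percentages from the word counts listed above. The helper name `pct_change` is my own; nothing here comes from any of the tools.

```python
# Word counts reported in the test above.
BASELINE = 195  # manual word count

counts = {
    "Trados 2007": 198,
    "Studio 2011": 195,
    "Studio 2014 SP1": 193,
    "memoQ 2014": 190,
    "MS Word 2010": 190,
    "Studio 2014 SP2": 188,
}


def pct_change(new, reference):
    """Percentage difference of `new` relative to `reference`."""
    return (new - reference) / reference * 100


# Deviation of each tool from the manual count.
for tool, words in counts.items():
    print(f"{tool}: {words} words ({pct_change(words, BASELINE):+.1f}%)")

# Discount conceded when billing moves from Trados 2007 to Studio 2014 SP2.
discount = -pct_change(counts["Studio 2014 SP2"], counts["Trados 2007"])
print(f"Discount versus Trados 2007: {discount:.1f}%")
```

Note that the per-tool percentages are relative to the manual count, while the discount is relative to the Trados 2007 count, since that is what the translator used to be paid on.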

What seems to be happening with words that may be counted differently

A subset of the file I used for the word count includes the following:
It’s
mid-16th century
Prince-electors
The others who were left in the keep—men, women and children—were killed.
According to my manual word count, these are 21 words (I count two words each for “it’s”, “mid-16th”, and “Prince-electors”, and of course I count “keep”, “men”, “children”, and “were” as separate words).
According to MS Word, these are 18 words: it counts as single words “it’s” and the two hyphenated terms “mid-16th” and “Prince-electors”; however, it correctly counts as separate words “keep” and “men”, “children” and “were”.
According to Studio 2014 SP2, however, these are 16 words: Studio 2014 SP2 is not only counting as single words “It’s” and the two hyphenated terms, but it also counts as single words those that are separated by an em dash.
So either SDL’s programmers don’t know the difference between a hyphen and a dash and how they are used, or the way they have implemented the change contains a bug. The former option is suggested by SDL's own release notes, which say:
Studio 2014 SP2 uses an improved algorithm for processing words that contain dashes (-) [...] This means that Studio counts [...] “splash-proof” as a single word.
“Splash-proof”, of course, does not contain a dash: it contains a hyphen, and the distinction is important, especially when not knowing the difference between a dash and a hyphen results in a lowered word count.
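The three counts can be reproduced with a rough sketch like the one below. To be clear, this is not how MS Word or Studio actually tokenize text (I have no access to their code); it only assumes that each count differs in which characters are treated as word separators, and the names `count_words`, `manual`, `word_like` and `sp2_like` are mine.

```python
import re

# Sample lines from the test file (\u2019 = right single quote, \u2014 = em dash).
SAMPLE = [
    "It\u2019s",
    "mid-16th century",
    "Prince-electors",
    "The others who were left in the keep\u2014men, women and children"
    "\u2014were killed.",
]

APOSTROPHES = "'\u2019"
HYPHEN = "-"
EM_DASH = "\u2014"


def count_words(lines, separators):
    """Count runs of characters that are neither whitespace nor a separator."""
    pattern = re.compile("[^\\s" + re.escape(separators) + "]+")
    return sum(len(pattern.findall(line)) for line in lines)


# Manual count: apostrophes, hyphens and em dashes all separate words.
manual = count_words(SAMPLE, APOSTROPHES + HYPHEN + EM_DASH)
# Behavior observed in MS Word: only the em dash separates words.
word_like = count_words(SAMPLE, EM_DASH)
# Behavior observed in Studio 2014 SP2: nothing but whitespace separates words.
sp2_like = count_words(SAMPLE, "")

print(manual, word_like, sp2_like)  # 21 18 16
```

The only difference between the last two counts is whether the em dash is in the separator set, which is exactly the hyphen-versus-dash distinction discussed above.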

UPDATE

According to SDL's release notes, dashes should actually be counted correctly:
Dashes that do not follow the new logic:
  • Figure dash (‒) 
  • En dash (–) 
  • Em dash (—) 
  • Horizontal bar (―) 
  • Small Em dash (﹘)
However, my test shows that this is not the case. Try copying "The others who were left in the keep—men, women and children—were killed" into a Word file and running an analysis in Studio 2014 SP2: you'll see that the two dashes are treated like hyphens, and that Studio reports 12 words for a sentence that contains 14.
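As a sketch of what the documented rule should produce, the snippet below treats the five dash characters listed in the release notes as separators and leaves hyphens and apostrophes alone. The function name `count_documented` is hypothetical, and the code is only my reading of the release notes, not SDL's implementation.

```python
# The five characters the release notes list as not following the new logic:
# figure dash, en dash, em dash, horizontal bar, small em dash.
DOCUMENTED_SEPARATOR_DASHES = "\u2012\u2013\u2014\u2015\uFE58"


def count_documented(text):
    """Count words, treating the listed dashes as word separators."""
    table = str.maketrans({ch: " " for ch in DOCUMENTED_SEPARATOR_DASHES})
    return len(text.translate(table).split())


sentence = (
    "The others who were left in the keep\u2014men, women and children"
    "\u2014were killed"
)

print(count_documented(sentence))        # 14, what the release notes imply
print(len(sentence.split()))             # 12, what Studio 2014 SP2 reports
print(count_documented("splash-proof"))  # 1, hyphenated words stay joined
```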


Friday, November 14, 2014

Some additional answers about Xbench

At the ATA Conference in Chicago I gave a presentation on how to use Xbench for terminology management and translation QA (you can see and download the presentation from the Xbench tab in this blog).

I believe that the presentation was well received, and that most people found the program very useful, but I was stumped by a few questions. I've now inquired with the Xbench developers at ApSIC, and they have provided the missing information:

Q. Is Xbench compatible with languages that use non-Roman alphabets (e.g., languages that use the Cyrillic alphabet)?
A. Yes, Xbench 3.0 uses Unicode, and is therefore compatible with other alphabets.

Q. Is Xbench compatible with double-byte languages?
A. Xbench's compatibility with double-byte languages is quite good (Japan is ApSIC's largest customer base after Spain, Korea is quite big as well, and China is the country with the most active users and downloads), but there are some caveats. Xbench does not have heuristics in place to identify words within DBCS strings, so some features that rely on whole-word identification do not work well (for example, if Chinese is the source language in a key terms check).

Q. Is Xbench compatible with bi-directional languages?
A. With Xbench 3.0 build 1266 (the current build as of now), compatibility is still poor, but ApSIC is actively working to improve bi-directional compatibility.

Q. What are the size limits for files loaded in Xbench?
A. For the 32-bit version, there is a limit of 2GB per file (and a maximum of 2 or 4 GB for all loaded files). For the 64-bit version, the limit is the available memory and available swap disk. ApSIC recommends installing the 64-bit version if you run 64-bit Windows. The 64-bit version used to have a limitation of 2GB per file (however, with an unlimited number of files), but that limitation has now been lifted, and files in excess of 2GB should work.

Please note that all these answers refer to version 3.0 of Xbench (the commercial version of the program).