Monday, May 23, 2005

Interesting article on Google Translator

Google Blogoscoped (Philipp Lenssen) has an interesting article on the current state of the Google machine translation system: Google Translator: The Universal Language.

The article prompted some lively discussion, and was followed by another interesting article on the Qwikly.com blog.

Tuesday, May 17, 2005

How to use wildcard and format searches in MS Word to make sure all your numbers are formatted correctly

(c) Riccardo Schiaffino, 2005

Introduction: The Problem

A known drawback of translating using Trados is that segments which contain only numbers cannot be opened in the translation memory tool.
This can be a problem when the document to translate contains tables of numbers: for example, you might be translating English into Italian, and you want to make sure that all numbers are formatted correctly, with a comma to separate decimals and a dot to separate thousands.

Makeshift or wrong solutions

Of course, once you have completed your translation, you can go back and manually change all those dots into commas, and the commas that separate the thousands into dots... but unless there are only a few numbers, this is a very boring and error-prone activity (are you sure you are not leaving anything behind? ... there were several pages with numeric tables: are you really sure?)
Next you might think that a simple search and replace may solve your problem: search for ".", replace with "," and... wait a minute: this would really mess up punctuation everywhere, wouldn't it?

A better approach: regular expressions

Maybe a more refined search?
Now we are on the right track: a good solution would be to use a regular expression search (which, in MS Word, is called a "wildcard" search).
Regular expressions are wildcards on steroids. When we think of wildcards, we normally think of "*" to mean "any number of characters" and "?" to mean "any single character" (for instance, if you search for a file in Windows and your search string is "*.doc", you'll find all files with a "doc" extension, while if you search for "?and.doc" you may find "wand.doc", "land.doc", etc.). With regular expressions you can do that, and much more.
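For readers who like to experiment outside Word, the two classic wildcards can be tried out in a few lines of Python. This is only a small sketch: the file names are made up for illustration, and it uses Python's standard fnmatch module rather than the Windows search itself.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the two wildcards.
files = ["wand.doc", "land.doc", "stand.doc", "notes.txt"]

# "*" matches any run of characters, so "*.doc" keeps every .doc file.
doc_files = [name for name in files if fnmatchcase(name, "*.doc")]

# "?" matches exactly one character, so "?and.doc" rejects "stand.doc".
and_files = [name for name in files if fnmatchcase(name, "?and.doc")]

print(doc_files)
print(and_files)
```

Note that "stand.doc" passes the first pattern but not the second, because "?" stands for one character only.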

A simple regular expression search

As an example of a regular expression (or "wildcard") search, if we go back to our original problem, we can perform the following search (with the "Use wildcards" option checked in the Find and Replace dialog):
  1. In the Find field type "([0-9]).".
This means "any digit, followed by a dot".
  2. In the Replace field type "\1,".
This means "replace whatever digit you have found with the same digit, but followed by a comma instead of a dot".
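The same search can be sketched outside Word, for instance in Python. This is only an illustration, not part of the Word procedure: the function name is made up, and the syntax differs slightly, because in Python's regular expressions the dot is itself a wildcard and has to be escaped to mean a literal dot.

```python
import re

def dots_to_commas(text: str) -> str:
    """Replace a dot that follows a digit with a comma (hypothetical helper)."""
    # ([0-9]) captures the digit; \. is a literal dot; \1 puts the digit back.
    return re.sub(r"([0-9])\.", r"\1,", text)

print(dots_to_commas("123.11"))
print(dots_to_commas("See table 3."))
```

The second call shows why the search needs refining: a full stop that happens to follow a digit in running text is changed as well.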

A few necessary refinements

First of all, we don't want our search to also find numbers in the source language segments or in segments that we translated: the source language segments should be left as they are, and in the translated ones we have presumably already taken care of correctly formatting the numbers embedded in the text.
Add color to your search
One good way to achieve this is to add color to your search: if we have set Trados up so that different types of segments use different colors (for instance blue for source language text, dark green for 100% matches, etc.), we can limit our search and replace operation to text that uses the default ("automatic") font color: this would be the part of the document that has not been opened by Trados, i.e., our numeric tables.
In order to do this, add a Format search to both the Find and the Replace field:
  1. Click the More button, if your Find and Replace dialog is not already expanded
  2. Click the Format button
  3. Select Font in the drop-down list
  4. In the Find Font dialog, click on the Font color drop-down list and select "automatic" as the color
  5. Click OK
Refine the wildcard search
Besides changing the decimal dot used in English into the comma used in Italian, we also need to change the English thousands separator (the comma) into the Italian one (the dot), in order not to end up with something like "1,411,12".
Suppose, for example, that the numbers we need to reformat are as follows:

123.11
1,411.12
321.03
1,241,345.41

To reformat numbers like these, we need to perform our search and replace in three stages.
First, search for all the thousands separators, and replace them with an arbitrary symbol (not yet a dot):

  1. In the Find field type ",([0-9]{3})", i.e., "search for a comma, followed by three digits"
  2. In the Replace field type "##\1", i.e., "two '#' characters, followed by the three digits we found" (you can, of course, use other symbols instead of "##", so long as they are not likely to be used in the document where you are performing your search)
At this point our example numbers will be as follows:
123.11
1##411.12
321.03
1##241##345.41
Then search for the decimal dots, and replace them with commas, as we did in our simple regular expression search above:
  1. In the Find field type "([0-9]).".
  2. In the Replace field type "\1,".
Our example numbers will have changed to:
123,11
1##411,12
321,03
1##241##345,41
Finally, search for our arbitrary marker "##" and replace it with the correct thousands separator (the dot):
  1. In the Find field type "##([0-9]{3})", i.e., "search for two '#'s, followed by three digits"
  2. In the Replace field type ".\1", i.e. "a dot, followed by the three-digit number we found".
Our example numbers are now correct for Italian:
123,11
1.411,12
321,03
1.241.345,41
It is fairly easy to change the above searches so as to format your numbers to suit your target country's standards.
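The three stages can also be sketched in Python, which makes it easy to check the patterns on a few sample numbers before running them in Word. This is a minimal illustration under stated assumptions: the function name and the PLACEHOLDER constant are made up, and Python's regular expression syntax is used instead of Word's (so the literal dot is escaped).

```python
import re

# Arbitrary marker, assumed not to occur anywhere in the text being processed.
PLACEHOLDER = "##"

def english_to_italian_numbers(text: str) -> str:
    """Convert 1,241,345.41 style numbers to 1.241.345,41 (hypothetical helper)."""
    # Stage 1: thousands commas become the placeholder.
    text = re.sub(r",([0-9]{3})", PLACEHOLDER + r"\1", text)
    # Stage 2: decimal dots become commas.
    text = re.sub(r"([0-9])\.", r"\1,", text)
    # Stage 3: the placeholder becomes the Italian thousands separator (a dot).
    text = re.sub(re.escape(PLACEHOLDER) + r"([0-9]{3})", r".\1", text)
    return text

for number in ["123.11", "1,411.12", "321.03", "1,241,345.41"]:
    print(number, "->", english_to_italian_numbers(number))
```

The placeholder is what keeps the stages from undoing each other: if commas were turned into dots directly, the second stage could no longer tell thousands separators from decimal separators.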

Another wildcard search: how to exclude 100% matches from editing

Another occasion when I find wildcard searches useful is when I have to edit a file that my customer sends me already pre-translated in Trados (but not X-Translated), with the instruction that 100% matches should not be touched (and will not be paid).
In this case I use the following search string to find only 0% or fuzzy matches and skip the 100% matches:
\<\}[0-9]{1,2}\{\>
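This tells Word to search for a "<}", followed by a one- or two-digit number (but not a three-digit number), followed by "{>". The backslashes are needed because "<", ">", "{" and "}" all have a special meaning in Word wildcard searches.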
With this search string Word will find all the delimiters between SL and TL, but only in segments up to 99% match, while skipping all 100% matches.
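To see what this pattern does and does not match, it can be tried on a sample string in Python. This is a sketch only: the sample text is invented, the exact layout of the segment markers around the source and target text is an assumption, and the pattern is rewritten in Python's regular expression syntax, where "<" and ">" need no backslash.

```python
import re

# Python equivalent of the Word search string: "<}", one or two digits, "{>".
NON_EXACT_DELIMITER = re.compile(r"<\}([0-9]{1,2})\{>")

# Invented sample; the marker layout around each segment is an assumption.
sample = (
    "{0>Source one<}100{>Target one<0} "
    "{0>Source two<}85{>Target two<0} "
    "{0>Source three<}0{>Target three<0}"
)

# Collect the match percentage of every delimiter the pattern finds.
scores = [int(match.group(1)) for match in NON_EXACT_DELIMITER.finditer(sample)]
print(scores)
```

The delimiter with "100" is skipped because three digits cannot fit between "<}" and "{>" when the pattern allows at most two.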

Note:
If you decide to use this search string (or a similar one) while editing files translated with Trados with Workbench open, be aware of a bug: if you open the segment in Workbench before first closing the Find window, the segment will be opened with corrupt characters ("-{}-") at the beginning, and it will not be possible to close it (error message: "no segment appears to be open").
My workaround for this is:


  1. Do all corrections possible without opening the segments in Workbench
  2. When a segment has to be opened, be sure to close the Find window first.

Conclusion

I hope to have given you an idea of the kind of things you can do with regular expression searches, and of how useful they can be.
There is a lot more that you can do with regular expression searches in MS Word (and in other tools). You can find a good introduction to wildcard searches in MS Word on the following web page: http://word.mvps.org/FAQs/General/UsingWildcards.htm. A good introductory book is "Regular Expressions in 10 Minutes", by Ben Forta (Sams Teach Yourself series); it covers regular expressions in general, of which Word wildcard searches are only a subset.