Tuesday, May 02, 2006

Another Useful Wildcard Search

(c) Riccardo Schiaffino 2006
When working with Trados and MS Word, I often take advantage of the fairly powerful wildcard search options of Word - which are really a scaled-down and non-standard version of regular expressions. In a previous article from last year (How to use wildcard and format searches in MSWord to make sure all your numbers are formatted correctly), I showed how wildcard searches could be used to make sure that numbers in a translation are formatted properly according to the target language rules. In this post we are going to see another way wildcard searches may be of use to translators when working with Trados.

One of the tasks I regularly use MS Word wildcard searches for is to make sure that index entries in FrameMaker .mif files that I'm translating as rtf (after conversion with S-Tagger) are formatted correctly: according to the style guide I have to follow, index entries in Italian should normally start with a lower-case letter (unless they are the name of some program).

Problem is, index entries are, by their very nature, standalone segments (which normally start with an upper case letter), and also segments that are very likely to be used elsewhere: "Program Installation" may be a section title, and, at the same time, an index entry in English. In Italian, however, I need to have "Installazione programma" for the title, and "installazione programma" for the corresponding index entry.

Working in Trados with a large memory, with segments that come from other translators and other projects, it often happens that the various index entries are already translated from perfect matches, quite likely with the wrong mix of upper and lower case.

I thought that the best solution would be some search string able to find only index entries that, in Italian, begin with an upper case letter. At that point I could manually make them lower case by pressing Shift+F3 (Word's change-case shortcut), or leave them as is when they actually needed to be upper case.

The first part of the search string was going to be the easier one, as all index entries begin with either the <il> or <ie> markup.

So I knew that my search string needed to begin with

\<i[el]\>

This means:

  • \< - Find all the strings that begin with the "open markup" sign (the open angle bracket "<"); the backslash character "\" indicates that the character that follows needs to be taken literally, and is necessary because the angle bracket characters otherwise have special meaning within wildcard searches.

  • i - Followed by an "i"

  • [el] - Followed by either an "e" or an "l" (the square brackets surrounding "el" group the alternate valid characters. <ie> and <il> are two markups that precede index entries in .mif files)

  • \> - Followed by the "close markup" sign (the close angle bracket ">").
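
For readers who want to try this first piece outside Word, here is a minimal sketch of the same idea in Python. It uses Python's re module, which is not Word's wildcard engine: in Python the angle brackets are ordinary characters and need no backslash. The names INDEX_TAG and samples are made up for the illustration.

```python
import re

# Word wildcard:  \<i[el]\>      Python regex:  <i[el]>
# (angle brackets have no special meaning in Python's re module)
INDEX_TAG = re.compile(r"<i[el]>")

# Hypothetical test strings: only the first two are index-entry markups
samples = ["<ie>", "<il>", "<ix>", "<i>", "ie"]

matches = [s for s in samples if INDEX_TAG.fullmatch(s)]
print(matches)  # ['<ie>', '<il>']
```

The square brackets behave the same way in both tools: exactly one character, chosen from the listed alternatives.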

Now we need to search beyond the entire English source segment, whatever it contains, until we reach the first letter of the Italian one. In order to do this, we can take advantage of the Trados source segment delimiters "{0>" and "<}0{>".

Therefore the search string needs to continue with

\{0\>[A-Za-z,;:\-\*\!\?\(\)\\\/"'=.£%&+\@#°_ 0-9]{1,255}\<\}[0-9]{1,3}\{\>

This looks quite complicated and unreadable (fine-tuning this part of the search string took quite a long time, and it probably is still not perfect). It means:

  • \{0\> - Trados markup to indicate the beginning of the source language string (the first backslash character indicates that the open bracket "{" needs to be taken literally, since on its own it has other uses within the wildcard search, as we shall see presently)
  • [A-Za-z,;:\-\*\!\?\(\)\\\/"'=.£%&+\@#°_ 0-9] - All the characters that could be contained within the source language string. Again, backslashes precede characters that otherwise would have special meaning within the wildcard search. The square brackets are used again to group all the possible characters.

    Now, let's explain a little further these "all possible characters":

    • A-Za-z - All alphabetical characters
    • ,;: - Comma, semi-colon and colon
    • \-\*\!\?\(\) - Various punctuation and symbol marks (-*!?(), each preceded by the backslash to indicate it has to be taken literally)
    • \\ - The backslash "\" symbol itself (when it is doubled thus, the first backslash indicates that the second one is to be taken literally)
    • \/ - The forward slash "/"
    • "'=.£%&+\@#°_ - Various other punctuation and other symbols (double-quote, single-quote, equal sign, full stop, etc., up to the underscore sign "_")
    • " " - The space character
    • 0-9 - All numerical characters
    Some of these "special characters" might not be necessary: it depends on whether they could actually be present within an index entry. However, if I have forgotten to include any character that actually occurs within an index entry, my search will not work properly: it will not find that entry, because the match fails at the first unrecognized character.
  • {1,255} - Here is one of the special uses of the curly brackets within wildcard searches: they indicate how many characters (any combination of the previously listed ones, from "A-Z" through "0-9") the previous part of the search can contain. "1,255" means "from a single character through the maximum allowed" (which unfortunately is only 255).
  • \<\}[0-9]{1,3}\{\> - Trados markup to indicate the end of the source language string and the beginning of the target language.

    • \<\} - Beginning of the markup used by Trados between SL and TL
    • [0-9] - Indicates that the markup may contain here any number
    • {1,3} - Indicates that the number contained in the markup may have from one to three digits (in fact, the number is between 0 and 100)
    • \{\> - End of the markup used by Trados between SL and TL
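
The behaviour of this middle part can be checked with a short Python sketch. This is an approximation under stated assumptions, not Word itself: the pattern is rewritten for Python's re module (which needs fewer backslashes inside the square brackets), and the sample segments and the names SOURCE_CHARS, SOURCE_PART and samples are invented for the illustration.

```python
import re

# Same character list as in the Word search. Inside a Python character
# class only "-" and "\" still need a backslash.
SOURCE_CHARS = r"""[A-Za-z,;:\-*!?()\\/"'=.£%&+@#°_ 0-9]"""

# Word:  \{0\>[...]{1,255}\<\}[0-9]{1,3}\{\>   rewritten for Python
SOURCE_PART = re.compile(r"\{0>" + SOURCE_CHARS + r"{1,255}<\}[0-9]{1,3}\{>")

# Hypothetical source segments followed by the SL/TL separator
samples = {
    "plain": "{0>Program Installation<}100{>",
    "accented": "{0>Café setup<}100{>",         # "é" is not in the list
    "too_long": "{0>" + "a" * 300 + "<}100{>",  # more than 255 characters
}

results = {name: bool(SOURCE_PART.search(text)) for name, text in samples.items()}
print(results)  # {'plain': True, 'accented': False, 'too_long': False}
```

The second and third samples show the two weak points mentioned above: a single character missing from the list, or a source segment longer than 255 characters, is enough for the entry to be silently skipped.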

Finally we need to indicate that we are looking only for those index entries in which the target language string begins with an upper case letter:

[A-Z] - That is "All upper case alphabetical characters between 'A' and 'Z'"

Our complete search string will therefore be:

\<i[el]\>\{0\>[A-Za-z,;:\-\*\!\?\(\)\\\/"'=.£%&+\@#°_ 0-9]{1,255}\<\}[0-9]{1,3}\{\>[A-Z]

This needs to be typed exactly as is in Word's search dialog, with the "Use wildcards" option selected.
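
To see the whole search string at work on some text, here is a rough Python equivalent. It is only a sketch: it relies on Python's re module rather than on Word, the sample segments are invented (including the closing "<0}" marker at the end of each target), and flag_capitalized_entries is a hypothetical helper name.

```python
import re

SOURCE_CHARS = r"""[A-Za-z,;:\-*!?()\\/"'=.£%&+@#°_ 0-9]"""

# Complete search string, rewritten for Python. The parentheses capture
# the English source text so that it can be reported back.
FULL_SEARCH = re.compile(
    r"<i[el]>\{0>(" + SOURCE_CHARS + r"{1,255})<\}[0-9]{1,3}\{>[A-Z]"
)

def flag_capitalized_entries(text):
    """Return the source text of index entries whose target starts upper case."""
    return [m.group(1) for m in FULL_SEARCH.finditer(text)]

# Invented sample: three index entries and one section title
sample = "\n".join([
    "<ie>{0>Program Installation<}100{>Installazione programma<0}",
    "<il>{0>Uninstalling<}98{>disinstallazione<0}",
    "<ie>{0>Trados Setup<}0{>Configurazione di Trados<0}",
    "{0>Program Installation<}100{>Installazione programma<0}",
])

flagged = flag_capitalized_entries(sample)
print(flagged)  # ['Program Installation', 'Trados Setup']
```

The second entry is ignored because its translation already begins with a lower case letter, and the last line is ignored because it is not preceded by <ie> or <il>. Each flagged entry still needs a human decision, just as in Word, since some index entries are program names that must keep their capital letter.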

I keep a text file with all the wildcard search strings I know I'm going to use in the future, and when I need them I copy from the text file to Word's search dialog, and I suggest doing the same if you start using wildcard searches.

Wildcard searches are probably not for everybody: they look cryptic, may be very complicated, and usually take a fair amount of time to get right. On the other hand, as we have seen, they may help solve problems that would be difficult to solve any other way.

If you are interested in more information about wildcard searches, my previous post contained some references. In addition to those, I suggest a book on regular expressions that has been published recently, and that contains an entire chapter devoted to wildcard searches in MS Word: Andrew Watt's Beginning Regular Expressions, published by Wrox.

2 comments:

  1. Hi there,

    Thank you for making this post. I need to solve a similar, although much simpler, problem. I use Trados to translate xml file contents from Japanese to English. The contents of the files are compiled in an MS Word document and we then translate them into English. As you probably already know, .xml titles MUST follow a given protocol for capitalization. Some of the titles must be UPPER CASE while others must be Cap and Low. At present what we do is check our titles that have to be in the same format.

    Is there any way to use a wildcard search to find entries that are:

    1. Contained within a specific tag (html or xml).

    2. Exclusively ALL CAPS or exclusively Cap And Low

    ?

    I would really appreciate your help with this.

    Sincerely,

    Miguel

  2. Ok.... it's simple wow.... I tried it and it totally works. Bravo.
