Tuesday, May 28, 2013

Simple regular expressions for SDL Trados Studio filters


Regular expressions (regex for short) are very useful for searching, replacing and filtering information, and are increasingly available in many applications, including SDL Trados Studio. (SDL's Paul Filkin has several articles in his Multifarious blog about sophisticated uses of regular expression searches in Studio, for example "Regular Expressions - Part 1" and "Regex… and 'economy of accuracy'".)

Regular expressions, though, also suffer from a reputation of being difficult to learn and to understand. This reputation is well deserved: no matter how useful regular expressions may be, nobody can say that something that looks like "\b(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}\b" is simple, easy to understand, or easy to construct.

Many people, therefore, after taking a look at regular expressions, decide they are not for them: they look too difficult. But while sophisticated uses of regular expressions do tend to look forbidding, certain regular expressions are simple and amazingly useful.

Let's see an example: say that you are translating a long document about painting systems, and that you want to check all the segments in which the term "topcoat" appears. Since Studio has a very useful filter feature, you know you can enter the word "topcoat" in the filter, and obtain all the segments in which it appears.

Unfortunately, though, you noticed during your translation that the source text is not very consistent: sometimes "topcoat" is written as a single word, sometimes as two separate words ("top coat"), and sometimes the author used a mid-way solution and hyphenated the term ("top-coat"). You can certainly use the filter three times, entering the three different versions of the term, and find all the segments that contain each. But with regular expressions it is also possible to do it all at once: use a single search expression to find "topcoat", "top coat" and "top-coat".

To do so, enter top.?coat in the filter.

What does this regex string do?

It searches for all terms that contain the sequence of letters "top", followed by any character (the dot) repeated zero or one times (the question mark), followed by the sequence of letters "coat".
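If you want to try the expression outside Studio, here is a minimal sketch in Python. It is only an approximation of what the filter does: Studio uses its own regex engine, and the sample segments below are invented for illustration. For a pattern this simple, though, the dot and the question mark should behave the same way.

```python
import re

# The pattern from the post: "top", then zero or one of any character, then "coat".
pattern = re.compile(r"top.?coat")

# Hypothetical source segments, made up for this example.
segments = [
    "Apply the topcoat after the primer has dried.",
    "The top coat must be sprayed evenly.",
    "Sand lightly before the top-coat is applied.",
    "Use a primer on bare metal.",
]

# Keep only the segments in which the pattern is found somewhere.
matching = [s for s in segments if pattern.search(s)]

for s in matching:
    print(s)
```

The first three segments are kept, one for each spelling of the term, and the last one is filtered out.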

By using an expression that matches any character, we find the instances in which "top" and "coat" are separated by a hyphen or a space. By telling it to match that character zero or one times, we also find the instances in which "top" and "coat" are written as a single word, while excluding longer strings in which "top" and "coat" are separated by more than one character. We do not want the filter to also return "when you paint the top, make sure you are not coating the sides as well", which is what we would get if we had used top.*coat in the filter instead.
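The difference between the two expressions can be checked with a short Python sketch. Again, this assumes that Python's regex engine treats these two simple patterns the same way Studio's filter does; the variable names are just for illustration.

```python
import re

# "top", at most one character of any kind, then "coat".
narrow = re.compile(r"top.?coat")
# "top", any number of characters, then "coat".
broad = re.compile(r"top.*coat")

# The unwanted sentence quoted in the post.
sentence = "when you paint the top, make sure you are not coating the sides as well"

narrow_hit = narrow.search(sentence) is not None
broad_hit = broad.search(sentence) is not None

print("top.?coat matches:", narrow_hit)
print("top.*coat matches:", broad_hit)
```

Here top.?coat does not match the sentence, while top.*coat does, because the asterisk lets the dot stretch all the way from "top" to the "coat" in "coating".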

More powerful regular expressions may look difficult - but you can start using simple ones, which are nonetheless very useful.

4 comments:

  1. Great post, thank you so much! I'm one of those people who is reasonably tech-savvy but constantly baffled by regular expressions (such as the example you gave!) This is a great demonstration of how to write simple ones, and how they could be very useful!!

  2. Thanks a lot for this! Despite my background in IT and programming, I often hit trouble with regex - especially as not all software that allow regex work the same way.
