Tuesday, August 02, 2022

MTPE of Poor Quality Source Texts: Some Practical Suggestions

To achieve the best MT results, first correct the source text whenever it comes from a scanned hard copy or an automatic transcription of recorded speech. Here are a few practical suggestions:

  • Choose the correct settings before running OCR. In particular, select the correct source language (you’ll see better suggestions during verification), select the correct graphic resolution for each page and the correct text direction for each piece of text, and de-skew and clean each page that requires it. Verification should be run by someone familiar with both the source language and the subject.
  • Correct misspelled or wrongly transcribed words.
  • Add “[sic]” after any word that you cannot identify and that you suspect is an artifact of the OCR process. This helps the post-editor focus on problem areas.
  • Capitalize proper nouns and acronyms.
  • Lowercase incorrectly capitalized words.
  • Reassemble sentences broken up by paragraph breaks (hard returns) or line breaks (soft returns).
  • Feed the source text to the MT engine only after completing these corrections; otherwise, the output will be substandard and will take longer to post-edit.
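
Part of the sentence-reassembly step can be automated before a human checks the result. The following is a minimal Python sketch, not a definitive tool: the function name `reassemble_sentences` and the `SENTENCE_END` list are hypothetical, and the heuristic assumes that a line which does not end in sentence-final punctuation continues on the next line. It will wrongly merge headings and list items into the following text, so the output still needs review.

```python
# Punctuation assumed to end a sentence (an assumption; adjust per source language).
SENTENCE_END = (".", "!", "?", ":")


def reassemble_sentences(text: str) -> str:
    """Join lines that were split mid-sentence by stray line or paragraph breaks."""
    sentences = []
    buffer = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line alone does not prove that the sentence has ended.
            continue
        buffer = f"{buffer} {line}" if buffer else line
        if buffer.endswith(SENTENCE_END):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        # Keep any trailing fragment rather than silently dropping it.
        sentences.append(buffer)
    return "\n".join(sentences)


if __name__ == "__main__":
    sample = "The engine was\nserviced in\n\nMarch. It runs well.\nNext step"
    print(reassemble_sentences(sample))
```

Each output line ends where the heuristic believes a sentence ends, which gives the MT engine complete sentences to work with and leaves any unterminated fragment on its own line for the pre-editor to inspect.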

When the source text is good, you can skip pre-editing. When it is questionable or poor, however, pre-editing improves the quality of the resulting machine translation and helps the post-editor achieve the desired quality.