Tuesday, August 02, 2022

MTPE of Poor Quality Source Texts: Some Practical Suggestions

To achieve the best MT results, first correct the source text whenever it comes from a scanned hard copy or an automatic transcription of recorded speech. Here are a few practical suggestions:

  • Choose the correct settings before running OCR. In particular, select the correct source language (you’ll see better suggestions during verification), the correct image resolution for each page, and the correct text direction for each block of text; de-skew and clean each page that requires it. Have someone familiar with both the source language and the subject run the verification.
  • Correct misspelled or wrongly transcribed words.
  • Add “[sic]” after any word that you cannot identify and that you suspect is an artifact of the OCR process. This helps the post-editor focus on problem areas.
  • Capitalize proper nouns and acronyms.
  • Lowercase incorrectly capitalized words.
  • Reassemble sentences broken up by paragraph marks (hard returns) or line breaks (soft returns).
  • Feed the source text to the MT engine only after completing these corrections; otherwise, the output will be substandard and will take longer to post-edit.
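
The step of reassembling broken sentences can be partly automated. The following is a minimal Python sketch, not a definitive tool: the function name `rejoin_broken_lines` and the rule it applies are assumptions made for illustration. It treats any line break that does not follow a period, question mark, or exclamation mark as spurious and merges the lines on either side of it. Headings and list items without final punctuation would be merged too, so a human still needs to review the result.

```python
# Punctuation assumed to mark the end of a sentence (Latin-script texts only).
SENTENCE_END = (".", "!", "?")


def rejoin_broken_lines(text: str) -> str:
    """Merge lines split mid-sentence by stray hard or soft returns.

    A break is kept only when the text before it ends with
    sentence-ending punctuation; blank lines are dropped.
    """
    merged = []
    current = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # Blank line (hard return): skip it and let the rule below decide.
            continue
        if current and not current.endswith(SENTENCE_END):
            # Previous chunk is an unfinished sentence: glue this line onto it.
            current = current + " " + line
        else:
            # Previous chunk is complete (or there is none yet): start a new one.
            if current:
                merged.append(current)
            current = line
    if current:
        merged.append(current)
    return "\n".join(merged)


if __name__ == "__main__":
    sample = "The contract was signed\nby both\n\nparties in March.\nPayment is due."
    print(rejoin_broken_lines(sample))
```

Run on the sample, the first sentence is put back on a single line and the second stays on its own line. Abbreviations that end a line (such as “Mr.”) would wrongly be read as sentence endings, which is one more reason to treat the output as a draft for the pre-editor rather than a finished source text.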

When the source text is good, you can skip pre-editing. When it is questionable or poor, pre-editing improves the resulting machine translation and helps the post-editor achieve the desired quality.

3 comments:

  1. The best MT results are usually achieved by skipping the step involving the use of MT :-)

  2. What about quality measurement practices prior to doing post-editing for translation?

    1. Hi Ed, if you mean checking the quality of the MT output before starting post-editing, yes, I think that is advisable... especially if you can do that before confirming acceptance of the job.

