Integrating optical character recognition and machine translation of
historical documents
Afli, Haithem (ORCID: 0000-0002-7449-4707) and Way, Andy (ORCID: 0000-0001-5736-5930)
(2016)
Integrating optical character recognition and machine translation of
historical documents.
In: COLING, the 26th International Conference on Computational Linguistics, 13-16 Dec 2016, Osaka, Japan.
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry.
However, many valuable documents, including digital documents, are encoded in formats that are
not accessible to machine processing (e.g., historical or legal documents). Such documents must be
passed through a process of Optical Character Recognition (OCR) to render the text suitable for
MT. No matter how good the OCR is, this process introduces recognition errors, which often
render MT ineffective. In this paper, we propose a new OCR-to-MT framework that adds
an OCR error correction module to enhance the overall quality of translation. Experiments show
that our new correction system, based on a combination of language modeling and translation
methods, achieves a relative improvement of nearly 30% over the baseline system.
Hinrichs, Erhard, Hinrichs, Marie and Trippel, Thorsten (eds.)
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH).
COLING 2016 Organizing Committee.