DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

Integrating optical character recognition and machine translation of historical documents

Afli, Haithem (ORCID: 0000-0002-7449-4707) and Way, Andy (ORCID: 0000-0001-5736-5930) (2016) Integrating optical character recognition and machine translation of historical documents. In: COLING, the 26th International Conference on Computational Linguistics, 13-16 Dec 2016, Osaka, Japan.

Abstract
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in formats that are not accessible to machine processing (e.g., historical or legal documents). Such documents must be passed through Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework that adds an OCR error correction module to enhance the overall quality of translation. Experiments show that our new correction system, based on a combination of language modeling and translation methods, outperforms the baseline system with a relative improvement of nearly 30%.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in: Hinrichs, Erhard, Hinrichs, Marie and Trippel, Thorsten (eds.) Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). COLING 2016 Organizing Committee.
Publisher: COLING 2016 Organizing Committee
Official URL: https://www.aclweb.org/anthology/W16-4015
Copyright Information: © the ACL
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland in the ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University.
ID Code: 23243
Deposited On: 02 May 2019 14:48 by Thomas Murtagh. Last Modified 16 May 2019 11:05
Documents

Full text available as: Integrating Optical Character Recognition and Machine Translation of Historical Documents.pdf (PDF, 378kB)