Afli, Haithem ORCID: 0000-0002-7449-4707 and Way, Andy ORCID: 0000-0001-5736-5930 (2016) Integrating optical character recognition and machine translation of historical documents. In: COLING, the 26th International Conference on Computational Linguistics, 13-16 Dec 2016, Osaka, Japan.
Abstract
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry.
However, many valuable documents, including digital documents, are encoded in formats that are
not accessible to machine processing (e.g., historical or legal documents). Such documents must be
passed through a process of Optical Character Recognition (OCR) to render the text suitable for
MT. No matter how good the OCR is, this process introduces recognition errors, which often
render MT ineffective. In this paper, we propose a new OCR-to-MT framework that adds
an OCR error correction module to enhance the overall quality of translation. Experiments show that our new correction system, based on a combination of language modeling and
translation methods, achieves a relative improvement of nearly 30% over the baseline system.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing Research Institutes and Centres > ADAPT |
Published in: | Hinrichs, Erhard, Hinrichs, Marie and Trippel, Thorsten (eds.) Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). COLING 2016 Organizing Committee. |
Publisher: | COLING 2016 Organizing Committee |
Official URL: | https://www.aclweb.org/anthology/W16-4015 |
Copyright Information: | © the ACL |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland in the ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University. |
ID Code: | 23243 |
Deposited On: | 02 May 2019 14:48 by Thomas Murtagh. Last Modified 16 May 2019 11:05 |
Documents
Full text available as: PDF (378kB)