
DORAS | DCU Research Repository


A tool for facilitating OCR postediting in historical documents

Poncelas, Alberto (ORCID: 0000-0002-5089-1687), Aboomar, Mohammad (ORCID: 0000-0002-1391-5061), Buts, Jan (ORCID: 0000-0002-7657-804X), Hadley, James (ORCID: 0000-0003-1950-2679) and Way, Andy (ORCID: 0000-0001-5736-5930) (2020) A tool for facilitating OCR postediting in historical documents. In: Workshop on Language Technologies for Historical and Ancient Languages, LT4HALA (2020), 11-16 May 2020, Marseille, France.

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
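
The correction step described in the abstract (flag word forms missing from a vocabulary, then choose a replacement by LM score) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how candidates are generated or which LM is used, so the single-edit candidate generator `edits1`, the toy add-one-smoothed `BigramLM`, and the function name `postedit` are all assumptions made for illustration. The sketch also lowercases tokens for simplicity.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one deletion, replacement, or insertion away (assumed candidate generator)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | replaces | inserts


class BigramLM:
    """Toy bigram model with add-one smoothing; a stand-in for a real LM."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, prev, word):
        # Smoothed log P(word | prev)
        num = self.bigrams[(prev, word)] + 1
        den = self.unigrams[prev] + self.vocab_size
        return math.log(num / den)


def postedit(tokens, vocab, lm):
    """Replace out-of-vocabulary tokens with the in-vocabulary candidate the LM scores highest."""
    out = []
    prev = "<s>"
    for tok in tokens:
        word = tok.lower()
        if word not in vocab:
            # Sorted so that ties are broken deterministically
            candidates = sorted(edits1(word) & vocab)
            if candidates:
                word = max(candidates, key=lambda c: lm.logprob(prev, c))
        out.append(word)
        prev = word
    return out


# Hypothetical clean text used to build both the vocabulary and the LM
corpus = ["the trade of this kingdom", "employing the poor of this kingdom"]
vocab = {w for s in corpus for w in s.lower().split()}
lm = BigramLM(corpus)
print(postedit("tbe trade of tbis kingdom".split(), vocab, lm))
```

Tokens with no in-vocabulary candidate are left unchanged, which keeps the output open to inspection and manual correction, in line with the abstract's emphasis on human intervention.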
Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Additional Information: Colocated with LREC 2020 Workshop Language Resources and Evaluation Conference. Due to the COVID-19 pandemic, the workshop will not take place. However, the proceedings are published online.
Subjects: Computer Science > Digital electronics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in: Sprugnoli, Rachele and Passarotti, Marco (eds.) Proceedings of 1st Workshop on Language Technologies for Historical and Ancient Languages. LREC.
Official URL: https://aclanthology.org/2020.lt4hala-1.7.pdf
Copyright Information: © 2020 The Authors
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Irish Research Council’s COALESCE scheme (COALESCE/2019/117), SFI Research Centres Programme (Grant 13/RC/2106)
ID Code: 24441
Deposited On: 11 May 2020 15:11 by Alberto Poncelas. Last Modified 07 Jan 2022 16:41


