A tool for facilitating OCR postediting in historical documents

Poncelas, Alberto; Aboomar, Mohammad; Buts, Jan; Hadley, James; Way, Andy

Poncelas, Alberto ORCID: 0000-0002-5089-1687, Aboomar, Mohammad ORCID: 0000-0002-1391-5061, Buts, Jan ORCID: 0000-0002-7657-804X, Hadley, James ORCID: 0000-0003-1950-2679 and Way, Andy ORCID: 0000-0001-5736-5930 (2020) A tool for facilitating OCR postediting in historical documents. In: Workshop on Language Technologies for Historical and Ancient Languages, LT4HALA (2020), 11-16 May 2020, Marseille, France.


Abstract

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book "An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom". As demonstrated in the paper, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
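
The workflow the abstract outlines (flag word forms missing from a vocabulary, propose alternatives, keep the one a language model scores highest) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the confusion pairs in CONFUSIONS, the add-one smoothed bigram scorer, and the function names candidates, bigram_logprob, and postedit are all assumptions made for illustration. The returned log of replacements is one possible way to keep the process open to human review.

```python
import math
from collections import Counter

# Toy training text standing in for a real language-model corpus (assumption).
CORPUS = "the poor of this kingdom must be employed in the trade of this kingdom".split()

VOCAB = set(CORPUS)
UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))

# Hypothetical OCR confusion pairs (recognised -> intended); not taken from the paper.
CONFUSIONS = [("f", "s"), ("ſ", "s"), ("rn", "m"), ("1", "l"), ("0", "o")]


def candidates(word):
    """Return in-vocabulary forms reachable by one confusion-pair substitution."""
    found = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            variant = word[:start] + right + word[start + len(wrong):]
            if variant in VOCAB:
                found.add(variant)
            start = word.find(wrong, start + 1)
    return found


def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev) from the toy counts."""
    return math.log((BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + len(VOCAB)))


def postedit(tokens):
    """Replace each out-of-vocabulary token with its best-scoring candidate, if any.

    Returns the edited tokens plus a log of (original, replacement) pairs
    so that a human can inspect or revert every change.
    """
    output, log = [], []
    for token in tokens:
        if token in VOCAB:
            output.append(token)
            continue
        options = candidates(token)
        if not options:
            # No plausible alternative: leave the token for manual correction.
            output.append(token)
            continue
        prev = output[-1] if output else "<s>"
        best = max(sorted(options), key=lambda c: bigram_logprob(prev, c))
        log.append((token, best))
        output.append(best)
    return output, log


edited, changes = postedit("the poor of thif kingdorn".split())
print(" ".join(edited))
print(changes)
```

A real setup would replace the toy corpus with a period-appropriate vocabulary and a stronger language model, and would likely generate candidates more broadly (for example by edit distance) than this fixed substitution list does.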

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Workshop
Refereed:	Yes
Additional Information:	Workshop colocated with LREC 2020 (Language Resources and Evaluation Conference). Due to the COVID-19 pandemic, the workshop will not take place. However, the proceedings are published online.
Subjects:	Computer Science > Digital electronics
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing Research Institutes and Centres > ADAPT
Published in:	Sprugnoli, Rachele and Passarotti, Marco (eds.) Proceedings of 1st Workshop on Language Technologies for Historical and Ancient Languages. LREC.
Publisher:	LREC
Official URL:	https://aclanthology.org/2020.lt4hala-1.7.pdf
Copyright Information:	© 2020 The Authors
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	Irish Research Council’s COALESCE scheme (COALESCE/2019/117), SFI Research Centres Programme (Grant 13/RC/2106)
ID Code:	24441
Deposited On:	11 May 2020 15:11 by Alberto Poncelas. Last Modified 07 Jan 2022 16:41

Documents

Full text available as: PDF (232kB)
