DORAS | DCU Research Repository

Using SMT for OCR error correction of historical texts

Afli, Haithem (ORCID: 0000-0002-7449-4707), Qui, Zhengwei, Way, Andy (ORCID: 0000-0001-5736-5930) and Sheridan, Páraic (2016) Using SMT for OCR error correction of historical texts. In: Tenth International Conference on Language Resources and Evaluation (LREC 2016), 23-28 May 2016, Portorož, Slovenia. ISBN 978-2-9517408-9-1

Abstract
With the advent of digital optical scanners, a trend to digitize historical paper-based archives has emerged in recent years. Many paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, the output text of an OCR system can contain different kinds of errors. Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement compared to the initial baseline.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: Optical Character Recognition; Language Modelling; SpeechToSpeech Translation
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT; Research Institutes and Centres > Centre for Next Generation Localisation (CNGL)
Published in: Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry and Goggi, Sara (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resource Association. ISBN 978-2-9517408-9-1
Publisher: European Language Resource Association
Copyright Information: © 2016 ELRA
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland through the TIDA Programme (Grant 14/TIDA/2384), ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University
ID Code: 23226
Deposited On: 02 May 2019 08:35 by Thomas Murtagh. Last Modified 16 May 2019 11:05
Documents

Full text available as:

PDF: Using SMT for OCR Error Correction of Historical Texts.pdf (565kB)