Ó Raghallaigh, Brian (ORCID: 0000-0003-3813-1949), Palandri, Andrea and MacCárthaigh, Críostóir (2022)
Handwritten Text Recognition (HTR) for Irish-Language Folklore.
In: 4th Celtic Language Technology Workshop within LREC2022, 20-25 June, 2022, Marseilles, France.
Abstract
In this paper we present our method for digitising a large collection of handwritten Irish-language texts as part of a project to mine information from a large corpus of Irish and Scottish Gaelic folktales. The handwritten texts form part of the Main Manuscript Collection of the National Folklore Collection of Ireland and contain handwritten transcriptions of oral folklore collected in Ireland in the 20th century. With the goal of creating a large text corpus of the Irish-language folktales contained within this collection, our method involves scanning the pages of the physical volumes and digitising the text on these pages using Transkribus, a platform for the recognition of historical documents. Given the nature of the collection, the approach we have taken involves the creation of individual text recognition models for multiple collectors' hands. This approach was motivated by the fact that a relatively small number of collectors contributed the bulk of the material, while the differences between collectors in terms of style, layout and orthography were difficult to reconcile within a single handwriting model. We present our preliminary results along with a discussion of the viability of using crowdsourced correction to improve our HTR models.
Metadata
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Event Type: | Conference |
| Refereed: | Yes |
| Uncontrolled Keywords: | Digital folkloristics, handwritten text recognition, Irish language |
| Subjects: | Humanities > Irish language; Humanities > Language; Humanities > Linguistics |
| DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Humanities and Social Science; DCU Faculties and Schools > Faculty of Humanities and Social Science > Fiontar agus Scoil na Gaeilge |
| Published in: | Fransen, Theodorus, Lamb, William and Prys, Delyth (eds.) Proceedings of the 4th Celtic Language Technology Workshop within LREC2022. European Language Resources Association. |
| Publisher: | European Language Resources Association |
| Official URL: | https://aclanthology.org/2022.cltw-1.17/ |
| Copyright Information: | Authors |
| ID Code: | 32181 |
| Deposited On: | 19 Jan 2026 11:15 by Andrea Palandri. Last Modified 19 Jan 2026 11:15 |
Documents
Full text available as:
PDF (610kB). Creative Commons: Attribution-Noncommercial 4.0