Lankford, Séamus, Afli, Haithem ORCID: 0000-0002-7449-4707, Ní Loinsigh, Orla and Way, Andy ORCID: 0000-0001-5736-5930 (2022) gaHealth: An English–Irish bilingual corpus of health data. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 20-25 June 2022, Marseille, France.
Abstract
Machine Translation is a mature technology for many high-resource language pairs. However, in the context of low-resource languages, there is a paucity of parallel datasets available for developing translation models. Furthermore, the development of datasets for low-resource languages often focuses on simply creating the largest possible dataset for generic translation. The benefits and development of smaller in-domain datasets can easily be overlooked. To assess the merits of using in-domain data, a dataset for the specific domain of health was developed for the low-resource English to Irish language pair. Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for the health domain. In the context of translating health-related data, models developed using the gaHealth corpus demonstrated a maximum BLEU score improvement of 22.2 points (40%) when compared with top-performing models from the LoResMT2021 Shared Task. Furthermore, we define linguistic guidelines for developing gaHealth, the first bilingual corpus of health data for the Irish language, which we hope will be of use to other creators of low-resource datasets. gaHealth is now freely available online and is ready to be explored for further research.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Uncontrolled Keywords: | Health data; parallel corpus; machine translation; Irish |
Subjects: | Computer Science > Computational linguistics; Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT |
Published in: | Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association (ELRA). |
Publisher: | European Language Resources Association (ELRA) |
Official URL: | http://www.lrec-conf.org/proceedings/lrec2022/pdf/... |
Copyright Information: | © 2022 European Language Resources Association (ELRA) |
Funders: | Science Foundation Ireland through ADAPT Centre (Grant 13/RC/2106), Munster Technological University, National Relay Station (NRS) of Ireland |
ID Code: | 28339 |
Deposited On: | 18 May 2023 12:09 by Seamus Lankford. Last Modified 19 May 2023 11:29 |
Documents
Full text available as:
PDF (359kB), Creative Commons: Attribution-Noncommercial 4.0