Barry, James, Wagner, Joachim ORCID: 0000-0002-8290-3849, Cassidy, Lauren, Cowap, Alan ORCID: 0000-0002-6300-6034, Lynn, Teresa, Walsh, Abigail, Ó Meachair, Mícheál J. ORCID: 0000-0003-3931-5571 and Foster, Jennifer ORCID: 0000-0002-7789-4853 (2022) gaBERT - an Irish language model. In: 13th Conference on Language Resources and Evaluation (LREC 2022), 20-25 June 2022, Marseille, France.
Abstract
The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Uncontrolled Keywords: | BERT, Irish |
Subjects: | Humanities > Irish language; Humanities > Linguistics; Social Sciences > Educational technology |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; DCU Faculties and Schools > Faculty of Humanities and Social Science > Fiontar agus Scoil na Gaeilge; Research Institutes and Centres > ADAPT |
Published in: | Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022). Association for Computational Linguistics (ACL). |
Publisher: | Association for Computational Linguistics (ACL) |
Official URL: | https://aclanthology.org/2022.lrec-1.511/ |
Copyright Information: | © European Language Resources Association (ELRA) |
Funders: | Science Foundation Ireland (SFI) through ADAPT (Grant 13/RC/2106); European Regional Development Fund; Science Foundation Ireland Frontiers for the Future programme (19/FFP/6942); Science Foundation Ireland SFI Centre for Research Training in Machine Learning (18/CRT/6183); Irish Government Department of Culture, Heritage and the Gaeltacht under the GaelTech Project |
ID Code: | 29098 |
Deposited On: | 29 Sep 2023 12:38 by Vidatum Academic. Last Modified 29 Sep 2023 12:38 |
Documents
Full text available as:
PDF (963kB), Creative Commons: Attribution-Noncommercial 4.0