DORAS | DCU Research Repository

gaBERT - an Irish language model

Barry, James, Wagner, Joachim (ORCID: 0000-0002-8290-3849), Cassidy, Lauren, Cowap, Alan (ORCID: 0000-0002-6300-6034), Lynn, Teresa, Walsh, Abigail, Ó Meachair, Mícheál J. (ORCID: 0000-0003-3931-5571) and Foster, Jennifer (ORCID: 0000-0002-7789-4853) (2022) gaBERT - an Irish language model. In: 13th Conference on Language Resources and Evaluation (LREC 2022), 20-25 June 2022, Marseille, France.

Abstract
The BERT family of neural language models has become highly popular because these models provide sequences of text with rich context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with those of fine-tuning an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: BERT, Irish
Subjects: Humanities > Irish language
Humanities > Linguistics
Social Sciences > Educational technology
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
DCU Faculties and Schools > Faculty of Humanities and Social Science > Fiontar agus Scoil na Gaeilge
Research Institutes and Centres > ADAPT
Published in: Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022). Association for Computational Linguistics (ACL).
Publisher: Association for Computational Linguistics (ACL)
Official URL: https://aclanthology.org/2022.lrec-1.511/
Copyright Information: © European Language Resources Association (ELRA)
Funders: Science Foundation Ireland (SFI) through the ADAPT (Grant 13/RC/2106), European Regional Development Fund, Science Foundation Ireland Frontiers for the Future programme (19/FFP/6942), Science Foundation Ireland SFI Centre for Research Training in Machine Learning (18/CRT/6183), Irish Government Department of Culture, Heritage and the Gaeltacht under the GaelTech Project.
ID Code: 29098
Deposited On: 29 Sep 2023 12:38 by Vidatum Academic. Last Modified: 29 Sep 2023 12:38
Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution-Noncommercial 4.0
963kB