The BERT family of neural language models has become highly popular because these models provide rich, context-sensitive token encodings for sequences of text that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT (mBERT) and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary sizes, and choices of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with those of fine-tuning an mBERT model on the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
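As a pointer for readers who want to try the released model, the sketch below shows how a BERT-style model and its subword tokeniser can be loaded with the Hugging Face transformers library to obtain context-sensitive token encodings. The model identifier used here is an assumption, not something stated on this page; consult the official gaBERT release for the correct name.

# Minimal usage sketch with the Hugging Face transformers library.
# NOTE: the model identifier below is an assumption and may differ from the
# official gaBERT release name.
from transformers import AutoTokenizer, AutoModel

model_name = "DCU-NLP/bert-base-irish-cased-v1"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode an Irish sentence and obtain contextual token representations.
sentence = "Is teanga Cheilteach í an Ghaeilge."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch_size, num_tokens, hidden_size)
print(outputs.last_hidden_state.shape)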
Science Foundation Ireland (Grant 13/RC/2106), European Regional Development Fund, Irish Government Department of Culture, Heritage and the Gaeltacht, Science Foundation Ireland (SFI) Frontiers for the Future programme (19/FFP/6942), Science Foundation Ireland (SFI) Centre for Research Training in Machine Learning (18/CRT/6183)