The BERT family of neural language models has become highly popular because these models provide rich, context-sensitive token encodings for sequences of text that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT (mBERT) and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary sizes, and choices of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with those of fine-tuning an mBERT model on the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
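As a pointer for readers who want to try the released model, the sketch below shows how a BERT-style model and its subword tokeniser can be loaded with the Hugging Face transformers library to obtain context-sensitive token encodings. The model identifier used here is an assumption, not something stated on this page; consult the official gaBERT release for the correct name.

# Minimal usage sketch with the Hugging Face transformers library.
# NOTE: the model identifier below is an assumption and may differ from the
# official gaBERT release name.
from transformers import AutoTokenizer, AutoModel

model_name = "DCU-NLP/bert-base-irish-cased-v1"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode an Irish sentence and obtain contextual token representations.
sentence = "Is teanga Cheilteach í an Ghaeilge."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch_size, num_tokens, hidden_size)
print(outputs.last_hidden_state.shape)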
Science Foundation Ireland (Grant 13/RC/2106), European Regional Development Fund, Irish Government Department of Culture, Heritage and the Gaeltacht, Science Foundation Ireland (SFI) Frontiers for the Future programme (19/FFP/6942), Science Foundation Ireland (SFI) Centre for Research Training in Machine Learning (18/CRT/6183)