Automatic syntactic parsing of user-generated content in Modern Irish poses significant challenges due to the language’s minority status and limited linguistic resources. In this thesis, we present TwittIrish, the first Universal Dependencies treebank of tweets in Irish, a linguistically-informed, genre-specific dataset developed via a cycle of automatic syntactic annotation and manual correction. We use this novel resource to document and quantify the linguistic differences between Irish tweets and standardised Irish text with regard to orthography, morphology, lexicon, and syntax. We provide examples of linguistic features observed in the tweets and describe how we have chosen to represent them within the Universal Dependencies framework. Furthermore, we utilise the TwittIrish dataset to establish baseline parsing results and explore methods to increase parsing accuracy. We show that the use of monolingual Irish BERT embeddings provides a significant improvement over baseline results. Our error analysis shows that language contact phenomena constitute one of the greatest challenges associated with processing informal Irish text. We therefore extend our analysis of user-generated content to examine language contact in Irish-language tweets. Due to centuries of contact with English, code-switching, borrowing, and other language contact phenomena are frequent in informal Irish. We investigate the perceptions of Irish speakers with regard to language contact in the Irish-English language pair. Furthermore, we assess the advantages and disadvantages of distinguishing between code-switching and borrowing in the context of resource development for natural language processing. Our research contributes to language technology support for a low-resource language by providing a novel dataset and facilitating more accurate dependency parsing of informal Irish.
Additionally, the exploration of linguistic features of Irish-language tweets extends the impact of this research to linguistics, sociolinguistics, and the Irish-language community more broadly by enhancing the general understanding of the use of Irish on social media.
Metadata
Item Type: Thesis (PhD)
Date of Award: March 2024
Refereed: No
Supervisor(s): Foster, Jennifer and Lynn, Teresa
Uncontrolled Keywords: Irish Natural Language Processing, Dependency Parsing, Irish Social Media Analysis