Part-of-speech tagging and partial parsing for Irish using finite-state transducers and constraint grammar

Uí Dhonnchadha, Elaine (2009) Part-of-speech tagging and partial parsing for Irish using finite-state transducers and constraint grammar. PhD thesis, Dublin City University.


In this thesis, we present the development and evaluation of a suite of annotation tools for unrestricted Irish text, ranging from tokenization and morphological analysis through part-of-speech tagging to partial parsing. Developing such tools requires a large body of text for testing purposes. We therefore begin by describing our involvement in the creation of a 30 million word corpus of Irish texts (New Corpus for Ireland). From this corpus, we randomly extracted 3,000 sentences, which we annotated and manually corrected in order to create a Gold Standard Corpus for evaluation purposes. We then present the annotation tools. Firstly, we describe how we scaled a proof-of-concept implementation of finite-state tokenization and morphological analysis, based on Xerox Finite State Tools (Uí Dhonnchadha, 2002, p146), to unrestricted text. After semi-automatic population of the finite-state morphology (FSM) lexical resources, the morphological analyser contains a lexicon of 30K lemmas which, together with a set of morphological guessers, assigns at least one morphological analysis to every token in unrestricted text. Following this, we describe our POS tagger for Irish, implemented using Constraint Grammar Disambiguation Rules and the vislcg2 software. The POS tagger currently achieves an f-score of 95% on development data and 94.35% on unseen test data. This tagger has been used to tag the 30 million word corpus of Irish. Finally, we present our implementation of partial parsing, which combines dependency analysis with an overlay of finite-state chunking. As this is, to our knowledge, the first attempt at implementing a partial parser for Irish, there were no guidelines or precedents available. The dependency analysis uses Constraint Grammar Dependency Mapping Rules, and the chunker is implemented using regular expressions and Xerox Finite-State Tools. The dependency analysis currently achieves an f-score of 93.60% on development data and 94.28% on unseen test data. The chunker achieves an f-score of 97.20% on development data and 93.50% on unseen test data.
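
The abstract reports f-scores without stating how they are computed. The following is a minimal sketch, assuming the conventional balanced f-score (the harmonic mean of precision and recall); the thesis may define the measure differently. The function name `f_score` and the counts used are illustrative and do not come from the thesis.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced f-score: harmonic mean of precision and recall.

    Assumption: this is the standard F1 definition, not necessarily
    the exact formula used in the thesis.
    """
    if true_pos == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not results from the thesis).
score = f_score(true_pos=90, false_pos=10, false_neg=10)
print(f"{score:.2%}")
```

With equal precision and recall, as in the illustrative counts, the f-score equals both values; it drops below their arithmetic mean as the two diverge.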

Item Type: Thesis (PhD)
Date of Award: March 2009
Supervisor(s): van Genabith, Josef
Uncontrolled Keywords: Irish; POS tagging; finite-state; morphology; partial parsing; constraint grammar
Subjects: Computer Science > Computational linguistics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
ID Code: 2349
Deposited On: 02 Apr 2009 18:02 by Josef Vangenabith. Last Modified 02 Apr 2009 18:02
