Uí Dhonnchadha, Elaine (2009) Part-of-speech tagging and partial parsing for Irish using finite-state transducers and constraint grammar. PhD thesis, Dublin City University.
Abstract
In this thesis, we present the development and evaluation of a suite of annotation tools for unrestricted Irish text, ranging from tokenization and morphological analysis through part-of-speech tagging to partial parsing. Developing such tools requires a large body of text for testing purposes. We therefore begin by describing our involvement in the creation of a 30 million word corpus of Irish texts (New Corpus for Ireland). From this corpus, we randomly extracted 3,000 sentences, which we annotated and manually corrected in order to create a Gold Standard Corpus for evaluation purposes. We then present the annotation tools. Firstly, we describe scaling a proof-of-concept implementation of finite-state tokenization and morphological analysis, based on Xerox Finite-State Tools (Uí Dhonnchadha, 2002, p. 146), to unrestricted text. After semi-automatic population of the finite-state morphology (FSM) lexical resources, the morphological analyser contains a lexicon of 30K lemmas, which, together with a set of morphological guessers, assigns at least one morphological analysis to every token in unrestricted text. Following this, we describe our POS tagger for Irish, implemented using Constraint Grammar disambiguation rules and the vislcg2 software. The POS tagger currently achieves an f-score of 95% on development data and 94.35% on unseen test data. This tagger has been used to tag the 30 million word corpus of Irish. Finally, we present our implementation of partial parsing, which combines dependency analysis with finite-state chunking. As this is, to our knowledge, the first attempt at implementing a partial parser for Irish, there were no guidelines or precedents available. The dependency analysis uses Constraint Grammar dependency mapping rules, and the chunker is implemented using regular expressions and Xerox Finite-State Tools. The dependency analysis currently achieves an f-score of 93.60% on development data and 94.28% on unseen test data. The chunker achieves an f-score of 97.20% on development data and 93.50% on unseen test data.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | March 2009 |
Refereed: | No |
Supervisor(s): | van Genabith, Josef |
Uncontrolled Keywords: | Irish; POS tagging; finite-state; morphology; partial parsing; constraint grammar |
Subjects: | Computer Science > Computational linguistics |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
ID Code: | 2349 |
Deposited On: | 02 Apr 2009 17:02 by Josef Vangenabith. Last Modified 19 Jul 2018 14:43 |
Documents
Full text available as:
PDF (2MB)