Part-of-speech tagging and partial parsing for Irish using finite-state transducers and constraint grammar

Uí Dhonnchadha, Elaine

Abstract

In this thesis, we present the development and evaluation of a suite of annotation tools for unrestricted Irish text, covering tokenization, morphological analysis, part-of-speech (POS) tagging and partial parsing. Developing such tools requires a large body of text for testing purposes. We therefore begin by describing our involvement in the creation of a 30 million word corpus of Irish texts (New Corpus for Ireland). From this corpus, we randomly extracted 3,000 sentences, which we annotated and manually corrected in order to create a Gold Standard Corpus for evaluation purposes. We then present the annotation tools. First, we describe how we scaled a proof-of-concept implementation of finite-state tokenization and morphological analysis, based on Xerox Finite-State Tools (Uí Dhonnchadha, 2002, p146), to unrestricted text. After semi-automatic population of the finite-state morphology (FSM) lexical resources, the morphological analyser contains a lexicon of 30K lemmas which, together with a set of morphological guessers, assigns at least one morphological analysis to every token in unrestricted text. Following this, we describe our POS tagger for Irish, implemented using Constraint Grammar Disambiguation Rules and the vislcg2 software. The POS tagger currently achieves an f-score of 95% on development data and 94.35% on unseen test data, and it has been used to tag the 30 million word corpus of Irish. Finally, we present our implementation of partial parsing, which is a combination of dependency analysis overlaid with finite-state chunking. As this is, to our knowledge, the first attempt at implementing a partial parser for Irish, there were no guidelines or precedents available. The dependency analysis uses Constraint Grammar Dependency Mapping Rules, and the chunker is implemented using regular expressions and Xerox Finite-State Tools. The dependency analysis currently achieves an f-score of 93.60% on development data and 94.28% on unseen test data. The chunker achieves an f-score of 97.20% on development data and 93.50% on unseen test data.
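
The abstract reports f-scores for the tagger, the dependency analysis and the chunker but does not say how they were computed. As a minimal sketch, assuming the balanced f-score (the harmonic mean of precision and recall), the calculation could look like the following. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced f-score: the harmonic mean of precision and recall.

    Assumption: the abstract does not state which f-score variant was used,
    so this sketch uses the common balanced form.
    """
    predicted = true_positives + false_positives  # items the tool proposed
    actual = true_positives + false_negatives     # items in the gold standard
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the thesis).
example = f_score(true_positives=90, false_positives=10, false_negatives=10)
print(f"{example:.2%}")
```

Precision and recall can differ for a Constraint Grammar tagger because rules may leave some tokens with more than one reading, which is one reason a combined measure such as the f-score is reported.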

Item Type:

Thesis (PhD)

Date of Award:

March 2009

Refereed:

Supervisor(s):

van Genabith, Josef

Uncontrolled Keywords:

Irish; POS tagging; finite-state; morphology; partial parsing; constraint grammar

Subjects:

Computer Science > Computational linguistics

DCU Faculties and Centres:

DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing

Use License:

This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.

ID Code:

2349

Deposited On:

02 Apr 2009 17:02 by Josef Vangenabith. Last Modified 19 Jul 2018 14:43

DORAS | DCU Research Repository
