McCarthy, Mairéad (2003) Design and evaluation of the linguistic basis of an automatic F-structure annotation algorithm for the Penn-II treebank. Master of Science thesis, Dublin City University.
Abstract
In this thesis, we describe the design and evaluation of the linguistic basis of an automatic f-structure annotation algorithm for the Wall Street Journal (WSJ) section of the Penn-II Treebank, which consists of more than 1,000,000 words, tagged for part-of-speech information, in about 50,000 sentences and trees. We discuss the background and some of the main principles of Lexical-Functional Grammar (LFG), the theory of language we use in our application to represent the predicate-argument-modifier structure of a sentence.
We then present the guidelines for the tagging of the Penn-II Treebank, followed by a description of how the linguistics of the Penn-II Treebank relate to LFG. The automatic annotation of such Treebank grammars is difficult because the annotation rules, which explicitly encode head, complement and modifier relations, often need to identify sub-sequences in the right-hand sides of (often flat) Treebank rules. The algorithm we have developed is designed to handle these flat grammar rules. We describe the methodology used to encode the linguistic generalisations needed to annotate Treebank resources with LFG f-structure information, which, unlike previous approaches to this problem, scales up to the size of the WSJ section of the Penn-II Treebank.
Finally, we present and assess a number of automatic evaluation methodologies for measuring the effectiveness of the techniques we have developed. First, we employ a quantitative evaluation, which measures the coverage of our annotation algorithm with respect to rule types and tokens, and calculates the degree of fragmentation of the automatically generated f-structures. Second, we present a qualitative evaluation, which measures the quality of the f-structures produced against a manually constructed ‘gold standard’ set of f-structures. We conclude by summarising our work to date and outlining possibilities for further work.
Metadata
Item Type: | Thesis (Master of Science) |
---|---|
Date of Award: | 2003 |
Refereed: | No |
Supervisor(s): | Way, Andy and van Genabith, Josef |
Uncontrolled Keywords: | Lexical-functional grammar; LFG; Treebanks; Generative grammar |
Subjects: | Computer Science > Computational linguistics |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
ID Code: | 18049 |
Deposited On: | 30 Apr 2013 13:20 by Celine Campbell. Last Modified 04 Dec 2019 13:46 |