O'Donovan, Ruth (2006) Automatic extraction of large-scale multilingual lexical resources. PhD thesis, Dublin City University.
Abstract
In this thesis, I present a methodology for treebank- or parser-based acquisition of lexical resources, in particular subcategorisation frames. The method uses an automatic Lexical Functional Grammar (LFG) f-structure annotation algorithm (Cahill et al., 2002a, 2004a; Burke et al., 2004b) and has been applied to the Penn-II and Penn-III treebanks (Marcus et al., 1994), with a total of about 1.3 million words, as well as to (a subset of) the British National Corpus (Bernard, 2002), with about 90 million words.
I extract abstract syntactic function-based subcategorisation frames (LFG semantic forms), traditional CFG category-based subcategorisation frames, and mixed function/category-based frames, with or without preposition information for obliques and particle information for subcategorised particles. The approach distinguishes between active and passive frames, and reflects the effects of long-distance dependencies (LDDs) in the source data structures. Frames are associated with conditional probabilities, which makes it possible to optimise the extracted lexicon for quality or coverage through filtering. In contrast to many other approaches, subcategorisation frame types are not predefined but acquired from the source data.
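The idea of attaching conditional probabilities to frames and then filtering can be illustrated with a minimal sketch. It assumes a simple relative-frequency estimate of P(frame | lemma) and a single probability threshold; the abstract does not state the estimator or the filter actually used, and the function names, the tuple encoding of frames, and the toy counts below are all hypothetical.

```python
from collections import Counter


def frame_probabilities(observations):
    """Estimate P(frame | lemma) by relative frequency (an assumed estimator).

    `observations` is a list of (lemma, frame) pairs, one per observed
    verb occurrence. Frames are plain tuples of grammatical functions here,
    a hypothetical encoding chosen only for this illustration.
    """
    pair_counts = Counter(observations)
    lemma_counts = Counter(lemma for lemma, _ in observations)
    probs = {}
    for (lemma, frame), n in pair_counts.items():
        probs.setdefault(lemma, {})[frame] = n / lemma_counts[lemma]
    return probs


def filter_lexicon(probs, threshold):
    """Keep only frames whose conditional probability reaches the threshold.

    A higher threshold favours quality (fewer spurious frames); a lower one
    favours coverage (more frames retained).
    """
    return {
        lemma: {f: p for f, p in frames.items() if p >= threshold}
        for lemma, frames in probs.items()
    }


# Toy data (invented counts, not taken from the thesis).
observations = (
    [("give", ("subj", "obj", "obj2"))] * 6
    + [("give", ("subj", "obj", "obl:to"))] * 3
    + [("give", ("subj",))] * 1
)

probs = frame_probabilities(observations)
high_precision = filter_lexicon(probs, threshold=0.2)
```

With these toy counts, raising the threshold to 0.2 drops the rare one-argument frame for "give" and keeps the two ditransitive patterns; lowering it to 0.0 keeps every observed frame.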
I carried out large-scale evaluations of the complete set of forms extracted against the COMLEX and OALD resources. To my knowledge, this is the largest and most complete evaluation of subcategorisation frames for English. The parser-based system is also evaluated against Korhonen (2002) with a statistically significant improvement over the previous best score.
The automatic annotation methodology, as well as the grammar and lexicon extraction techniques for English, have been successfully migrated to Spanish, German and Chinese treebanks despite typological differences and variations in treebank encoding. I believe that this approach provides an attractive and efficient multilingual grammar and lexicon development paradigm.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | 2006 |
Refereed: | No |
Supervisor(s): | van Genabith, Josef and Way, Andy |
Uncontrolled Keywords: | Treebanks; Lexical Functional Grammar; LFG |
Subjects: | Computer Science > Computational linguistics; Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
ID Code: | 18134 |
Deposited On: | 10 May 2013 10:26 by Celine Campbell. Last Modified: 25 Jan 2019 12:03 |
Documents
Full text available as: PDF (4MB)