Schmidtke, Dag (2004) Treebank annotation with a wide-coverage head-driven phrase structure grammar. Master of Science thesis, Dublin City University.
Abstract
In this dissertation I investigate ways to extend the annotation of treebanks, or parsed corpora, by taking advantage of the rich and sophisticated grammatical analysis embodied in a modern, constraint-based wide-coverage grammar. As the underlying processing engine I implement a full typed feature structure inference and HPSG parsing system in C#. I develop a method for annotating a treebank with typed feature structure information using the LinGO ERG grammar, an existing wide-coverage Head-Driven Phrase Structure Grammar (HPSG). I use standard techniques to head-lexicalise and binarise the treebank, and further pre-process it to make it more compatible with the general grammatical structures assumed in HPSG. I then establish a mapping between local CFG and HPSG configurations and map local trees to HPSG phrase types. Finally, the typed feature structures associated with the local trees are combined into complete resolved HPSG signs through constraint resolution and by applying the rules from the HPSG grammar. Discrepancies between the treebank and the HPSG grammar are analysed with respect to their implications for grammar extension, and the automatic acquisition of rich lexicon entries is also investigated.
The aim of this work is to develop a method of constraint-based, grammar-driven treebank annotation that combines data- and theory-driven approaches to NLP. With this I aim to produce a richer treebank, to demonstrate the benefits of using an existing wide-coverage grammar for treebank annotation, and to explore ways of using treebanks to extend the coverage of sophisticated wide-coverage constraint-based grammars, with possible implications for robust parsing.
In experiments the annotation method achieves a coverage of 99.8% of the ATIS corpus, with 95.3% of non-fragment trees receiving a successful resolution, using a basic HPSG grammar. Using the full LinGO ERG grammar, 68.8% of non-fragment trees are resolved, and lexical type mapping for main verbs and nouns achieves a level of detail close to that of pre-defined lexical items. In addition, several trees for which the un-annotated string cannot be parsed by the LinGO ERG grammar receive a resolution in the annotation method, and words and subcategorisation frames not in the LinGO ERG lexicon are identified and handled. With the direct use of the LinGO ERG grammar in the annotation, the resulting lexical and phrasal signs are fully LinGO-compatible and can easily be incorporated back into the grammar.
Metadata
Item Type: | Thesis (Master of Science) |
---|---|
Date of Award: | 2004 |
Refereed: | No |
Supervisor(s): | van Genabith, Josef |
Uncontrolled Keywords: | Parsing (Computer grammar); Head-driven phrase structure grammar; Treebanks |
Subjects: | Computer Science > Machine translating; Humanities > Linguistics |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
ID Code: | 18220 |
Deposited On: | 24 May 2013 13:32 by Celine Campbell. Last Modified 24 May 2013 13:32 |