DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

Treebank annotation with a wide-coverage head-driven phrase structure grammar

Schmidtke, Dag (2004) Treebank annotation with a wide-coverage head-driven phrase structure grammar. Master of Science thesis, Dublin City University.

Abstract
In this dissertation I investigate ways to extend the annotation of treebanks, or parsed corpora, by taking advantage of the rich and sophisticated grammatical analysis embodied in a modern, constraint-based wide-coverage grammar. As the underlying processing engine I implement a full typed feature structure inference and HPSG parsing system in C#. I develop a method for annotating a treebank with typed feature structure information using the LinGO ERG grammar, an existing wide-coverage Head-Driven Phrase Structure Grammar (HPSG). I use standard techniques to head-lexicalise and binarise the treebank, and further pre-process it to make it more compatible with the general grammatical structures assumed in HPSG. I then establish a mapping between local CFG and HPSG configurations and map local trees to HPSG phrase types. Finally, the typed feature structures associated with the local trees are combined into complete, resolved HPSG signs through constraint resolution and by applying the rules from the HPSG grammar. Discrepancies between the treebank and the HPSG grammar are analysed with respect to their implications for grammar extension, and automatic acquisition of rich lexicon entries is also investigated. The aim of this work is to develop a method of constraint-based, grammar-driven treebank annotation that combines data-driven and theory-driven approaches to NLP. With this I aim to produce a richer treebank, to demonstrate the benefits of using an existing wide-coverage grammar for treebank annotation, and to explore ways of using treebanks to extend the coverage of sophisticated wide-coverage constraint-based grammars, with possible implications for robust parsing. In experiments the annotation method achieves a coverage of 99.8% of the ATIS corpus, with 95.3% of non-fragment trees receiving a successful resolution, using a basic HPSG grammar.
Using the full LinGO ERG grammar, 68.8% of non-fragment trees are resolved, and lexical type mapping for main verbs and nouns achieves a level of detail close to that of pre-defined lexical items. In addition, several trees for which the un-annotated string cannot be parsed by the LinGO ERG grammar receive a resolution in the annotation method, and words and subcategorisation frames not in the LinGO ERG lexicon are identified and handled. Because the LinGO ERG grammar is used directly in the annotation, the resulting lexical and phrasal signs are fully LinGO-compatible and can easily be incorporated back into the grammar.
Metadata
Item Type: Thesis (Master of Science)
Date of Award: 2004
Refereed: No
Supervisor(s): van Genabith, Josef
Uncontrolled Keywords: Parsing (Computer grammar); Head-driven phrase structure grammar; Treebanks
Subjects: Computer Science > Machine translating; Humanities > Linguistics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
ID Code: 18220
Deposited On: 24 May 2013 13:32 by Celine Campbell. Last Modified: 24 May 2013 13:32
Documents

Full text available as:

Dag_Schmidtke.pdf (PDF, 2MB)