Parsing with automatically acquired, wide-coverage, robust, probabilistic LFG approximations
Cahill, Aoife (2004) Parsing with automatically acquired, wide-coverage, robust, probabilistic LFG approximations. PhD thesis, Dublin City University.
Traditionally, rich, constraint-based grammatical resources have been hand-coded. Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. The work reported in this thesis is part of a larger project to automate, as far as possible, the construction of wide-coverage, deep, constraint-based grammatical resources from treebanks. The Penn-II treebank is a large collection of parse-annotated newspaper text. We have designed a Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982) f-structure annotation algorithm to automatically annotate this treebank with f-structure information approximating basic predicate-argument or dependency structures (Cahill et al., 2002c, 2004a). We then use the f-structure-annotated treebank resource to automatically extract grammars and lexical resources for parsing new text into f-structures. We have designed and implemented the Treebank Tool Suite (TTS) to support the linguistic work that seeds the automatic f-structure annotation algorithm (Cahill and van Genabith, 2002) and the F-Structure Annotation Tool (FSAT) to validate and visualise the results of automatic f-structure annotation. We have designed and implemented two PCFG-based probabilistic parsing architectures for parsing unseen text into f-structures: the pipeline and the integrated model. Both architectures parse raw text into basic, but possibly incomplete, predicate-argument structures (“proto f-structures”) with long distance dependencies (LDDs) unresolved (Cahill et al., 2002c). We have designed and implemented a method for automatically resolving LDDs at f-structure level based on a finite approximation of functional uncertainty equations (Kaplan and Zaenen, 1989) automatically acquired from the f-structure-annotated treebank resource (Cahill et al., 2004b).
To date, the best result achieved by our own Penn-II induced grammars is a dependency f-score of 80.33% against the PARC 700, an improvement of 0.73% over the best hand-crafted grammar of Kaplan et al. (2004). The processing architecture developed in this thesis is highly flexible: using external, state-of-the-art parsing technologies (Charniak, 2000) in our pipeline model, we achieve a dependency f-score of 81.79% against the PARC 700, an improvement of 2.19% over the results reported in Kaplan et al. (2004). We have also ported our grammar induction methodology to German and the TIGER treebank resource (Cahill et al., 2003a). We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parses) and parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. We believe that our approach successfully addresses the knowledge-acquisition bottleneck (familiar from rule-based approaches to AI and NLP) in wide-coverage, constraint-based grammar development. Our approach can provide an attractive, wide-coverage, multilingual, deep, constraint-based grammar acquisition paradigm.
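For readers unfamiliar with the metric, a dependency f-score of the kind quoted above is conventionally the harmonic mean of precision and recall over dependency triples. The following is a minimal sketch under that assumption, not the evaluation software used in the thesis; the function name, the (relation, head, dependent) triple layout and the toy triples are all hypothetical and are not drawn from the PARC 700.

```python
from typing import Set, Tuple

# Assumed representation: a dependency is a (relation, head, dependent) triple.
Triple = Tuple[str, str, str]


def dependency_f_score(gold: Set[Triple], predicted: Set[Triple]) -> float:
    """Return the harmonic mean of precision and recall over triples."""
    if not gold or not predicted:
        return 0.0
    correct = len(gold & predicted)          # triples the parser got right
    precision = correct / len(predicted)     # share of predicted triples that are correct
    recall = correct / len(gold)             # share of gold triples that were found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up triples for illustration only.
gold = {
    ("subj", "sleep", "dog"),
    ("det", "dog", "the"),
    ("adjunct", "sleep", "soundly"),
}
predicted = {
    ("subj", "sleep", "dog"),
    ("det", "dog", "the"),
    ("obj", "sleep", "soundly"),  # wrong relation label, so it counts as an error
}
score = dependency_f_score(gold, predicted)
```

In this toy case two of the three triples match on each side, so precision and recall are both 2/3 and the f-score is also 2/3.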