Exploiting multi-word units in statistical parsing and generation

Cafferkey, Conor (2008) Exploiting multi-word units in statistical parsing and generation. Master of Science thesis, Dublin City University.

Full text available as:

PDF (583Kb)

Abstract

Syntactic parsing is an important prerequisite for many natural language processing (NLP) applications. The task refers to the process of generating the tree of syntactic nodes with associated phrase category labels corresponding to a sentence. Our objective is to improve upon statistical models for syntactic parsing by leveraging multi-word units (MWUs) such as named entities and other classes of multi-word expressions. Multi-word units are phrases that are lexically, syntactically and/or semantically idiosyncratic in that they are to at least some degree non-compositional. If such units are identified prior to, or as part of, the parsing process, their boundaries can be exploited as islands of certainty within the very large (and often highly ambiguous) search space. Luckily, certain types of MWUs can be readily identified in an automatic fashion (using a variety of techniques) to a near-human level of accuracy. We carry out a number of experiments which integrate knowledge about different classes of MWUs in several commonly deployed parsing architectures. In a supplementary set of experiments, we attempt to exploit these units in the converse operation to statistical parsing, namely statistical generation (in our case, surface realisation from Lexical-Functional Grammar f-structures). We show that, by exploiting knowledge about MWUs, certain classes of parsing and generation decisions are more accurately resolved. This translates to improvements in overall parsing and generation results which, although modest, are demonstrably significant.

Item Type: Thesis (Master of Science)
Date of Award: November 2008
Refereed: No
Supervisor(s): van Genabith, Josef
Uncontrolled Keywords: Statistical Parsing; Statistical Generation; Named Entities; Multi-Word Units
Subjects: Computer Science > Computational linguistics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders: Irish Research Council for Science Engineering and Technology, Microsoft Research
ID Code: 615
Deposited On: 10 Nov 2008 11:19 by Josef Vangenabith. Last Modified: 16 Nov 2009 17:18
