Exploiting multi-word units in statistical parsing and generation

Cafferkey, Conor

Cafferkey, Conor (2008) Exploiting multi-word units in statistical parsing and generation. Master of Science thesis, Dublin City University.


Abstract

Syntactic parsing is an important prerequisite for many natural language processing (NLP) applications. The task refers to the process of generating the tree of syntactic nodes, with associated phrase category labels, corresponding to a sentence. Our objective is to improve upon statistical models for syntactic parsing by leveraging multi-word units (MWUs) such as named entities and other classes of multi-word expressions. Multi-word units are phrases that are lexically, syntactically and/or semantically idiosyncratic in that they are, to at least some degree, non-compositional. If such units are identified prior to, or as part of, the parsing process, their boundaries can be exploited as islands of certainty within the very large (and often highly ambiguous) search space. Luckily, certain types of MWUs can be readily identified automatically (using a variety of techniques) to a near-human level of accuracy. We carry out a number of experiments which integrate knowledge about different classes of MWUs into several commonly deployed parsing architectures. In a supplementary set of experiments, we attempt to exploit these units in the converse operation to statistical parsing: statistical generation (in our case, surface realisation from Lexical-Functional Grammar f-structures). We show that, by exploiting knowledge about MWUs, certain classes of parsing and generation decisions are more accurately resolved. This translates to improvements in overall parsing and generation results which, although modest, are demonstrably significant.
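As an illustration only, the "islands of certainty" idea can be sketched as a filter on candidate constituent spans: any span that partially overlaps a known MWU is discarded before parsing. This is a minimal sketch under stated assumptions, not the method used in the thesis. The names `crosses` and `allowed_spans`, the half-open span convention, and the example sentence are all hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end); an assumed convention


def crosses(span: Span, unit: Span) -> bool:
    """Return True if the two spans partially overlap (neither contains the other)."""
    (i, j), (a, b) = span, unit
    return i < a < j < b or a < i < b < j


def allowed_spans(n_tokens: int, units: List[Span]) -> List[Span]:
    """Enumerate every candidate span that does not cross an MWU boundary."""
    return [
        (i, j)
        for i in range(n_tokens)
        for j in range(i + 1, n_tokens + 1)
        if not any(crosses((i, j), u) for u in units)
    ]


# Hypothetical example: "Dublin City University" is treated as one MWU (tokens 0..2).
tokens = ["Dublin", "City", "University", "is", "in", "Ireland"]
mwus = [(0, 3)]
spans = allowed_spans(len(tokens), mwus)
```

In this toy case, 6 of the 21 possible spans cross the MWU boundary and are pruned, leaving 15 candidates. Spans nested inside the unit are kept here, which is a design choice of the sketch; a parser that treats MWUs as atomic would prune those as well.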

Metadata

Item Type:	Thesis (Master of Science)
Date of Award:	November 2008
Refereed:	No
Supervisor(s):	van Genabith, Josef
Uncontrolled Keywords:	Statistical Parsing; Statistical Generation; Named Entities; Multi-Word Units
Subjects:	Computer Science > Computational linguistics
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders:	Irish Research Council for Science Engineering and Technology, Microsoft Research
ID Code:	615
Deposited On:	10 Nov 2008 11:19 by Josef Vangenabith. Last Modified: 16 Nov 2009 17:18

Documents

Full text available as:

PDF (597kB)
