Handling named entities and compound verbs in phrase-based statistical machine translation

Pal, Santanu and Kumar Naskar, Sudip and Pecina, Pavel and Bandyopadhyay, Sivaji and Way, Andy (2010) Handling named entities and compound verbs in phrase-based statistical machine translation. In: MWE 2010 - Workshop on Multiword Expressions: from Theory to Applications, 28 August 2010, Beijing, China.


Data preprocessing plays a crucial role in phrase-based statistical machine translation (PB-SMT). In this paper, we show how single-tokenization of two types of multi-word expressions (MWEs), namely named entities (NEs) and compound verbs, together with their prior alignment, can boost the performance of PB-SMT. Single-tokenization of compound verbs and NEs provides significant gains over the baseline PB-SMT system. Automatic alignment of NEs substantially improves overall MT performance and, indirectly, word alignment quality. To establish NE alignments, we transliterate source NEs into the target language and compare them with the target NEs, which are first converted into a canonical form. Our best system achieves statistically significant improvements (4.59 BLEU points absolute, 52.5% relative) on an English-Bangla translation task.
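
The two preprocessing ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the underscore joiner, the similarity threshold, and the use of `difflib.SequenceMatcher` as the comparison measure are all assumptions made for illustration. The transliteration and canonicalization steps are passed in as caller-supplied functions because the abstract does not specify them.

```python
from difflib import SequenceMatcher


def single_tokenize(tokens, mwes, joiner="_"):
    """Merge each occurrence of a known multi-word expression into one token.

    `mwes` is a collection of token sequences (hypothetical input format).
    At each position the longest matching expression wins. The underscore
    joiner is an assumption of this sketch, not taken from the paper.
    """
    by_length = sorted({tuple(m) for m in mwes}, key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for mwe in by_length:
            if len(mwe) > 1 and tuple(tokens[i:i + len(mwe)]) == mwe:
                out.append(joiner.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def align_named_entities(source_nes, target_nes, transliterate,
                         canonicalize, threshold=0.8):
    """Pair each source NE with its most similar target NE.

    `transliterate` maps a source NE into the target script and
    `canonicalize` normalizes a target NE; both are supplied by the caller.
    String similarity via SequenceMatcher and the 0.8 threshold are
    stand-ins chosen for this sketch.
    """
    pairs = []
    for src in source_nes:
        candidate = transliterate(src)
        best, best_score = None, 0.0
        for tgt in target_nes:
            score = SequenceMatcher(None, candidate, canonicalize(tgt)).ratio()
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and best_score >= threshold:
            pairs.append((src, best))
    return pairs
```

For example, `single_tokenize(["He", "lives", "in", "New", "Delhi"], [("New", "Delhi")])` yields `["He", "lives", "in", "New_Delhi"]`, so the phrase extractor sees the named entity as a single unit.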

Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: Research Initiatives and Centres > Centre for Next Generation Localisation (CNGL)
DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in: Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications. Association for Computational Linguistics.
Publisher: Association for Computational Linguistics
Official URL:
Copyright Information: © 2010 Association for Computational Linguistics
Funders: Science Foundation Ireland
ID Code: 15810
Deposited On: 10 Nov 2010 16:25 by Shane Harper. Last Modified 10 Nov 2010 16:25
