Handling named entities and compound verbs in phrase-based statistical machine translation

Pal, Santanu, Kumar Naskar, Sudip, Pecina, Pavel, Bandyopadhyay, Sivaji and Way, Andy (ORCID: 0000-0001-5736-5930) (2010) Handling named entities and compound verbs in phrase-based statistical machine translation. In: MWE 2010 - Workshop on Multiword Expressions: from Theory to Applications, 28 August 2010, Beijing, China.

Abstract
Data preprocessing plays a crucial role in phrase-based statistical machine translation (PB-SMT). In this paper, we show how single-tokenization of two types of multi-word expressions (MWE), namely named entities (NE) and compound verbs, as well as their prior alignment, can boost the performance of PB-SMT. Single-tokenization of compound verbs and NEs provides significant gains over the baseline PB-SMT system. Automatic alignment of NEs substantially improves overall MT performance and, indirectly, word alignment quality. To establish NE alignments, we transliterate source NEs into the target language and compare them with the target NEs, which are first converted into a canonical form. Our best system achieves statistically significant improvements (4.59 BLEU points absolute, 52.5% relative improvement) on an English-Bangla translation task.
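The single-tokenization step described in the abstract can be illustrated with a minimal sketch. It assumes the multi-word expressions have already been identified by some upstream tool and are supplied as token sequences; the function name `single_tokenize`, the underscore joiner, and the greedy longest-match strategy are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def single_tokenize(tokens: Sequence[str],
                    mwes: Iterable[Sequence[str]],
                    joiner: str = "_") -> List[str]:
    """Merge each listed multi-word expression into a single token.

    Hypothetical helper: the MWE list is assumed to come from an
    external NE recognizer or compound-verb detector.
    """
    # Only expressions of two or more tokens need merging.
    mwe_set: Set[Tuple[str, ...]] = {tuple(m) for m in mwes if len(m) > 1}
    max_len = max((len(m) for m in mwe_set), default=1)

    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so that overlapping entries
        # resolve to the longer expression.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwe_set:
                out.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No expression starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


sentence = "he visited New York City yesterday".split()
entities = [("New", "York"), ("New", "York", "City")]
print(single_tokenize(sentence, entities))
# ['he', 'visited', 'New_York_City', 'yesterday']
```

Applied to both sides of a parallel corpus before word alignment, a step of this kind makes each expression a single alignment unit; the joiner would be replaced by spaces again after decoding.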
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Refereed: Yes
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in: Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications. Association for Computational Linguistics.
Publisher: Association for Computational Linguistics
Official URL: http://www.aclweb.org/anthology/W/W10/W10-3707.pdf
Copyright Information: © 2010 Association for Computational Linguistics
Funders: Science Foundation Ireland
ID Code: 15810
Deposited On: 10 Nov 2010 16:25 by Shane Harper. Last Modified: 09 Nov 2018 14:31