DORAS | DCU Research Repository

Bilingually motivated word segmentation for statistical machine translation

Ma, Yanjun and Way, Andy (ORCID: 0000-0001-5736-5930) (2009) Bilingually motivated word segmentation for statistical machine translation. ACM Transactions on Asian Language Information Processing, 8 (2). ISSN 1530-0226

Abstract
We introduce a bilingually motivated word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Our approach is motivated by the insight that PB-SMT systems can be improved by optimizing the input representation to reduce the predictive power of translation models. We first present an approach to optimize the existing segmentation of both source and target languages for PB-SMT and demonstrate the effectiveness of this approach using a Chinese–English MT task, that is, to measure the influence of the segmentation on the performance of PB-SMT systems. We report a 5.44% relative increase in BLEU score and a consistent increase according to other metrics. We then generalize this method for Chinese word segmentation without relying on any segmenters and show that, using our segmentation, PB-SMT can achieve more consistent state-of-the-art performance across two domains. There are two main advantages of our approach. First, it is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, this approach does not rely on manually segmented training data, so it can be automatically adapted for different domains.
Metadata
Item Type: Article (Published)
Refereed: Yes
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); Research Institutes and Centres > National Centre for Language Technology (NCLT)
Publisher: Association for Computing Machinery
Official URL: http://dx.doi.org/10.1145/1526252.1526255
Copyright Information: © 2009 ACM
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland
ID Code: 15814
Deposited On: 29 Nov 2010 16:10 by Shane Harper. Last Modified 14 Nov 2018 16:31
Documents

Full text available as:

Bilingually_Motivated_Word_Segmentation_for_SMT.pdf (PDF, 572kB)