Bilingually motivated domain-adapted word segmentation for statistical machine translation
Ma, Yanjun and Way, Andy (2009) Bilingually motivated domain-adapted word segmentation for statistical machine translation. In: EACL 2009 Workshop on Computational Approaches to Semitic Languages, 31 March 2009, Athens, Greece.
We introduce a word segmentation approach for languages in which word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First, our approach is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, it does not rely on manually segmented training data, so it can be automatically adapted to different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that it scores consistently among the best results across different data conditions.
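
The abstract does not spell out the procedure, so the following is only a rough sketch of the general idea of deriving segmentation from word alignments, not the method of the paper. It assumes each unsegmented sentence is first split into characters, that an external aligner has linked each character to at most one target-word index, and that consecutive characters linked to the same target word form a candidate word. The function names (`candidate_words`, `build_lexicon`, `segment`), the frequency threshold `min_count`, the greedy longest-match resegmentation, and the Chinese toy data are all hypothetical choices made for illustration.

```python
from collections import Counter

def candidate_words(chars, alignment):
    """Group consecutive characters that are aligned to the same target word.

    chars: list of source characters.
    alignment: dict mapping character index -> target word index
               (unaligned characters are simply absent; assumed input format).
    """
    groups, start = [], 0
    for i in range(1, len(chars) + 1):
        at_end = i == len(chars)
        if at_end or alignment.get(i) is None or alignment.get(i) != alignment.get(i - 1):
            groups.append("".join(chars[start:i]))
            start = i
    return groups

def build_lexicon(corpus, min_count=2):
    """Keep multi-character candidates seen at least min_count times (assumed filter)."""
    counts = Counter()
    for chars, alignment in corpus:
        for word in candidate_words(chars, alignment):
            if len(word) > 1:
                counts[word] += 1
    return {word for word, count in counts.items() if count >= min_count}

def segment(text, lexicon):
    """Greedy longest-match segmentation; unknown characters stay single."""
    max_len = max((len(word) for word in lexicon), default=1)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                out.append(piece)
                i += n
                break
    return out

# Toy corpus: character lists with made-up alignment links.
corpus = [
    (list("机器翻译"), {0: 0, 1: 0, 2: 1, 3: 1}),
    (list("机器学习"), {0: 0, 1: 0, 2: 1, 3: 1}),
]
lexicon = build_lexicon(corpus, min_count=2)
segmented = segment("机器翻译", lexicon)
```

Because the lexicon is built from the bilingual training data itself, swapping in a corpus from another domain would yield a different lexicon without any manual segmentation, which is the property the abstract emphasizes.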