Pinyin as subword unit for Chinese-sourced neural machine translation

Du, Jinhua; Way, Andy

Du, Jinhua ORCID: 0000-0002-3267-4881 and Way, Andy ORCID: 0000-0001-5736-5930 (2017) Pinyin as subword unit for Chinese-sourced neural machine translation. In: 25th Irish Conference on Artificial Intelligence and Cognitive Science (AICS 2017), 7-8 Dec 2017, Dublin, Ireland.

Abstract

Handling unknown words (UNKs), also known as the open-vocabulary problem, is a challenge for neural machine translation (NMT). For alphabetic languages such as English, German and French, splitting a word into subwords, for example with the byte pair encoding (BPE) algorithm, is an effective way to alleviate the UNK problem. However, for stroke-based languages such as Chinese, this method is not effective enough to benefit translation quality. In this paper, we propose to use Pinyin, a romanization system for Chinese characters, to convert Chinese characters into subword units and so alleviate the UNK problem. We first investigate how Pinyin and its four diacritics denoting tones affect the translation performance of NMT systems, and then propose different strategies that use Pinyin and tones as input factors for Chinese–English NMT. Extensive experiments on Chinese–English translation demonstrate that the proposed methods markedly improve translation quality and effectively alleviate the UNK problem for Chinese-sourced translation.
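As a rough illustration of the idea in the abstract, the sketch below maps Chinese characters to toneless Pinyin and carries the tone as a separate factor. It is not the authors' implementation: the abstract does not describe their strategies in detail, so the `token|tone` factor notation, the function names and the five-entry `TOY_PINYIN` table are all assumptions made for this example.

```python
import unicodedata

# Combining marks for the four Pinyin tones (macron, acute, caron, grave).
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

# Hypothetical toy lookup, for illustration only. A real system would need a
# full character-to-Pinyin dictionary and a way to handle polyphonic characters.
TOY_PINYIN = {"中": "zhōng", "国": "guó", "我": "wǒ", "爱": "ài", "女": "nǚ"}


def split_tone(syllable):
    """Return (toneless syllable, tone number); 0 means no tone mark was found."""
    tone = 0
    kept = []
    # NFD separates base letters from combining marks, so the tone mark can be
    # removed while other diacritics (the diaeresis on "ü") are kept.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept)), tone


def to_factored_tokens(text, sep="|"):
    """Map each character to 'pinyin<sep>tone'; unknown characters pass through with tone 0."""
    tokens = []
    for ch in text:
        if ch in TOY_PINYIN:
            base, tone = split_tone(TOY_PINYIN[ch])
            tokens.append(f"{base}{sep}{tone}")
        else:
            tokens.append(f"{ch}{sep}0")
    return tokens


print(" ".join(to_factored_tokens("我爱中国")))
```

Because the toneless syllables are plain Latin-script strings, a subword algorithm such as BPE could in principle be applied to them afterwards; whether and how the paper does this is not stated in the abstract.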

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in:	McAuley, John and McKeever, Susan, (eds.) Proceedings of the 25th Irish Conference on Artificial Intelligence and Cognitive Science. 2086. CEUR-WS.
Publisher:	CEUR-WS
Official URL:	http://ceur-ws.org/Vol-2086/AICS2017_paper_14.pdf
Copyright Information:	© 2017 the Authors
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106), SFI Industry Fellowship Programme 2016 (Grant 16/IFB/4490)
ID Code:	23197
Deposited On:	17 Apr 2019 14:19. Last Modified: 17 Apr 2019 14:19

Documents

Full text available as:

PDF, 245kB

DORAS | DCU Research Repository
