A systematic comparison between SMT and NMT on translating user-generated content

Lohar, Pintu; Popović, Maja; Alfi, Haithem; Way, Andy

Lohar, Pintu ORCID: 0000-0002-5328-1585, Popović, Maja ORCID: 0000-0001-8234-8745, Alfi, Haithem ORCID: 0000-0002-7449-4707 and Way, Andy ORCID: 0000-0001-5736-5930 (2019) A systematic comparison between SMT and NMT on translating user-generated content. In: 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2019), 7 - 13 Apr 2019, La Rochelle, France.

Abstract

Twitter has become an immensely popular platform where users share information within a fixed character limit (280 characters), which encourages short and informal messages (tweets). In general, machine translation (MT) of tweets is a challenging task. However, for translating German tweets about football into English, it has been shown that moderate translation performance in terms of BLEU score can be achieved using phrase-based translation engines built on a tiny parallel Twitter data set [1]. In this work, we propose to further increase translation quality using neural machine translation models and the following strategies: (i) we back-translate into German a set of out-of-domain English tweets, released in 2017 as the "Harvard data set", and add the resulting synthetic parallel data to the tiny parallel data used in [1]; (ii) as tweets are generally short, we extract short text pairs from the large news-commentary parallel data and add them to the tiny Twitter parallel data set, thereby restricting the length of the out-of-genre text segments. We build both phrase-based and neural MT systems (PBMT and NMT) using these data combinations in order to perform a systematic comparison between the two approaches on translating tweets. Our experimental results reveal that the NMT system performs significantly worse than the PBMT system when only the tiny Twitter data set is used for MT training. In contrast, when additional data is used for training, the NMT system improves substantially and produces BLEU scores very similar to those of the PBMT system, even with only a few hundred thousand additional synthetic parallel text pairs.
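Strategy (ii) in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenisation and the `max_tokens` threshold are all assumptions made for illustration, since the abstract does not state how "short" was defined.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def extract_short_pairs(pairs: Iterable[Pair], max_tokens: int = 30) -> List[Pair]:
    """Keep only pairs whose two sides are non-empty and at most
    max_tokens whitespace-separated tokens long.

    The threshold of 30 is a hypothetical default, not a value from the paper.
    """
    kept: List[Pair] = []
    for src, tgt in pairs:
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        if 0 < src_len <= max_tokens and 0 < tgt_len <= max_tokens:
            kept.append((src, tgt))
    return kept


def build_training_set(twitter_pairs: Iterable[Pair],
                       extra_pairs: Iterable[Pair],
                       max_tokens: int = 30) -> List[Pair]:
    """Append length-filtered out-of-genre pairs to the in-domain Twitter pairs."""
    return list(twitter_pairs) + extract_short_pairs(extra_pairs, max_tokens)
```

Under these assumptions, `build_training_set(twitter_pairs, news_pairs)` would return the tiny Twitter set followed by only those news-commentary pairs that are short on both sides. The same concatenation step would apply to back-translated synthetic pairs from strategy (i).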

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in:	Proceedings of CICLing 2019, the 20th International Conference on Computational Linguistics and Intelligent Text Processing. Lecture Notes in Computer Science (LNCS). Springer.
Publisher:	Springer
Copyright Information:	© 2019 The Authors
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	Science Foundation Ireland through ADAPT Centre (Grant 13/RC/2106)
ID Code:	23869
Deposited On:	21 Oct 2019 14:49 by Andrew Way. Last Modified 05 May 2023 16:31

Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
637kB

DORAS | DCU Research Repository
