Data selection with feature decay algorithms using an approximated target side

Poncelas, Alberto; Maillette de Buy Wenniger, Gideon; Way, Andy

Poncelas, Alberto ORCID: 0000-0002-5089-1687, Maillette de Buy Wenniger, Gideon and Way, Andy ORCID: 0000-0001-5736-5930 (2018) Data selection with feature decay algorithms using an approximated target side. In: The 15th International Workshop on Spoken Language Translation 2018, 29-30 Oct 2018, Bruges, Belgium.

Abstract

Data selection techniques applied to neural machine translation (NMT) aim to improve model performance by retrieving a subset of sentences for use as training data. One family of such techniques is transductive learning methods, which select data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source side of the test set does not by itself guarantee that the selected sentences have correct translations, or translations that suit the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. To address this problem, we propose using an approximated target side in addition to the source side when selecting suitable sentence pairs for training a model. This approximated target side is built by pretranslating the source side. In this work, we explore this general idea for one specific data selection approach, Feature Decay Algorithms (FDA). We train German-English NMT models on data selected using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of FDA outputs (from the test set and from the approximated target side) perform better than those using the test set alone. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only.
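The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the FDA toolkit: it assumes a simplified scoring in which every test-set n-gram starts with weight 1 and is multiplied by a decay factor each time it appears in an already selected sentence, with scores normalized by sentence length. The function names (ngrams, fda_select), the decay value 0.5, the toy corpus, and the union used to mix the two selections are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all n-grams of order 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(seed_sentences, candidates, k, decay=0.5, max_n=3):
    """Greedily pick k candidate indices (simplified FDA-style scoring).

    Assumption: each seed n-gram has initial weight 1 and its weight is
    multiplied by `decay` for every occurrence in the data selected so far.
    """
    features = {g for s in seed_sentences for g in ngrams(s.split(), max_n)}
    # Keep only the n-grams of each candidate that also occur in the seed.
    cand_feats = [[g for g in ngrams(c.split(), max_n) if g in features]
                  for c in candidates]
    selected_counts = Counter()  # occurrences of each feature in selected data
    remaining = set(range(len(candidates)))
    chosen = []

    def score(i):
        length = max(len(candidates[i].split()), 1)
        return sum(decay ** selected_counts[g] for g in cand_feats[i]) / length

    while remaining and len(chosen) < k:
        # Ties are broken in favor of the lowest index.
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
        selected_counts.update(cand_feats[best])
    return chosen


# Toy German-English corpus (hypothetical); pair 0 has a mismatched translation.
corpus = [
    ("das ist ein test", "nice weather today"),
    ("das ist gut", "this is good"),
    ("ich mag katzen", "i like cats"),
]
test_source = ["das ist ein haus"]
# Stand-in for the pretranslated test set; in practice an MT system produces it.
approx_target = ["this is a house"]

src_rank = fda_select(test_source, [s for s, _ in corpus], k=1)
tgt_rank = fda_select(approx_target, [t for _, t in corpus], k=1)
# One possible mixture (an assumption): order-preserving union of both selections.
combined = list(dict.fromkeys(src_rank + tgt_rank))
print(src_rank, tgt_rank, combined)
```

In this toy case the source-side selection picks the pair whose translation does not match, while the selection driven by the approximated target side picks a pair whose English side overlaps with the pretranslation, which is the motivation the abstract gives for using both sides.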

Metadata

Item Type:	Conference or Workshop Item (Poster)
Event Type:	Workshop
Refereed:	Yes
Uncontrolled Keywords:	Machine Translation; Statistical Machine Translation; Neural Machine Translation
Subjects:	UNSPECIFIED
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
ID Code:	22883
Deposited On:	19 Dec 2018 12:47 by Gideon Maillette De buy
Last Modified:	05 Nov 2019 09:29

Documents

Full text available as:

PDF (312kB)

DORAS | DCU Research Repository