
DORAS | DCU Research Repository

Data selection with feature decay algorithms using an approximated target side

Way, Andy (ORCID: 0000-0001-5736-5930), Poncelas, Alberto (ORCID: 0000-0002-5089-1687) and Maillette de Buy Wenniger, Gideon (2018) Data selection with feature decay algorithms using an approximated target side. In: 15th International Workshop on Spoken Language Translation (IWSLT 2018), 29-30 Apr 2018, Bruges, Belgium.

Abstract
Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data.

One of the possible data selection techniques are transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions.

In order to try to fix this problem, in this paper we propose to use an approximated target-side in addition to the source-side when selecting suitable sentence-pairs for training a model. This approximated target-side is built by pre-translating the source-side.

In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA).

We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only.
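The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ngrams`, `fda_select`, `mix_selections`), the exponential decay factor, the length normalisation, and the interleaving used to mix the two selections are all assumptions made for illustration. The sketch scores each candidate sentence by the n-grams it shares with a seed set (the test-set source, or its pre-translated approximation), and decays the value of an n-gram each time it is selected.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all n-grams (as tuples) of order 1..max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(seed_sentences, candidates, k, decay=0.5, max_n=2):
    """Greedily pick k candidate indices (hypothetical FDA-style scoring).

    seed_sentences: token lists whose n-grams define the features
                    (test-set source, or an approximated target side).
    candidates:     token lists from the matching side of the parallel data.
    decay:          assumed decay factor; a feature's value is decay**count,
                    where count is how often it was already selected.
    """
    features = {g for s in seed_sentences for g in ngrams(s, max_n)}
    counts = Counter()
    remaining = set(range(len(candidates)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            toks = candidates[i]
            feats = [g for g in ngrams(toks, max_n) if g in features]
            # Length-normalised sum of the current (decayed) feature values.
            return sum(decay ** counts[g] for g in feats) / max(len(toks), 1)

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        counts.update(g for g in ngrams(candidates[best], max_n)
                      if g in features)
    return selected


def mix_selections(src_ranked, tgt_ranked, k):
    """Interleave two ranked index lists, drop duplicates, keep the first k.

    This interleaving is an assumed way to combine the two outputs.
    """
    merged, seen = [], set()
    for a, b in zip(src_ranked, tgt_ranked):
        for idx in (a, b):
            if idx not in seen:
                seen.add(idx)
                merged.append(idx)
    return merged[:k]
```

In this sketch, `fda_select` would be run once on the source side of the parallel corpus with the test set as seed, and once on the target side with the pre-translated test set as seed; `mix_selections` then merges the two rankings into a single training subset.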
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > ADAPT
Published in: Turchi, Marco, Niehues, Jan and Frederico, Marcello (eds.) Proceedings of the 15th International Workshop on Spoken Language Translation. IWSLT.
Publisher: IWSLT
Official URL: https://workshop2018.iwslt.org/downloads/Proceedin...
Copyright Information: © 2018 The authors
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: ADAPT Centre for Digital Content Technology which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is cofunded under the European Regional Development Fund; European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 713567.
ID Code: 23879
Deposited On: 25 Oct 2019 10:20 by Andrew Way. Last Modified: 25 Oct 2019 10:20
Documents

Full text available as:

FDA_NMT_aproximated_targetside.pdf (PDF, 274kB)