Extracting in-domain training corpora for neural machine translation using data selection methods

Cruz Silva, Catarina, Liu, Chao-Hong ORCID: 0000-0002-1235-6026, Poncelas, Alberto ORCID: 0000-0002-5089-1687 and Way, Andy ORCID: 0000-0001-5736-5930 (2018) Extracting in-domain training corpora for neural machine translation using data selection methods. In: Third Conference on Machine Translation (WMT), 31 Oct - 1 Nov 2018, Brussels, Belgium.

Abstract

Data selection is the process of choosing a subset of parallel data for training machine translation (MT) systems, so that 1) the resources needed for training might be reduced, 2) trained models could perform better than those trained on the whole corpus, and/or 3) trained models are better tailored to specific domains. It has been shown that for statistical MT (SMT), data selection significantly improves MT performance. In this study, we reviewed three data selection approaches for MT, namely Term Frequency–Inverse Document Frequency, Cross-Entropy Difference and Feature Decay Algorithm, and conducted experiments on Neural Machine Translation (NMT) using the data selected by each of the three approaches. The results showed that data selection also improved the performance of NMT systems, though the gain was smaller than for SMT systems.

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in:	Proceedings of the Third Conference on Machine Translation (WMT); Research Papers. 1. Association for Computational Linguistics.
Publisher:	Association for Computational Linguistics
Official URL:	http://dx.doi.org/10.18653/v1/W18-64023
Copyright Information:	© 2018 Association for Computational Linguistics
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant No. 13/RC/2106) and is co-funded under the European Regional Development Fund; European Union’s Horizon 2020 Research and Innovation programme under the Marie Skłodowska-Curie Actions (Grant No. 734211).
ID Code:	23338
Deposited On:	21 May 2019 15:45 by Thomas Murtagh. Last Modified 22 Jan 2021 14:17

Documents

Full text available as:

PDF (181kB): Extracting_In-domain_Training_Data_for_Neural_Machine_Translation_Using_Data_Selection_Methods[1].pdf

DORAS | DCU Research Repository
