Extracting in-domain training corpora for neural machine translation using data selection methods
Cruz Silva, Catarina, Liu, Chao-Hong (ORCID: 0000-0002-1235-6026), Poncelas, Alberto (ORCID: 0000-0002-5089-1687) and Way, Andy (ORCID: 0000-0001-5736-5930) (2018) Extracting in-domain training corpora for neural machine translation using data selection methods. In: Third Conference on Machine Translation (WMT), 31 Oct - 1 Nov 2018, Brussels, Belgium.
Data selection is the process of choosing a subset of parallel data for training machine translation (MT) systems, so that 1) the resources needed for training can be reduced, 2) the trained models can perform better than those trained on the whole corpus, and/or 3) the trained models are better tailored to specific domains. For statistical MT (SMT), data selection has been shown to improve MT performance significantly. In this study, we reviewed three data selection approaches for MT, namely Term Frequency–Inverse Document Frequency, Cross-Entropy Difference and the Feature Decay Algorithm, and conducted Neural Machine Translation (NMT) experiments on the data selected by each of the three approaches. The results showed that data selection also improved the performance of NMT systems, though the gain was smaller than for SMT systems.
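As an illustration of one of the three approaches named above, the following is a minimal sketch of Cross-Entropy Difference selection in its commonly described form: each candidate sentence is scored by its per-token cross-entropy under an in-domain language model minus its cross-entropy under a general-domain model, and the lowest-scoring sentences are kept. This is not the setup used in the paper. The add-one smoothed unigram models, the whitespace tokenisation, and the names `train_unigram`, `cross_entropy` and `select_by_ced` are all assumptions made for the example; practical systems typically use stronger n-gram or neural language models and may score both sides of the parallel data.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total tokens) for a whitespace-tokenised corpus."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def select_by_ced(in_domain, general, top_n):
    """Rank general sentences by H_in(s) - H_gen(s) and keep the lowest top_n."""
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general)
    # Shared vocabulary size for smoothing; +1 reserves mass for unseen tokens.
    vocab_size = len(set(in_model[0]) | set(gen_model[0])) + 1
    ranked = sorted(
        general,
        key=lambda s: cross_entropy(s, in_model, vocab_size)
        - cross_entropy(s, gen_model, vocab_size),
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Toy, hypothetical data: a tiny "medical" in-domain set and a mixed pool.
    in_domain = ["the patient received a dose", "the dose was reduced"]
    general = [
        "the patient was given a dose",
        "the football match was postponed",
        "stock prices fell sharply",
    ]
    for sentence in select_by_ced(in_domain, general, top_n=2):
        print(sentence)
```

A lower score means the sentence looks more like the in-domain data than like the general pool, so the medical sentence in the toy pool ranks first.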
This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:
ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant No. 13/RC/2106) and is co-funded under the European Regional Development Fund; European Union’s Horizon 2020 Research and Innovation programme under the Marie Skłodowska-Curie Actions (Grant No. 734211).
ID Code:
23338
Deposited On:
21 May 2019 15:45 by Thomas Murtagh. Last Modified 22 Jan 2021 14:17