The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation
models. Professional outsourcing of bilingual data collections is costly and often not feasible. In this paper, we analyze the effect of
using crowdsourcing as a scalable way to obtain translations of target in-domain data, bearing in mind that such translations may be of
lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain
by collecting translations of texts from MOOCs from English to eleven languages, which we then use to fine-tune neural machine
translation models previously trained on general-domain data. Our results indicate that crowdsourced data collected
with proper quality controls consistently yields performance gains over both general-domain baseline systems and systems fine-tuned with
pre-existing in-domain corpora.
McCrae, John P., Chiarcos, Christian, Declerck, Thierry, Gracia, Jorge and Klimek, Bettina, (eds.)
Proceedings of the 6th Workshop on Linked Data in Linguistics (LDL-2018).
European Language Resources Association. ISBN 979-10-95546-19-1
This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:
TraMOOC project (Translation for Massive Open Online Courses), funded by the European Commission under H2020-ICT2014/H2020-ICT-2014-1, grant agreement number 644333; grant EP/L01503X/1 for the University of Edinburgh School of Informatics Centre for Doctoral Training in Pervasive Parallelism, from the UK Engineering and Physical Sciences Research Council (EPSRC).
ID Code: 23201
Deposited On: 24 Apr 2019 13:51 by Thomas Murtagh. Last Modified 20 Jan 2021 16:36