DORAS | DCU Research Repository

Improving machine translation of educational content via crowdsourcing

Behnke, Maximiliana, Miceli Barone, Antonio Valerio, Sennrich, Rico, Sosoni, Vilelmini, Naskos, Thanasis, Takoulidou, Eirini, Stasimioti, Maria, van Zaanen, Menno, Castilho, Sheila (ORCID: 0000-0002-8416-6555), Gaspari, Federico (ORCID: 0000-0003-3808-8418), Georgakopoulou, Panayota (ORCID: 0000-0001-9780-1813), Kordoni, Valia, Egg, Markus and Kermanidis, Katia Lida (ORCID: 0000-0002-3270-5078) (2018) Improving machine translation of educational content via crowdsourcing. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan. ISBN 979-10-95546-19-1

Abstract
The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collections is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data having in mind that the translations can be of a lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of texts from MOOCs from English to eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. The results from our research indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over general-domain baseline systems, and systems fine-tuned with pre-existing in-domain corpora.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: MOOCs; neural machine translation; crowdsourcing
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > ADAPT
Published in: McCrae, John P., Chiarcos, Christian, Declerck, Thierry, Gracia, Jorge and Klimek, Bettina, (eds.) Proceedings of the 6th Workshop on Linked Data in Linguistic (LDL-2018). European Language Resource Association. ISBN 979-10-95546-19-1
Publisher: European Language Resource Association
Official URL: https://www.aclweb.org/anthology/L18-1528
Copyright Information: © 2018 ELRA
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: TraMOOC project (Translation for Massive Open Online Courses) funded by the European Commission under H2020-ICT2014/H2020-ICT-2014-1 under grant agreement number 644333; grant EP/L01503X/1 for the University of Edinburgh School of Informatics Centre for Doctoral Training in Pervasive Parallelism from the UK Engineering and Physical Sciences Research Council (EPSRC).
ID Code: 23201
Deposited On: 24 Apr 2019 13:51 by Thomas Murtagh. Last Modified 20 Jan 2021 16:36
Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
233kB