DORAS
DCU Online Research Access Service
Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation

Soto, Xabier ORCID: 0000-0002-3622-6496, Shterionov, Dimitar ORCID: 0000-0001-6300-797X, Poncelas, Alberto ORCID: 0000-0002-5089-1687 and Way, Andy ORCID: 0000-0001-5736-5930 (2020) Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation. In: Annual Conference of the Association for Computational Linguistics, ACL, 5-10 July 2020, Seattle, WA, USA (Online).

Abstract

Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance.

Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: UNSPECIFIED
Published in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (ACL).
Publisher: Association for Computational Linguistics (ACL)
Official URL: https://www.aclweb.org/anthology/2020.acl-main.359.pdf
Copyright Information: © 2020 The Authors
Funders: Spanish Ministry of Economy and Competitiveness (MINECO) FPI grant number BES-2017-081045, Science Foundation Ireland (SFI) Research Centres Programme (Grant No. 13/RC/2106), European Regional Development Fund
ID Code: 24425
Deposited On: 25 Jun 2020 15:47 by Alberto Poncelas. Last Modified: 22 Jan 2021 14:22
