DORAS | DCU Research Repository

Neural machine translation between similar south-Slavic languages

Popović, Maja (ORCID: 0000-0001-8234-8745) and Poncelas, Alberto (ORCID: 0000-0002-5089-1687) (2020) Neural machine translation between similar south-Slavic languages. In: 2020 Fifth Conference on Machine Translation (WMT20), 19-20 Nov 2020, Dominican Republic (Online).

This paper describes the ADAPT-DCU machine translation systems built for the WMT 2020 shared task on Similar Language Translation. We explored several set-ups for NMT for the Croatian–Slovenian and Serbian–Slovenian language pairs in both translation directions. Our experiments focus on different amounts and types of training data: we first apply basic filtering to the OpenSubtitles training corpora, then perform additional cleaning of the remaining misaligned segments based on character n-gram matching. Finally, we make use of additional monolingual data by creating synthetic parallel data through back-translation. Automatic evaluation shows that multilingual systems trained on joint Serbian and Croatian data outperform bilingual ones, and that character-based cleaning improves scores while using less data. The results also confirm once more that adding back-translated data further improves performance, especially when the synthetic data is similar to the domain of the development and test sets. This, however, might come at the price of prolonged training time, especially for multi-target systems.
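As a rough illustration of cleaning misaligned segments by character n-gram matching, the sketch below scores each sentence pair by the overlap of its character trigrams and drops pairs that fall below a threshold. This is not the authors' implementation: the function names, the Dice-style score, the trigram order and the threshold of 0.2 are all assumptions chosen for the example. The idea relies only on the fact that closely related languages tend to share many character sequences, so a correctly aligned pair should usually overlap more than a misaligned one.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return a multiset of character n-grams, ignoring whitespace and case."""
    chars = "".join(text.lower().split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def ngram_overlap(src, tgt, n=3):
    """Dice-style overlap in [0, 1] between the character n-grams of two strings.

    Hypothetical scoring function for illustration; the paper may use a
    different measure.
    """
    a, b = char_ngrams(src, n), char_ngrams(tgt, n)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum((a & b).values())  # multiset intersection
    return 2 * shared / total


def filter_pairs(pairs, threshold=0.2, n=3):
    """Keep only (source, target) pairs whose overlap reaches the threshold.

    The threshold value is an assumption and would need tuning on real data.
    """
    return [(s, t) for s, t in pairs if ngram_overlap(s, t, n) >= threshold]


if __name__ == "__main__":
    candidate_pairs = [
        ("Hvala lijepa", "Hvala lepa"),       # plausibly aligned pair
        ("Hvala lijepa", "Kje je postaja?"),  # plausibly misaligned pair
    ]
    for src, tgt in candidate_pairs:
        print(f"{ngram_overlap(src, tgt):.3f}\t{src}\t{tgt}")
```

A score-and-threshold filter like this is cheap to run over a large subtitle corpus, but it only makes sense when the source and target languages share a script and much of their vocabulary.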
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Subjects: Computer Science > Computational linguistics
Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > ADAPT
Published in: Fifth Conference on Machine Translation (at EMNLP-2020). Association for Computational Linguistics (ACL).
Publisher: Association for Computational Linguistics (ACL)
Official URL: https://www.aclweb.org/anthology/2020.wmt-1.51
Copyright Information: © 2020 The Authors. CC-BY-4.0
Funders: Science Foundation Ireland through the SFI Research Centres Programme 13/RC/2106, European Regional Development Fund (ERDF)
ID Code: 25080
Deposited On: 18 Nov 2020 17:01 by Alberto Poncelas. Last Modified 25 Jun 2021 13:02

Full text available as: WMT2020.pdf (PDF)

