DORAS | DCU Research Repository

Neural machine translation for translating into Croatian and Serbian

Popović, Maja (ORCID: 0000-0001-8234-8745), Poncelas, Alberto (ORCID: 0000-0002-5089-1687), Brkić Bakarić, Marija (ORCID: 0000-0003-4079-4012) and Way, Andy (ORCID: 0000-0001-5736-5930) (2020) Neural machine translation for translating into Croatian and Serbian. In: 7th Workshop on NLP for Similar Languages, Varieties and Dialects, 13 Dec 2020, Barcelona, Spain (online).

Abstract
In this work, we systematically investigate different set-ups for training neural machine translation (NMT) systems for translation into Croatian and Serbian, two closely related South Slavic languages. We explore English and German as source languages, different sizes and types of training corpora, as well as bilingual and multilingual systems. We also explore translation of English IMDb user movie reviews, a domain/genre where only monolingual data are available. First, our results confirm that multilingual systems with joint target languages perform better. Furthermore, translation performance from English is much better than from German, partly because German is morphologically more complex and partly because the corpus consists mostly of parallel human translations instead of original text and its human translation. The translation from German should be further investigated systematically. For translating user reviews, creating synthetic in-domain parallel data through back- and forward-translation and adding them to a small out-of-domain parallel corpus can yield performance comparable with a system trained on a full out-of-domain corpus. However, it is still not clear what the optimal size of synthetic in-domain data is, especially for forward-translated data where the target language is machine translated. More detailed research including manual evaluation and analysis is needed in this direction.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Refereed: Yes
Subjects: Computer Science > Machine learning; Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in: Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects. International Committee on Computational Linguistics (ICCL).
Publisher: International Committee on Computational Linguistics (ICCL)
Official URL: https://aclanthology.org/2020.vardial-1.10
Copyright Information: © 2021 The Authors.
Funders: Science Foundation Ireland through the SFI Research Centres Programme under Grant 13/RC/2106, European Regional Development Fund (ERDF), European Association for Machine Translation (EAMT) under its programme “2019 Sponsorship of Activities”.
ID Code: 28355
Deposited On: 23 May 2023 11:29 by Maja Popovic. Last Modified 23 May 2023 11:29
Documents

Full text available as: PDF (2020.vardial-1.10.pdf, 277kB)
Creative Commons: Attribution 4.0