DORAS
DCU Online Research Access Service
Using multiple subwords to improve English-Esperanto automated literary translation quality

Poncelas, Alberto ORCID: 0000-0002-5089-1687, Buts, Jan ORCID: 0000-0002-7657-804X, Hadley, James ORCID: 0000-0003-1950-2679 and Way, Andy ORCID: 0000-0001-5736-5930 (2020) Using multiple subwords to improve English-Esperanto automated literary translation quality. In: Workshop on Technologies for MT of Low Resource Languages (AACL-IJCNLP), 4 Dec 2020, Suzhou, China (Online).


Abstract

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages. When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration. In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto. As an additional contribution, we made available a set of English-Esperanto parallel data in the literary domain.
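
The idea described in the abstract, reusing the same parallel sentences under several Byte Pair Encoding segmentations, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy BPE learner, the function names (`learn_bpe`, `apply_bpe`, `augment`), the merge counts, and the made-up sentence pairs are all assumptions, and only the source side is re-segmented here purely to keep the example short.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (a common BPE convention, assumed here)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge operations from whitespace-tokenised sentences."""
    vocab = Counter()
    for sentence in sentences:
        for word in sentence.split():
            vocab[tuple(word) + (END,)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def apply_bpe(sentence, merges):
    """Segment a sentence by replaying the learned merges in order."""
    pieces = []
    for word in sentence.split():
        symbols = tuple(word) + (END,)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        pieces.extend(symbols)
    return " ".join(pieces)


def augment(pairs, merge_counts):
    """Return one copy of the parallel data per merge setting.

    Assumption for illustration: only the source side is re-segmented;
    the target side is passed through unchanged.
    """
    sources = [src for src, _ in pairs]
    augmented = []
    for n in merge_counts:
        merges = learn_bpe(sources, n)
        augmented.extend((apply_bpe(src, merges), tgt) for src, tgt in pairs)
    return augmented


# Made-up toy data, not taken from the released corpus.
toy_pairs = [
    ("the lower tower", "la pli malalta turo"),
    ("the newest tower", "la plej nova turo"),
]
augmented = augment(toy_pairs, merge_counts=[2, 5, 10])
for src, tgt in augmented:
    print(src, "|||", tgt)
```

Each sentence pair appears once per merge setting, so three settings triple the training set while the underlying sentences stay the same. In practice one would likely use an established BPE toolkit rather than this toy learner.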

Item Type: Conference or Workshop Item (Speech)
Event Type: Workshop
Refereed: Yes
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT
Published in: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT 2020). Association for Computational Linguistics.
Publisher: Association for Computational Linguistics
Official URL: https://www.aclweb.org/anthology/2020.loresmt-1.14
Copyright Information: © 2020 The Authors
Funders: SFI Research Centres Programme (Grant 13/RC/2106), Irish Research Council’s COALESCE scheme (COALESCE/2019/117)
ID Code: 25172
Deposited On: 04 Dec 2020 14:20 by Alberto Poncelas. Last Modified 25 Jun 2021 13:03
