Popović, Maja ORCID: 0000-0001-8234-8745 and Poncelas, Alberto ORCID: 0000-0002-5089-1687 (2020) Extracting correctly aligned segments from unclean parallel data using character n-gram matching. In: Conference on Language Technologies and Digital Humanities 2020, 24-25 Sept 2020, Ljubljana, Slovenia (Online).
Abstract
Training Neural Machine Translation systems is a time- and resource-demanding task, especially when large amounts of parallel text are used. In addition, training is sensitive to unclean parallel data. In this work, we explore a data cleaning method based on character n-gram matching. The method is particularly convenient for closely related languages, since the n-gram matching scores can be calculated directly on the source and target parts of the training corpus. For more distant languages, a translation step is needed, and the MT output is then compared with the corresponding original part. We show that the proposed method not only reduces the size of the training corpus, but can also increase the system's performance.
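The filtering idea described in the abstract can be illustrated with a minimal sketch. It assumes a simple character n-gram F-score averaged over n-gram orders 1 to 6 and a fixed cut-off. The function names, the `max_n` and `beta` defaults, and the `threshold` value are all hypothetical choices made for illustration; this record does not specify the exact scoring or thresholds the authors used.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def ngram_f_score(hypothesis, reference, max_n=6, beta=1.0):
    """Average character n-gram F-score over orders 1..max_n, in [0, 1].

    Orders for which either segment is too short to have any n-gram
    are skipped (an assumption of this sketch).
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        # Clipped overlap: each n-gram counts at most as often as it
        # appears in both segments.
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            b2 = beta ** 2
            scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def filter_pairs(pairs, threshold=0.3):
    """Keep (source, target) pairs scoring at or above the hypothetical threshold."""
    return [(src, tgt) for src, tgt in pairs if ngram_f_score(src, tgt) >= threshold]


if __name__ == "__main__":
    corpus = [
        ("dobar dan", "dober dan"),   # closely related segments
        ("dobar dan", "xyz qqq"),     # misaligned segment
    ]
    for src, tgt in corpus:
        print(src, "|", tgt, "|", round(ngram_f_score(src, tgt), 3))
    print(filter_pairs(corpus))
```

For closely related languages, the score would be computed directly between source and target, as in the example above. For more distant languages, the same function could instead compare a machine translation of the source with the original target segment; the translation step itself is not shown here.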
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | Research Institutes and Centres > ADAPT |
Published in: | Proceedings of the Conference on Language Technologies and Digital Humanities 2020. SDJT – Slovensko društvo za jezikovne tehnologije. |
Publisher: | SDJT – Slovensko društvo za jezikovne tehnologije |
Official URL: | http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_Popovic-et-... |
Copyright Information: | © 2020 The Authors |
Funders: | Science Foundation Ireland and co-funded by the European Regional Development Fund (ERDF) through Grant 13/RC/2106, European Association for Machine Translation under its programme “2019 Sponsorship of Activities”. |
ID Code: | 25025 |
Deposited On: | 18 Sep 2020 14:23 by Maja Popovic. Last Modified 08 Apr 2021 13:41 |