Preservation of domain knowledge from source to target is crucial in any translation
workflow. It is common in the translation industry to receive highly specialized projects,
where there is hardly any parallel in-domain data. In such scenarios, where there is insufficient
in-domain data to fine-tune Machine Translation (MT) models, producing translations that
are consistent with the relevant context is challenging. In this work, we propose a novel
approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs)
for domain-specific data augmentation for MT, simulating the domain characteristics of
either (a) a small bilingual dataset, or (b) the monolingual source text to be translated.
Combining this idea with back-translation, we can generate large amounts of synthetic
bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art
Transformer architecture. We employ mixed fine-tuning to train models that significantly
improve the translation of in-domain texts. More specifically, in both scenarios, our proposed
methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU on the Arabic-to-English
and English-to-Arabic language pairs, respectively. Furthermore, the outcome of
human evaluation corroborates the automatic evaluation results.
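
As a rough illustration of the pipeline summarized above, the sketch below shows how a pretrained LM can be prompted with in-domain seed sentences to generate synthetic in-domain target-side text, which is then back-translated to yield synthetic bilingual data. This is a minimal sketch assuming the Hugging Face transformers library; the model names (gpt2, Helsinki-NLP/opus-mt-en-ar), sampling parameters, and seed sentence are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch (not the paper's exact setup): LM-based in-domain data
    # augmentation followed by back-translation, using Hugging Face pipelines.
    from transformers import pipeline

    # Step 1: prompt a pretrained LM with in-domain seed sentences so that its
    # sampled continuations mimic the domain characteristics of the seed data.
    generator = pipeline("text-generation", model="gpt2")  # illustrative LM choice

    def augment(seed_sentences, samples_per_seed=5):
        """Generate synthetic in-domain English sentences from seed prompts."""
        synthetic = []
        for seed in seed_sentences:
            outputs = generator(
                seed,
                max_new_tokens=40,
                num_return_sequences=samples_per_seed,
                do_sample=True,
                top_p=0.9,
            )
            # Each output contains the seed plus its sampled continuation.
            synthetic.extend(o["generated_text"] for o in outputs)
        return synthetic

    # Step 2: back-translate the synthetic English text into Arabic, yielding
    # synthetic (Arabic source, English target) pairs for Arabic-to-English MT.
    back_translation = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")

    def back_translate(english_sentences):
        arabic = back_translation(english_sentences)
        return [(ar["translation_text"], en)
                for en, ar in zip(english_sentences, arabic)]

    # Hypothetical in-domain seed; in practice this comes from the small
    # bilingual dataset or the monolingual source text to be translated.
    seeds = ["The patient should take one tablet twice daily"]
    pairs = back_translate(augment(seeds))
    # These synthetic pairs can then be combined with generic parallel data
    # for mixed fine-tuning of the Transformer MT model.
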
Published in: Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas.
Publisher: Association for Machine Translation in the Americas
Funders: Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (d-real) under Grant No. 18/CRT/6224, Science Foundation Ireland (SFI) Research Centres Programme (Grant No. 13/RC/2106), European Regional Development Fund, and Microsoft Research.