Building machine translation system for software product descriptions using domain-specific sub-corpora extraction
Lohar, Pintu (ORCID: 0000-0002-5328-1585), Popovic, Maja (ORCID: 0000-0001-8234-8745) and Habruseva, Tanya
(2022)
Building machine translation system for software product descriptions using domain-specific sub-corpora extraction.
In: 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), 12-16 Sept 2022, Orlando, FL, USA.
Building machine translation systems for a specific domain requires a sufficiently large, good-quality parallel corpus in that domain. However, this is a challenging task because parallel data is lacking in many domains, such as economics, science and technology, and sports. In this work, we build English-to-French translation systems for software product descriptions scraped from the LinkedIn website. Moreover, we developed the first parallel test data set of product descriptions. We conduct experiments by building a baseline translation system trained on general-domain data and then domain-adapted systems using sentence-embedding-based corpus filtering and domain-specific sub-corpora extraction. All the systems are tested on our newly developed data set. Our experimental evaluation reveals that the domain-adapted model based on our proposed approaches outperforms the baseline.
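The idea of domain-specific sub-corpora extraction can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`embed`, `cosine`, `extract_subcorpus`), the seed sentence, the example sentence pairs, and the threshold of 0.25 are all hypothetical, and a toy bag-of-words vector stands in for a real sentence-embedding model. The sketch keeps only those general-domain sentence pairs whose source side is sufficiently similar to at least one in-domain seed sentence.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_subcorpus(parallel_pairs, in_domain_seeds, threshold=0.25):
    """Keep (source, target) pairs whose source resembles an in-domain seed.

    The threshold is an assumed value chosen for this toy data only.
    """
    seed_vectors = [embed(seed) for seed in in_domain_seeds]
    selected = []
    for source, target in parallel_pairs:
        vector = embed(source)
        # Score each pair by its best match against any seed sentence.
        score = max((cosine(vector, s) for s in seed_vectors), default=0.0)
        if score >= threshold:
            selected.append((source, target, score))
    # Most in-domain pairs first.
    return sorted(selected, key=lambda item: item[2], reverse=True)


# Hypothetical general-domain English-French pairs.
general_corpus = [
    ("The software helps teams manage projects",
     "Le logiciel aide les équipes à gérer des projets"),
    ("The match ended in a draw",
     "Le match s'est terminé par un match nul"),
    ("This platform automates cloud deployment",
     "Cette plateforme automatise le déploiement cloud"),
]

# Hypothetical in-domain seed resembling a software product description.
seeds = ["A cloud platform that helps teams manage software projects"]

subcorpus = extract_subcorpus(general_corpus, seeds)
for source, target, score in subcorpus:
    print(f"{score:.2f}\t{source}\t{target}")
```

In this toy run the two software-related pairs are retained and the sports sentence is discarded. With a real sentence encoder, the same selection loop would operate on dense vectors rather than word counts, and the threshold would need to be tuned on held-out in-domain data.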
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track).
Association for Machine Translation in the Americas.
Publisher:
Association for Machine Translation in the Americas
LinkedIn, ADAPT Centre for Digital Content Technology, which is funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant No. 13/RC/2106).
ID Code:
28367
Deposited On:
25 May 2023 13:56 by Maja Popovic. Last Modified 25 May 2023 13:56