Lohar, Pintu ORCID: 0000-0002-5328-1585, Popović, Maja ORCID: 0000-0001-8234-8745 and Habruseva, Tanya (2022) Building machine translation system for software product descriptions using domain-specific sub-corpora extraction. In: 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), 12-16 Sept 2022, Orlando, FL, USA.
Abstract
Building Machine Translation systems for a specific domain requires a sufficiently large, good-quality parallel corpus in that domain. However, this is a challenging task due to the lack of parallel data in many domains such as economics, science and technology, and sports. In this work, we build English-to-French translation systems for software product descriptions scraped from the LinkedIn website. Moreover, we developed the first parallel test data set of product descriptions. We conduct experiments by building a baseline translation system trained on general-domain data and then domain-adapted systems using sentence-embedding-based corpus filtering and domain-specific sub-corpora extraction. All the systems are tested on our newly developed test set. Our experimental evaluation reveals that the domain-adapted model based on our proposed approaches outperforms the baseline.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Machine learning; Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT |
Published in: | Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track). Association for Machine Translation in the Americas. |
Publisher: | Association for Machine Translation in the Americas |
Official URL: | https://aclanthology.org/2022.amta-research.1 |
Copyright Information: | © 2022 Association for Machine Translation in the Americas |
Funders: | LinkedIn; ADAPT Centre for Digital Content Technology, which is funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant No. 13/RC/2106). |
ID Code: | 28367 |
Deposited On: | 25 May 2023 13:56 by Maja Popovic. Last Modified 29 May 2023 13:02 |
Documents
Full text available as:
PDF (781kB), Creative Commons: Attribution-Noncommercial 4.0