DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

MultiMWE: building a multi-lingual multi-word expression (MWE) parallel corpora

Han, Lifeng (ORCID: 0000-0002-3221-2185), Jones, Gareth J.F. (ORCID: 0000-0003-2923-8365) and Smeaton, Alan F. (ORCID: 0000-0003-1028-8389) (2020) MultiMWE: building a multi-lingual multi-word expression (MWE) parallel corpora. In: 12th International Conference on Language Resources and Evaluation (LREC), 11-16 May, 2020, Marseille, France. (Virtual).

Multi-word expressions (MWEs) are an active research topic in natural language processing (NLP), covering MWE detection, MWE decomposition, and the exploitation of MWEs in other NLP fields such as machine translation (MT). However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. After filtering, our collections contain 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performance on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use these free corpora for their own models or in a knowledge base as model features.
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Uncontrolled Keywords: Machine Translation; Natural Language Processing; Multi-word Expression; Parallel Corpora; Chinese-English; German-English
Subjects: Computer Science > Artificial intelligence
Computer Science > Computational linguistics
Computer Science > Information technology
Computer Science > Machine translating
Computer Science > Software engineering
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > INSIGHT Centre for Data Analytics
Research Institutes and Centres > ADAPT
Published in: Proceedings of the 12th Language Resources and Evaluation Conference (LREC2020). European Language Resources Association (ELRA).
Publisher: European Language Resources Association (ELRA)
Official URL: https://www.aclweb.org/anthology/2020.lrec-1.363
Copyright Information: © 2020 The Authors. Creative Commons Attribution CC-BY-NC 4.0
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland Research Centres Programme (Grant 13/RC/2106), co-funded under the European Regional Development Fund; Science Foundation Ireland under grant number SFI/12/RC/2289 (Insight Centre); European Regional Development Fund
ID Code: 24502
Deposited On: 29 May 2020 11:33 by Lifeng Han. Last Modified: 20 Sep 2021 13:08

Full text available as:

PDF: MultiMWE- Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora.pdf

