Parallel data extraction using word embeddings

Lohar, Pintu; Way, Andy

Lohar, Pintu ORCID: 0000-0002-5328-1585 and Way, Andy ORCID: 0000-0001-5736-5930 (2020) Parallel data extraction using word embeddings. In: NLPTA 2020 : International Conference on NLP Techniques and Applications, 28-29 Nov 2020, London, UK (Online).

Abstract

Building a robust machine translation (MT) system requires a sufficiently large parallel corpus as training data. In this paper, we propose to automatically extract parallel sentences from comparable corpora without using any MT system or any parallel corpus. Instead, we use cross-lingual information retrieval (CLIR), average word embeddings, text similarity and a bilingual dictionary, which saves a significant amount of time and effort because no MT system is involved in the process. We conduct experiments on two kinds of data: (i) formal texts from the news domain, and (ii) user-generated content (UGC) from hotel reviews. The automatically extracted sentence pairs are then added to the already available parallel training data, and extended translation models are built from the concatenated data sets. Finally, we compare the performance of the extended models against baseline models built from the available data alone. The experimental evaluation shows that the proposed approach improves the translation output for both the formal texts and UGC.
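
The abstract does not spell out the scoring step, so the following is only a minimal sketch of how average word embeddings, a bilingual dictionary and a text similarity measure could be combined to rank candidate sentence pairs. It assumes that source words are first mapped into the target language through the dictionary and that cosine similarity is the similarity measure; the function names, the toy French-English dictionary and the 3-dimensional vectors are all hypothetical and are not taken from the paper.

```python
import numpy as np

def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors of all tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)  # no known word: fall back to the zero vector
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0.0:
        return 0.0
    return float(np.dot(a, b) / norm)

def translate_tokens(tokens, dictionary):
    """Word-by-word lookup in a bilingual dictionary; unknown words are dropped."""
    return [dictionary[t] for t in tokens if t in dictionary]

def best_match(source_tokens, candidates, dictionary, embeddings, dim):
    """Return (score, candidate) for the most similar target-side candidate."""
    src_vec = average_embedding(
        translate_tokens(source_tokens, dictionary), embeddings, dim
    )
    scored = [
        (cosine_similarity(src_vec, average_embedding(c, embeddings, dim)), c)
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[0])

# Toy data (hypothetical): English embeddings and a French-to-English dictionary.
toy_embeddings = {
    "hotel": np.array([1.0, 0.0, 0.0]),
    "clean": np.array([0.0, 1.0, 0.0]),
    "news": np.array([0.0, 0.0, 1.0]),
}
toy_dictionary = {"hôtel": "hotel", "propre": "clean"}

score, match = best_match(
    ["hôtel", "propre"],
    [["hotel", "clean"], ["news"]],
    toy_dictionary,
    toy_embeddings,
    dim=3,
)
print(match, round(score, 3))
```

In a fuller pipeline the candidate list would presumably come from a CLIR step that narrows down the comparable corpus, and only pairs above a similarity threshold would be kept; both of those details are assumptions here.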

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	parallel data; user-generated content; word embeddings; text similarity; comparable corpora
Subjects:	Computer Science > Computational linguistics; Computer Science > Machine translating
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in:	Computer Science & Information Technology. 10(15). AIRCC Publishing Corporation.
Publisher:	AIRCC Publishing Corporation
Official URL:	http://dx.doi.org/10.5121/csit.2020.101521
Copyright Information:	© 2020 AIRCC Publishing Corporation CC-BY
Funders:	Science Foundation Ireland SFI Research Centres Programme (Grant 13/RC/2106).
ID Code:	25339
Deposited On:	12 Jan 2021 16:47 by Pintu Lohar. Last Modified 05 May 2023 16:33

Documents

Full text available as:

PDF (1MB)
