DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

FaDA: fast document aligner using word embedding

Lohar, Pintu (ORCID: 0000-0002-5328-1585), Ganguly, Debasis (ORCID: 0000-0003-0050-7138), Afli, Haithem (ORCID: 0000-0002-7449-4707), Way, Andy (ORCID: 0000-0001-5736-5930) and Jones, Gareth J.F. (ORCID: 0000-0003-2923-8365) (2016) FaDA: fast document aligner using word embedding. Prague Bulletin of Mathematical Linguistics (106). pp. 169-179. ISSN 1804-0462

Abstract
FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel cross-lingual information retrieval (CLIR)-based document-alignment algorithm that combines the distances between embedded word vectors with the word overlap between the source-language and target-language documents. In this approach, we first construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors and compute the average similarity between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that a standard Statistical Machine Translation (SMT)-based approach is outperformed by our CLIR-based approach in finding the correct alignment pairs. Subsequent experiments with the word vector-based method show further improvements in system performance.
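The scoring idea in the abstract can be illustrated with a minimal sketch. This is not FaDA's actual implementation: it assumes the pseudo-query tokens are already expressed in the target-language vocabulary, uses cosine similarity between averaged word vectors as the vector-based measure, and combines the two scores by linear interpolation with a hypothetical weight `alpha`. The function names and the `embeddings` dictionary are likewise illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is empty or all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_vector(tokens: List[str],
                   embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all tokens that have an embedding."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return []
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def term_overlap(query: List[str], doc: List[str]) -> float:
    """Fraction of distinct query terms that also occur in the document."""
    q, d = set(query), set(doc)
    return len(q & d) / len(q) if q else 0.0


def combined_score(query: List[str], doc: List[str],
                   embeddings: Dict[str, List[float]],
                   alpha: float = 0.5) -> float:
    """Interpolate vector similarity and term overlap (alpha is an assumption)."""
    vec_sim = cosine(average_vector(query, embeddings),
                     average_vector(doc, embeddings))
    return alpha * vec_sim + (1.0 - alpha) * term_overlap(query, doc)


def best_alignment(query: List[str],
                   targets: Dict[str, List[str]],
                   embeddings: Dict[str, List[float]],
                   alpha: float = 0.5) -> str:
    """Return the name of the target document with the highest combined score."""
    return max(targets,
               key=lambda name: combined_score(query, targets[name],
                                               embeddings, alpha))
```

In a sketch like this, `alpha` controls how much the alignment relies on embedding similarity versus exact term matches; the abstract does not state how the two measures are weighted, so any value would need to be tuned on held-out data.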
Metadata
Item Type:Article (Published)
Refereed:Yes
Subjects:Computer Science > Machine translating
DCU Faculties and Centres:DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > ADAPT
Publisher:PBML
Official URL:http://dx.doi.org/10.1515/pralin-2016-0016.
Copyright Information:© 2016 De Gruyter Open. Distributed under CC BY-NC-ND
Use License:This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:Science Foundation Ireland in the ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University
ID Code:23310
Deposited On:17 May 2019 13:16 by Thomas Murtagh. Last Modified 05 May 2023 16:27
Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
185kB