Login (DCU Staff Only)
Login (DCU Staff Only)

DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

Advanced Search

Approximate sentence retrieval for scalable and efficient example-based machine translation

Ganguly, Debasis orcid logoORCID: 0000-0003-0050-7138, Leveling, Johannes orcid logoORCID: 0000-0003-0603-4191, Dandapat, Sandipan and Jones, Gareth J.F. orcid logoORCID: 0000-0003-2923-8365 (2012) Approximate sentence retrieval for scalable and efficient example-based machine translation. In: 24th International Conference on Computational Linguistics (COLING 2012), 8-15 Dec 2012, Mumbai, India.

Abstract
Approximate sentence matching (ASM) is an important technique for tasks in machine translation (MT) such as example-based MT (EBMT) which influences the translation time and the quality of translation output. We investigate different approaches to find similar sentences in an example base and evaluate their efficiency (runtime), effectiveness, and the resulting quality of translation output. A comparison of approaches demonstrates that i) a sequential computation of the edit distance between an input sentence and all sentences in the example base is not feasible, even when efficient algorithms to compute the edit distance are employed; ii) in-memory data structures such as tries and ternary search trees are more efficient in terms of runtime, but are not scalable for large example bases; iii) standard IR models which only cover material similarity (e.g. term overlap), do not perform well in finding the approximate matches, due to their lack of handling word order and word positions. We propose a new retrieval model derived from language modelling (LM), named LM-ASM, to include positional and ordinal similarities in the matching process, in addition to material similarity. Our IR based retrieval experiments involve reranking the top-ranked documents based on their true edit distance score. Experimental results show that i) IR based approaches result in about 100 times faster translation; ii) LM-ASM approximates edit distance better than standard LM by about 10%; and iii) surprisingly, LM-ASM even improves MT quality by 1:52% in comparison to sequential edit distance computation.
Metadata
Item Type:Conference or Workshop Item (Paper)
Event Type:Conference
Refereed:Yes
Uncontrolled Keywords:Example-Based Machine Translation; Information Retrieval; Approximate Sentence Matching; Edit Distance
Subjects:Computer Science > Machine translating
Computer Science > Information retrieval
DCU Faculties and Centres:Research Institutes and Centres > Centre for Next Generation Localisation (CNGL)
DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in: Proceedings of COLING 2012. .
Use License:This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. View License
Funders:Science Foundation Ireland
ID Code:20362
Deposited On:13 Jan 2015 14:16 by Gareth Jones . Last Modified 25 Oct 2018 09:57
Documents

Full text available as:

[thumbnail of JLcoling2012_final.pdf]
Preview
PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
192kB
Downloads

Downloads

Downloads per month over past year

Archive Staff Only: edit this record