Ganguly, Debasis ORCID: 0000-0003-0050-7138, Leveling, Johannes ORCID: 0000-0003-0603-4191, Dandapat, Sandipan and Jones, Gareth J.F. ORCID: 0000-0003-2923-8365 (2012) Approximate sentence retrieval for scalable and efficient example-based machine translation. In: 24th International Conference on Computational Linguistics (COLING 2012), 8-15 Dec 2012, Mumbai, India.
Abstract
Approximate sentence matching (ASM) is an important technique for tasks in machine translation (MT) such as example-based MT (EBMT) which influences the translation time and the quality of translation output. We investigate different approaches to find similar sentences in an example base and evaluate their efficiency (runtime), effectiveness, and the resulting quality of translation output. A comparison of approaches demonstrates that i) a sequential computation of the edit distance between an input sentence and all sentences in the example base is not feasible, even when efficient algorithms to compute the edit distance are employed; ii) in-memory data structures
such as tries and ternary search trees are more efficient in terms of runtime, but are not scalable for large example bases; iii) standard IR models which only cover material similarity (e.g. term overlap), do not perform well in finding the approximate matches, due to their lack of handling word order and word positions. We propose a new retrieval model derived from language modelling (LM), named LM-ASM, to include positional and ordinal similarities in the matching process, in addition to material similarity. Our IR based retrieval experiments involve reranking the top-ranked documents based on their true edit distance score. Experimental results show that i) IR based approaches result in about 100 times faster translation; ii) LM-ASM approximates edit distance better than standard LM by about 10%; and iii) surprisingly, LM-ASM even improves MT quality by 1:52% in comparison to sequential edit distance computation.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Uncontrolled Keywords: | Example-Based Machine Translation; Information Retrieval; Approximate Sentence Matching; Edit Distance |
Subjects: | Computer Science > Machine translating Computer Science > Information retrieval |
DCU Faculties and Centres: | Research Institutes and Centres > Centre for Next Generation Localisation (CNGL) DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Published in: | Proceedings of COLING 2012. . |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. View License |
Funders: | Science Foundation Ireland |
ID Code: | 20362 |
Deposited On: | 13 Jan 2015 14:16 by Gareth Jones . Last Modified 25 Oct 2018 09:57 |
Documents
Full text available as:
Preview |
PDF
- Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
192kB |
Downloads
Downloads
Downloads per month over past year
Archive Staff Only: edit this record