Cross-lingual topical relevance models

Ganguly, Debasis ORCID: 0000-0003-0603-4191, Leveling, Johannes ORCID: 0000-0003-0603-4191 and Jones, Gareth J.F. ORCID: 0000-0003-2923-8365 (2012) Cross-lingual topical relevance models. In: 24th International Conference on Computational Linguistics (COLING 2012), 8-15 Dec 2012, Mumbai, India.

Abstract

Cross-lingual relevance modelling (CLRLM) is a state-of-the-art technique for cross-lingual information retrieval (CLIR) which integrates query term disambiguation and expansion in a unified framework, directly estimating a model of relevant documents in the target language from a query in the source language. However, CLRLM requires a translation model, either on the document side if a parallel corpus is available, or on the query side if a bilingual dictionary is available. For low-resourced language pairs, large parallel corpora do not exist and the vocabulary coverage of dictionaries is small, so RLM-based CLIR fails to obtain satisfactory results. Despite the lack of parallel resources for the majority of language pairs, the availability of comparable corpora for many languages has grown considerably in recent years. Existing CLIR techniques, such as cross-lingual relevance models, cannot effectively utilise these comparable corpora, since they do not use information from documents in the source language. We overcome this limitation by using information from retrieved documents in the source language to improve the retrieval quality of the target language documents. More precisely, our model takes a two-step approach: it first retrieves documents in both the source language and the target language (using query translation), and then improves the retrieval quality of the target language documents by expanding the query with translations of words extracted from the top-ranked source language documents which are thematically related (i.e. share the same concept) to the words in the top-ranked target language documents. Our key hypothesis is that the query in the source language and its equivalent target language translation retrieve documents which share topics. The overlapping topics of these top-ranked documents in both languages are then used to improve the ranking of the target language documents. Since the model relies on the alignment of topics between language pairs, we call it the cross-lingual topical relevance model (CLTRLM). Experimental results show that CLTRLM significantly outperforms the standard CLRLM by up to 37% on English-Bengali CLIR, achieving a mean average precision (MAP) of up to 60.27% of the Bengali monolingual IR MAP.
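
The expansion step described in the abstract can be illustrated with a small sketch. This is not the CLTRLM estimation from the paper: it assumes a simple term-frequency overlap as a crude stand-in for topic alignment, and the function name `expand_query`, the parameter `k`, the toy dictionary, and the romanised tokens are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def expand_query(target_query, source_docs, target_docs, dictionary, k=2):
    """Toy query expansion; a stand-in for, NOT an implementation of, CLTRLM.

    source_docs / target_docs: top-ranked documents as lists of tokens.
    dictionary: hypothetical bilingual lexicon, source term -> target terms.
    """
    # Term frequencies over the top-ranked documents in each language.
    source_tf = Counter(t for doc in source_docs for t in doc)
    target_tf = Counter(t for doc in target_docs for t in doc)

    scores = Counter()
    for src_term, freq in source_tf.items():
        for tgt_term in dictionary.get(src_term, []):
            # Assumption: a translation counts as "thematically related" only
            # if it also occurs in the target-side feedback documents.
            if tgt_term in target_tf and tgt_term not in target_query:
                scores[tgt_term] += freq * target_tf[tgt_term]

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return list(target_query) + ranked[:k]


# Hypothetical feedback documents and lexicon (illustrative tokens only).
source_docs = [["bank", "river", "flood"], ["river", "flood", "rain"]]
target_docs = [["nodi", "bonya"], ["nodi", "brishti", "bonya", "nodi"]]
dictionary = {
    "river": ["nodi"],
    "flood": ["bonya"],
    "rain": ["brishti"],
    "bank": ["bank_fin", "tir"],
}

expanded = expand_query(["bonya"], source_docs, target_docs, dictionary)
print(expanded)  # ['bonya', 'nodi', 'brishti']
```

In this sketch the ambiguous source term "bank" contributes nothing, because neither of its candidate translations appears in the target-side documents. That is the intuition behind using both retrieved sets, although the paper itself relies on topic models rather than raw term overlap.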

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	Cross-lingual Information Retrieval; Relevance Model; Topic Model; Pseudo-Relevance Feedback; Latent Dirichlet Allocation
Subjects:	Computer Science > Machine translating; Computer Science > Information retrieval
DCU Faculties and Centres:	Research Initiatives and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in:	Proceedings of COLING 2012.
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	Science Foundation Ireland
ID Code:	20361
Deposited On:	13 Jan 2015 14:08 by Gareth Jones. Last Modified 25 Oct 2018 09:59

Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
708kB
