Cross-lingual topical relevance models
Ganguly, Debasis and Leveling, Johannes and Jones, Gareth J.F. (2012) Cross-lingual topical relevance models. In: 24th International Conference on Computational Linguistics (COLING 2012), 8-15 Dec 2012, Mumbai, India.
Full text available as:
Cross-lingual relevance modelling (CLRLM) is a state-of-the-art technique for cross-lingual information retrieval (CLIR) which integrates query term disambiguation and expansion in a unified framework, to directly estimate a model of relevant documents in the target language starting with a query in the source language. However, CLRLM involves integrating a translation model
either on the document side if a parallel corpus is available, or on the query side if a bilingual dictionary is available. For low resourced language pairs, large parallel corpora do not exist and the vocabulary coverage of dictionaries is small, as a result of which RLM-based CLIR fails to obtain satisfactory results. Despite the lack of parallel resources for a majority of language pairs,
the availability of comparable corpora for many languages has grown considerably in the recent years. Existing CLIR techniques such as cross-lingual relevance models, cannot effectively utilise these comparable corpora, since they do not use information from documents in the source language.
We overcome this limitation by using information from retrieved documents in the source language to improve the retrieval quality of the target language documents. More precisely speaking, our model involves a two step approach of first retrieving documents both in the source language and the target language (using query translation), and then improving on the retrieval quality of target language documents by expanding the query with translations of words extracted from the top ranked documents retrieved in the source language which are thematically related (i.e. share the same concept) to the words in the top ranked target language documents. Our key hypothesis is that the query in the source language and its equivalent target language translation retrieve documents which share topics. The ovelapping topics of these top ranked documents in both languages are then used to improve the ranking of the target language documents. Since the model relies on the alignment of topics between language pairs, we call it the cross-lingual topical relevance model (CLTRLM). Experimental results show that the CLTRLM significantly outperforms the standard CLRLM by upto 37% on English-Bengali CLIR, achieving mean average precision (MAP) of up to 60.27% of the Bengali monolingual IR MAP.
Archive Staff Only: edit this record