Ganguly, Debasis ORCID: 0000-0003-0050-7138, Leveling, Johannes ORCID: 0000-0003-0603-4191 and Jones, Gareth J.F. ORCID: 0000-0003-2923-8365 (2013) A case study in decompounding for Bengali information retrieval. In: CLEF 2013 - Conference and Labs, 23-26 Sept 2013, Valencia, Spain.
Abstract
Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. No previous studies, however, exist on the effect of decomposition of compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. Some unique characteristics of Bengali compounding are: i) only one constituent may be a valid word, in contrast to the stricter requirement that both be valid; and ii) the first character of the right constituent can be modified by the rules of sandhi, in contrast to simple concatenation. While the standard approach of decompounding, based on maximizing the total frequency of the constituents formed by candidate split positions, has proven beneficial for European languages, the experiments reported in this paper show that it does not work particularly well for Bengali IR. As a solution, we first propose a more relaxed decompounding in which a compound word can be decomposed into only one constituent if the other constituent is not a valid word. Second, we perform selective decompounding by employing a co-occurrence threshold to ensure that the constituent often co-occurs with the compound word, which here serves as a measure of how closely the constituent is related to the compound. We perform experiments on Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition, improving MAP by up to 2.72% and recall by up to 1.8%.
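The three strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_len` and `threshold` parameters, the exact co-occurrence measure (the fraction of documents containing the compound that also contain the constituent), and the toy English data are all assumptions made for illustration. Sandhi handling is omitted, and the sketch returns constituents only, without deciding whether the compound itself is also kept in the index.

```python
from collections import Counter


def split_candidates(word, min_len=2):
    """Yield every (left, right) split with both parts at least min_len long."""
    for i in range(min_len, len(word) - min_len + 1):
        yield word[:i], word[i:]


def standard_decompound(word, freq):
    """Strict variant: both parts must be known words; maximize summed frequency."""
    best, best_score = None, 0
    for left, right in split_candidates(word):
        if left in freq and right in freq:
            score = freq[left] + freq[right]
            if score > best_score:
                best, best_score = [left, right], score
    return best or [word]


def relaxed_decompound(word, freq):
    """Relaxed variant (assumed form): one known constituent is enough."""
    best, best_score = None, 0
    for left, right in split_candidates(word):
        parts = [p for p in (left, right) if p in freq]
        score = sum(freq[p] for p in parts)
        if parts and score > best_score:
            best, best_score = parts, score
    return best or [word]


def cooccurrence(constituent, compound, docs):
    """Assumed measure: share of compound-bearing documents that also hold the constituent."""
    with_compound = [d for d in docs if compound in d]
    if not with_compound:
        return 0.0
    return sum(constituent in d for d in with_compound) / len(with_compound)


def selective_decompound(word, freq, docs, threshold=0.5):
    """Keep only constituents whose co-occurrence with the compound meets the threshold."""
    parts = relaxed_decompound(word, freq)
    if parts == [word]:
        return [word]
    kept = [p for p in parts if cooccurrence(p, word, docs) >= threshold]
    return kept or [word]


# Toy collection (hypothetical): each document is a set of tokens.
docs = [
    {"rainwater", "rain", "tank"},
    {"rainwater", "rain"},
    {"water", "river"},
    {"rain", "cloud"},
]
freq = Counter(tok for d in docs for tok in d)

print(standard_decompound("rainwater", freq))         # ['rain', 'water']
print(standard_decompound("rainish", freq))           # ['rainish'] (no valid two-word split)
print(relaxed_decompound("rainish", freq))            # ['rain']
print(selective_decompound("rainwater", freq, docs))  # ['rain']
```

In the toy data, "water" never appears in a document that contains "rainwater", so the selective variant drops it while keeping "rain". A lower `threshold` would make the selection more permissive.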
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Information retrieval |
DCU Faculties and Centres: | Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Published in: | Proceedings of CLEF 2013. |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland |
ID Code: | 20374 |
Deposited On: | 14 Jan 2015 11:33 by Gareth Jones. Last Modified 25 Oct 2018 09:43 |