Login (DCU Staff Only)
Login (DCU Staff Only)

DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

Advanced Search

A case study in decompounding for Bengali information retrieval

Ganguly, Debasis orcid logoORCID: 0000-0003-0050-7138, Leveling, Johannes orcid logoORCID: 0000-0003-0603-4191 and Jones, Gareth J.F. orcid logoORCID: 0000-0003-2923-8365 (2013) A case study in decompounding for Bengali information retrieval. In: CLEF 2013 - Conference and Labs, 23-26 Sept 2013, Valencia, Spain.

Abstract
Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. No previous studies, however, exist on the effect of decomposition of compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. Some unique characteristics of Bengali compounding are: i) only one constituent may be a valid word in contrast to the stricter requirement of both being so; and ii) the first character of the right constituent can be modified by the rules of sandhi in contrast to simple concatenation. While the standard approach of decompounding based on maximization of the total frequency of the constituents formed by candidate split positions has proven beneficial for European languages, our reported experiments in this paper show that such a standard approach does not work particularly well for Bengali IR. As a solution, we firstly propose a more relaxed decompounding where a compound word can be decomposed into only one constituent if the other constituent is not a valid word, and secondly we perform selective decompounding by employing a co-occurrence threshold to ensure that the constituent often co-occurs with the compound word, which in this case is representative of how related are the constituents with the compound. We perform experiments on Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection proves more effective than the standard frequency-based decomposition. improving MAP up to 2:72% and recall up to 1:8%.
Metadata
Item Type:Conference or Workshop Item (Paper)
Event Type:Conference
Refereed:Yes
Subjects:Computer Science > Information retrieval
DCU Faculties and Centres:Research Institutes and Centres > Centre for Next Generation Localisation (CNGL)
DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in: Proceedings of CLEF 2013. .
Use License:This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. View License
Funders:Science Foundation Ireland
ID Code:20374
Deposited On:14 Jan 2015 11:33 by Gareth Jones . Last Modified 25 Oct 2018 09:43
Documents

Full text available as:

[thumbnail of bncompound.pdf]
Preview
PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
178kB
Downloads

Downloads

Downloads per month over past year

Archive Staff Only: edit this record