DCU@FIRE-2014: fuzzy queries with rule-based normalization for mixed script information retrieval

Ganguly, Debasis; Pal, Santanu; Jones, Gareth J.F.

Ganguly, Debasis ORCID: 0000-0003-0050-7138, Pal, Santanu and Jones, Gareth J.F. ORCID: 0000-0002-4033-9135 (2014) DCU@FIRE-2014: fuzzy queries with rule-based normalization for mixed script information retrieval. In: Forum for Information Retrieval Evaluation (FIRE 2014) workshop, 5-7 Dec 2014, Bangalore, India.

Abstract

We describe the participation of Dublin City University (DCU) in the FIRE-2014 transliteration search task (TST). The TST involves ad-hoc search over a collection of Hindi film song lyrics. The Hindi content of each document is written in the native Devanagari script, transliterated into Roman script, or a combination of both. Queries may also be in mixed script. The task is challenging primarily because of the vocabulary mismatch that arises from multiple transliteration alternatives. We address this mismatch at both the indexing and retrieval stages. During indexing, we apply rule-based normalization to certain character sequences of the transliterated words so that multiple transliteration alternatives share a single representation in the index. During retrieval, we use prefix-matched fuzzy query terms to account for morphological variations of the transliterated words. The results show a significant improvement over a standard baseline query likelihood language modelling (LM) approach. We also train a statistical machine transliteration model to predict the transliteration of out-of-vocabulary words. Surprisingly, even with satisfactory transliteration accuracy, automatic transliteration of query terms degraded retrieval effectiveness.
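The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the rewrite rules in `RULES`, the function names, and the `prefix_len` and `max_edits` defaults are all hypothetical, chosen only to show the general idea. Normalization maps spelling variants to one index form, and fuzzy expansion is read here as "index terms that share a fixed-length prefix with the query term and lie within a small edit distance of it".

```python
import re

# Hypothetical rewrite rules (NOT taken from the paper). Each maps a character
# sequence that tends to vary across Roman transliterations to one form.
# Order matters: digraph rules run before repeated letters are collapsed.
RULES = [
    (re.compile(r"ee"), "i"),        # "deewana" -> "diwana"
    (re.compile(r"oo"), "u"),        # "phool"   -> "phul"
    (re.compile(r"w"), "v"),         # "diwana"  -> "divana"
    (re.compile(r"(.)\1+"), r"\1"),  # collapse repeats: "pyaar" -> "pyar"
]


def normalize(word):
    """Apply every rule in order to a lower-cased transliterated word."""
    word = word.lower()
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_expand(term, vocabulary, prefix_len=2, max_edits=2):
    """Return normalized index terms that share a prefix with the normalized
    query term and are within max_edits of it (both thresholds are assumed)."""
    term = normalize(term)
    prefix = term[:prefix_len]
    return sorted(v for v in vocabulary
                  if v.startswith(prefix) and edit_distance(term, v) <= max_edits)


# Toy index vocabulary, normalized at indexing time.
vocabulary = {normalize(w) for w in ["deewana", "diwani", "divane", "dil", "pyaar"]}
print(fuzzy_expand("diwana", vocabulary))
```

In this sketch the prefix constraint keeps the candidate set small before the more expensive edit-distance check, while the edit-distance bound admits morphological variants of the same stem.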

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Workshop
Refereed:	No
Uncontrolled Keywords:	Fuzzy Query; Rule-based Normalization; Statistical Machine Transliteration
Subjects:	Computer Science > Computational linguistics; Computer Science > Information retrieval
DCU Faculties and Centres:	Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in:	Proceedings of FIRE 2014.
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	Science Foundation Ireland
ID Code:	20383
Deposited On:	15 Jan 2015 15:00 by Gareth Jones. Last Modified: 25 Oct 2018 08:55

Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
112kB
