Ganguly, Debasis ORCID: 0000-0003-0050-7138, Pal, Santanu and Jones, Gareth J.F. ORCID: 0000-0002-4033-9135 (2014) DCU@FIRE-2014: fuzzy queries with rule-based normalization for mixed script information retrieval. In: Forum for Information Retrieval Evaluation (FIRE 2014) workshop, 5-7 Dec 2014, Bangalore, India.
Abstract
We describe the participation of Dublin City University (DCU) in the FIRE-2014 transliteration search task (TST). The TST involves an ad-hoc search over a collection of Hindi film song lyrics. The Hindi language content of each document in the collection is written in the native Devanagari script, transliterated into Roman script, or a combination of both. The queries can be in mixed script as well. The task is challenging primarily because of the vocabulary mismatch that arises from multiple transliteration alternatives. We attempt to address the vocabulary mismatch problem during both the indexing and retrieval stages. During indexing, we apply rule-based normalization to some character sequences of the transliterated words so that the multiple transliteration alternatives share a single representation in the index. During retrieval, we use prefix-matched fuzzy query terms to account for the morphological variations of the transliterated words. The results show significant improvement over a standard baseline query likelihood language modelling (LM) approach. We also apply statistical machine transliteration to train a transliteration model that predicts the transliteration of out-of-vocabulary words. Surprisingly, even with satisfactory transliteration accuracy, we found that automatic transliteration of query terms degraded retrieval effectiveness.
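The two ideas in the abstract, normalizing transliteration variants at indexing time and matching query terms fuzzily with a fixed prefix at retrieval time, can be illustrated with a minimal Python sketch. The normalization rules, the function names (`normalize`, `edit_distance`, `fuzzy_prefix_match`), and the parameters `prefix_len` and `max_edits` are assumptions made for illustration; they are not the rules or settings used in the paper.

```python
import re

def normalize(word):
    """Map transliteration variants to one canonical form.

    The rules below are hypothetical examples, not the paper's rule set.
    """
    w = word.lower()
    w = w.replace("ee", "i").replace("oo", "u")  # long-vowel digraphs
    w = re.sub(r"(.)\1+", r"\1", w)              # collapse repeated letters
    w = w.replace("w", "v")                      # treat w and v as the same sound
    return w

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_prefix_match(query_term, vocabulary, prefix_len=2, max_edits=1):
    """Return index terms that share a prefix with the normalized query term
    and lie within max_edits edits of it (both thresholds are assumptions)."""
    q = normalize(query_term)
    return [t for t in vocabulary
            if t[:prefix_len] == q[:prefix_len]
            and edit_distance(q, t) <= max_edits]

# Index-time vocabulary built from normalized document terms.
vocabulary = [normalize(w) for w in ["pyaar", "deewana", "deewani", "dil"]]
print(fuzzy_prefix_match("diwana", vocabulary))
```

In this sketch the spellings "pyaar" and "pyar" collapse to the same index term, while the fuzzy match lets the query "diwana" also reach the morphological variant "deewani".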
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Workshop |
Refereed: | No |
Uncontrolled Keywords: | Fuzzy Query; Rule-based Normalization; Statistical Machine Transliteration |
Subjects: | Computer Science > Computational linguistics; Computer Science > Information retrieval |
DCU Faculties and Centres: | Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Published in: | Proceedings of FIRE 2014. |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland |
ID Code: | 20383 |
Deposited On: | 15 Jan 2015 15:00 by Gareth Jones. Last Modified 25 Oct 2018 08:55 |
Documents
Full text available as: PDF (112kB)