A comparison of sub-word indexing methods for information retrieval

Leveling, Johannes ORCID: 0000-0003-0603-4191 (2009) A comparison of sub-word indexing methods for information retrieval. In: LWA 2009 - Workshop-Woche: Lernen - Wissen - Adaptivität, 21-23 Sept 2009, Darmstadt, Germany.

Abstract

This paper compares different methods of sub-word indexing and their performance on the English and German domain-specific document collection of the Cross-Language Evaluation Forum (CLEF). Four major methods of indexing sub-words are investigated and compared to indexing stems: 1) sequences of vowels and consonants, 2) a dictionary-based approach for decompounding, 3) overlapping character n-grams, and 4) Knuth’s algorithm for hyphenation. The retrieval performance and the effects of sub-word extraction on search time, index size, and indexing time are reported for English and German retrieval experiments. The main results are as follows. For English, indexing sub-words does not outperform the baseline using standard retrieval on stemmed word forms (–8% mean average precision (MAP), –11% geometric MAP (GMAP), +1% relevant and retrieved documents (rel ret) for the best experiment). For German, with the exception of n-grams, all methods for indexing sub-words achieve a higher performance than the stemming baseline. The best-performing sub-word indexing methods are to use consonant-vowel-consonant sequences and index them together with word stems (+17% MAP, +37% GMAP, +14% rel ret compared to the baseline), or to index syllable-like sub-words obtained from the hyphenation algorithm together with stems (+9% MAP, +23% GMAP, +11% rel ret).
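As an illustration of method 3, overlapping character n-grams can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the function name `char_ngrams`, the default `n = 3`, the lowercasing step, and the choice to keep short words whole are all assumptions made here for the example.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a single word.

    Assumptions for this sketch (not taken from the paper):
    the word is lowercased, and a word no longer than n
    characters is returned unchanged as its only index term.
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    # Slide a window of width n over the word, one character at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Each n-gram would be indexed as a separate term.
print(char_ngrams("retrieval", 4))
```

Because the windows overlap, a word of length k yields k - n + 1 terms, which suggests why sub-word extraction can affect index size and indexing time, two of the costs the abstract says are reported.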

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Workshop
Refereed:	Yes
Uncontrolled Keywords:	mean average precision; MAP; cross language information retrieval
Subjects:	Computer Science > Information retrieval
DCU Faculties and Centres:	Research Institutes and Centres > Centre for Next Generation Localisation (CNGL)
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
ID Code:	16446
Deposited On:	25 Jul 2011 11:16 by Shane Harper. Last Modified 26 Oct 2018 11:24

Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
157kB

DORAS | DCU Research Repository
