Ganguly, Debasis (ORCID: 0000-0003-0050-7138), Bandyopadhyay, Ayan, Mitra, Mandar and Jones, Gareth J.F. (ORCID: 0000-0003-2923-8365)
(2016)
Retrievability of code mixed microblogs.
In: 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 17-21 July 2016, Pisa, Italy.
ISBN 978-1-4503-4069-4
Mixing multiple languages within the same document, a phenomenon
called (linguistic) code mixing or code switching, is common
among multilingual users of social media. In the context of
information retrieval (IR), code mixing may affect retrieval effectiveness due to the mixing of different vocabularies with different
collection statistics within a single collection of documents. In
this paper, we investigate indexing and retrieval strategies for
a mixed collection of documents comprising code-mixed and
monolingual documents. In particular, we address three alternative modes of indexing, namely (a) a single index for the two
sub-collections; (b) a separate index for each sub-collection; and
(c) a clustered index with two individual sub-collection statistics
coupled with the overall one. We make use of the expected retrievability scores of the two classes of documents to empirically
show that indexing strategies (a) and (b) mostly retrieve the monolingual documents at top ranks with standard retrieval approaches.
Our experiments show that, by contrast, the clustered index (c) is
able to alleviate this problem by improving the retrievability of the
code-mixed documents.
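
The expected retrievability comparison described above can be illustrated with a minimal sketch. It assumes the common cumulative definition of retrievability, in which a document's score is the number of queries that return it within a rank cutoff; the abstract does not say which variant the paper uses. The function names, the toy rankings, and the document identifiers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def retrievability(rankings, cutoff):
    """Cumulative retrievability (assumed variant): r(d) is the number of
    queries for which document d appears in the top `cutoff` results."""
    scores = defaultdict(int)
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores


def expected_retrievability(scores, doc_class):
    """Mean r(d) per document class; never-retrieved documents count as 0."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for doc_id, cls in doc_class.items():
        totals[cls] += scores.get(doc_id, 0)
        counts[cls] += 1
    return {cls: totals[cls] / counts[cls] for cls in counts}


# Hypothetical ranked lists for three queries (illustrative data only).
rankings = {
    "q1": ["m1", "m2", "c1"],
    "q2": ["m1", "c1", "m2"],
    "q3": ["m2", "m1", "c2"],
}
# Hypothetical class labels for the two sub-collections.
doc_class = {
    "m1": "monolingual",
    "m2": "monolingual",
    "c1": "code-mixed",
    "c2": "code-mixed",
}

scores = retrievability(rankings, cutoff=2)
expected = expected_retrievability(scores, doc_class)
print(expected)
```

In this toy run the monolingual class has the higher mean score, which is the kind of imbalance the abstract reports for indexing strategies (a) and (b).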
SIGIR '16 Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval.
Association for Computing Machinery (ACM). ISBN 978-1-4503-4069-4