Code mixing: a challenge for language identification in the language of
social media

Barman, Utsab; Das, Amitava; Wagner, Joachim; Foster, Jennifer

Barman, Utsab, Das, Amitava ORCID: 0000-0003-3418-463X, Wagner, Joachim ORCID: 0000-0002-8290-3849 and Foster, Jennifer ORCID: 0000-0002-7789-4853 (2014) Code mixing: a challenge for language identification in the language of social media. In: First Workshop on Computational Approaches to Code Switching, 25 Oct 2014, Doha, Qatar.

Abstract

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
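As a rough illustration of the word-level task described above, the sketch below labels each word by dictionary lookup and then resolves unknown words from their neighbours' labels. It is a minimal sketch under stated assumptions, not the authors' method: the tiny romanised lexicons, the function names (`label_word`, `label_post`, `label_with_context`) and the neighbour-majority rule are all hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter

# Hypothetical toy lexicons (romanised); NOT taken from the paper's dataset.
LEXICONS = {
    "en": {"the", "is", "very", "good", "movie", "and"},
    "bn": {"ami", "tumi", "bhalo", "khub", "kintu"},
    "hi": {"main", "tum", "accha", "bahut", "hai", "nahi"},
}


def label_word(word):
    """Return the single language whose lexicon contains the word, else 'unk'."""
    token = word.lower()
    matches = [lang for lang, vocab in LEXICONS.items() if token in vocab]
    # Words found in no lexicon, or in more than one, are left unresolved.
    return matches[0] if len(matches) == 1 else "unk"


def label_post(text):
    """Label every whitespace-separated word independently (no context)."""
    return [(word, label_word(word)) for word in text.split()]


def label_with_context(text, window=1):
    """Resolve 'unk' words using the majority label of nearby known words.

    This neighbour-majority rule is an illustrative assumption standing in
    for the contextual clues mentioned in the abstract.
    """
    base = label_post(text)
    resolved = []
    for i, (word, label) in enumerate(base):
        if label == "unk":
            lo = max(0, i - window)
            hi = min(len(base), i + window + 1)
            neighbours = [
                base[j][1]
                for j in range(lo, hi)
                if j != i and base[j][1] != "unk"
            ]
            if neighbours:
                label = Counter(neighbours).most_common(1)[0][0]
        resolved.append((word, label))
    return resolved


if __name__ == "__main__":
    print(label_post("khub bhalo movie"))
    print(label_with_context("ami cinema bhalo"))
```

In this toy setting, a word missing from every lexicon stays `unk` under plain lookup but can inherit a label from adjacent words, which hints at why context helps on code-mixed text.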

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Workshop
Refereed:	Yes
Uncontrolled Keywords:	code switching; language identification; natural language processing; social media
Subjects:	Computer Science > Artificial intelligence; Computer Science > Computational linguistics; Computer Science > Machine learning
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > Centre for Next Generation Localisation (CNGL)
Published in:	Proceedings of the First Workshop on Computational Approaches to Code Switching. Association for Computational Linguistics (ACL).
Publisher:	Association for Computational Linguistics (ACL)
Official URL:	http://dx.doi.org/10.3115/v1/W14-3902
Copyright Information:	© 2014 ACL. CC-BY-4.0
Funders:	Science Foundation Ireland (Grant 12/CE/I2267)
ID Code:	25186
Deposited On:	18 Nov 2020 13:27 by Jennifer Foster. Last Modified 18 Nov 2020 13:27

Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution 3.0
194kB

DORAS | DCU Research Repository