DORAS | DCU Research Repository

Code mixing: a challenge for language identification in the language of social media

Barman, Utsab, Das, Amitava (ORCID: 0000-0003-3418-463X), Wagner, Joachim (ORCID: 0000-0002-8290-3849) and Foster, Jennifer (ORCID: 0000-0002-7789-4853) (2014) Code mixing: a challenge for language identification in the language of social media. In: First Workshop on Computational Approaches to Code Switching, 25 Oct 2014, Doha, Qatar.

Abstract
In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
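As a rough illustration of the simplest technique named in the abstract, the sketch below labels each word by dictionary lookup alone. It is not the authors' implementation: the `LEXICONS` entries are hypothetical romanised toy words, and the function name `label_tokens`, the language codes, and the `unk` label are assumptions made for this example. A word found in exactly one lexicon receives that language; a word found in none or in several is left unresolved, which is the kind of case where contextual clues become necessary.

```python
from typing import Dict, List, Set

# Toy lexicons with hypothetical romanised entries (an assumption for
# illustration only; these are NOT the dictionaries used in the paper).
LEXICONS: Dict[str, Set[str]] = {
    "en": {"i", "am", "very", "good", "today"},
    "bn": {"ami", "bhalo", "achi", "khub"},
    "hi": {"main", "accha", "hoon", "bahut"},
}


def label_tokens(
    tokens: List[str],
    lexicons: Dict[str, Set[str]] = LEXICONS,
    unknown: str = "unk",
) -> List[str]:
    """Assign a language label to each token using dictionary lookup only."""
    labels = []
    for token in tokens:
        word = token.lower()
        # Collect every language whose lexicon contains the word.
        matches = [lang for lang, vocab in lexicons.items() if word in vocab]
        # Exactly one match is unambiguous; zero or several matches
        # cannot be resolved without looking at neighbouring words.
        labels.append(matches[0] if len(matches) == 1 else unknown)
    return labels


if __name__ == "__main__":
    print(label_tokens("ami today khub good".split()))
```

Because lookup ignores neighbouring words, any token shared between lexicons stays unresolved here. A supervised classifier or sequence labeller could use the labels of surrounding words to resolve such cases.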
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Refereed: Yes
Uncontrolled Keywords: code switching; language identification; natural language processing; social media
Subjects: Computer Science > Artificial intelligence
Computer Science > Computational linguistics
Computer Science > Machine learning
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > Centre for Next Generation Localisation (CNGL)
Published in: Proceedings of the First Workshop on Computational Approaches to Code Switching. Association for Computational Linguistics (ACL).
Publisher: Association for Computational Linguistics (ACL)
Official URL: http://dx.doi.org/10.3115/v1/W14-3902
Copyright Information: © 2014 ACL. CC-BY-4.0
Funders: Science Foundation Ireland (Grant 12/CE/I2267)
ID Code: 25186
Deposited On: 18 Nov 2020 13:27 by Jennifer Foster. Last Modified 18 Nov 2020 13:27
Documents

Full text available as:

W14-3902.pdf (PDF)
Creative Commons: Attribution 3.0
194kB