Barman, Utsab, Das, Amitava ORCID: 0000-0003-3418-463X, Wagner, Joachim ORCID: 0000-0002-8290-3849 and Foster, Jennifer ORCID: 0000-0002-7789-4853 (2014) Code mixing: a challenge for language identification in the language of social media. In: First Workshop on Computational Approaches to Code Switching, 25 Oct 2014, Doha, Qatar.
Abstract
In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
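To illustrate what a simple dictionary-based, word-level approach of the kind mentioned in the abstract might look like, here is a minimal Python sketch. It is not the authors' implementation: the lexicons, their counts, the label names and the functions `label_token` and `label_sentence` are all hypothetical and invented for illustration. Each token is assigned to the language whose word list contains it with the highest count, and tokens found in no list fall back to a default label.

```python
from collections import Counter

# Toy lexicons with made-up frequency counts (assumption: illustration only,
# not the resources used in the paper). Bengali and Hindi are romanised.
LEXICONS = {
    "en": Counter({"the": 50, "is": 40, "good": 10, "to": 30}),
    "bn": Counter({"ami": 20, "bhalo": 15, "to": 5}),
    "hi": Counter({"mujhe": 20, "accha": 15, "hai": 30}),
}


def label_token(token, lexicons=LEXICONS, default="unk"):
    """Return the language whose lexicon has the highest count for the token."""
    word = token.lower()
    # Counter returns 0 for missing keys, so only keep languages that know the word.
    counts = {lang: lex[word] for lang, lex in lexicons.items() if lex[word] > 0}
    if not counts:
        return default  # word not found in any lexicon
    return max(counts, key=counts.get)


def label_sentence(sentence):
    """Label each whitespace-separated token independently of its neighbours."""
    return [(tok, label_token(tok)) for tok in sentence.split()]


if __name__ == "__main__":
    print(label_sentence("ami good hai"))
```

Because every token is looked up in isolation, a word such as "to" that appears in more than one toy lexicon always receives the same label regardless of its neighbours. This is the kind of limitation that the contextual and sequence labelling approaches named in the abstract are meant to address.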
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Workshop |
Refereed: | Yes |
Uncontrolled Keywords: | code switching; language identification; natural language processing; social media |
Subjects: | Computer Science > Artificial intelligence; Computer Science > Computational linguistics; Computer Science > Machine learning |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > Centre for Next Generation Localisation (CNGL) |
Published in: | Proceedings of the First Workshop on Computational Approaches to Code Switching. Association for Computational Linguistics (ACL). |
Publisher: | Association for Computational Linguistics (ACL) |
Official URL: | http://dx.doi.org/10.3115/v1/W14-3902 |
Copyright Information: | © 2014 ACL. CC-BY-4.0 |
Funders: | Science Foundation Ireland (Grant 12/CE/I2267) |
ID Code: | 25186 |
Deposited On: | 18 Nov 2020 13:27 by Jennifer Foster. Last Modified 18 Nov 2020 13:27 |
Documents
Full text available as:
PDF (194kB), Creative Commons: Attribution 3.0