dos Santos, Vitor Gaboardi, Santos, Guto Leoni (ORCID: 0000-0002-0257-4214), Lynn, Theo (ORCID: 0000-0001-9284-7580) and Benatallah, Boualem (2024)
Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation.
In: International Conference on Advanced Information Systems Engineering (CAiSE 2024).
ISBN https://link.springer.com/chapter/10.1007/978-3-031-61057-8_31#citeas
Abstract
Social media platforms, such as Twitter, offer an accessible way for people to share information and perspectives on a wide range of topics. Such citizen discourse can be a valuable source of information and offer policymakers and researchers insights into public sentiment, needs, and suggestions, guiding more informed and responsive planning and policy decisions. In this paper, we propose a novel approach using Large Language Models (LLMs) for data augmentation and multi-class classification to extract domain-specific data from tweets and identify issues raised by citizens, thus providing policymakers and social science researchers with valuable data to formulate effective plans and policies for improving services. This approach involves initially collecting data from Twitter using specific keywords and manually labelling a subset of the acquired data. Then, we introduce a new data augmentation strategy employing an LLM that leverages the initial human-labelled data to enhance text diversity and address imbalances in the dataset. Finally, we use the manually labelled and augmented data to fine-tune different LLMs to classify texts across multiple topics. We test our approach on the identification of issues related to the cycling domain as a case study, detecting tweets across eleven categories associated with infrastructure, safety, and accidents. By fine-tuning BERT-based models and experimenting with zero- and few-shot prompts with GPT for tweet classification, we achieved an accuracy of up to 90.9%.
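The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the balancing rule (top up every class to the size of the largest one), the prompt wording, the category names, and the `generate` callable standing in for an LLM call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable

Example = tuple[str, str]  # (tweet text, category label)


def augmentation_plan(labels: list[str]) -> dict[str, int]:
    """Count how many synthetic examples each class needs to match the largest class.

    Assumption: balancing to the majority class; the paper may use another rule.
    """
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def build_prompt(category: str, examples: list[str], n_new: int) -> str:
    """Build a few-shot prompt from human-labelled tweets (hypothetical wording)."""
    shots = "\n".join(f"- {text}" for text in examples)
    return (
        f"Here are tweets labelled '{category}':\n{shots}\n"
        f"Write {n_new} new, varied tweets in the same category, one per line."
    )


def augment(dataset: list[Example], generate: Callable[[str], str]) -> list[Example]:
    """Return the dataset plus LLM-generated examples for under-represented classes.

    `generate` is a placeholder for any function that sends a prompt to an LLM
    and returns its text reply.
    """
    by_label: dict[str, list[str]] = {}
    for text, label in dataset:
        by_label.setdefault(label, []).append(text)

    plan = augmentation_plan([label for _, label in dataset])
    augmented = list(dataset)
    for label, n_new in plan.items():
        if n_new == 0:
            continue  # class is already as large as the biggest one
        reply = generate(build_prompt(label, by_label[label], n_new))
        new_texts = [line.strip() for line in reply.splitlines() if line.strip()]
        # Keep at most n_new generated tweets so the class is not over-filled.
        augmented.extend((text, label) for text in new_texts[:n_new])
    return augmented
```

Passing the LLM call in as `generate` keeps the sketch independent of any particular provider; the combined output of `augment` would then serve as training data for the fine-tuning step.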
Metadata
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Event Type: | Conference |
| Refereed: | Yes |
| Uncontrolled Keywords: | Tweet classification, BERT, GPT, LLM, Cycling |
| Subjects: | Computer Science > Computer networks; Computer Science > World Wide Web |
| DCU Faculties and Centres: | DCU Faculties and Schools > DCU Business School |
| Published in: | Advanced Information Systems Engineering. CAiSE 2024. Lecture Notes in Computer Science (LNCS) 14663. Springer, Cham. ISBN https://link.springer.com/chapter/10.1007/978-3-031-61057-8_31#citeas |
| Publisher: | Springer, Cham |
| Official URL: | https://link.springer.com/chapter/10.1007/978-3-03... |
| Copyright Information: | Authors |
| ID Code: | 32855 |
| Deposited On: | 02 Jul 2026 10:43 by Tam Nguyen. Last Modified 02 Jul 2026 10:43 |
Documents
Full text available as:
PDF (1MB), Creative Commons: Attribution 4.0