Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation

dos Santos, Vitor Gaboardi; Santos, Guto Leoni; Lynn, Theo; Benatallah, Boualem

dos Santos, Vitor Gaboardi, Santos, Guto Leoni ORCID: 0000-0002-0257-4214, Lynn, Theo ORCID: 0000-0001-9284-7580 and Benatallah, Boualem (2024) Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation. In: International Conference on Advanced Information Systems Engineering (CAiSE 2024). ISBN https://link.springer.com/chapter/10.1007/978-3-031-61057-8_31#citeas

Abstract

Social media platforms, such as Twitter, offer an accessible way for people to share information and perspectives on a wide range of topics. Such citizen discourse can be a valuable source of information, offering policymakers and researchers insights into public sentiment, needs, and suggestions, and guiding more informed and responsive planning and policy decisions. In this paper, we propose a novel approach using Large Language Models (LLMs) for data augmentation and multi-class classification to extract domain-specific data from tweets and identify issues raised by citizens, thus providing policymakers and social science researchers with valuable data to formulate effective plans and policies for improving services. This approach involves initially collecting data from Twitter using specific keywords and manually labelling a subset of the acquired data. Then, we introduce a new data augmentation strategy employing an LLM that leverages the initial human-labelled data to enhance text diversity and address imbalances in the dataset. Finally, we use the manually labelled and augmented data to fine-tune different LLMs to classify texts across multiple topics. We test our approach on a case study in the cycling domain, detecting tweets across eleven categories associated with infrastructure, safety, and accidents. Through fine-tuning BERT-based models and experimenting with zero- and few-shot prompts with GPT for tweet classification, we achieved an accuracy of up to 90.9%.
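
The abstract does not spell out how the augmentation step is implemented, so the following is only a minimal sketch of one way class imbalance could be addressed with generated text. It assumes a simple "top up every class to the size of the largest class" policy, which is not confirmed by the paper. The names `augmentation_plan`, `augment`, and `placeholder_generate` are hypothetical, as are the category labels; `placeholder_generate` stands in for a real LLM call and only tags the seed text.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (tweet text, category label)


def augmentation_plan(labels: List[str]) -> Dict[str, int]:
    """Return how many synthetic examples each class needs to match the largest class.

    Assumption: balancing to the majority class size; the paper may use another policy.
    """
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def augment(dataset: List[Example], generate: Callable[[str, str], str]) -> List[Example]:
    """Extend the dataset with generated variants of human-labelled seed texts."""
    plan = augmentation_plan([label for _, label in dataset])

    # Group the human-labelled texts by class so they can serve as seeds.
    seeds_by_label: Dict[str, List[str]] = {}
    for text, label in dataset:
        seeds_by_label.setdefault(label, []).append(text)

    augmented = list(dataset)
    for label, needed in plan.items():
        seeds = seeds_by_label[label]
        for i in range(needed):
            seed = seeds[i % len(seeds)]  # cycle through the available seeds
            augmented.append((generate(seed, label), label))
    return augmented


def placeholder_generate(seed_text: str, label: str) -> str:
    """Hypothetical stand-in for an LLM call that would rewrite the seed text."""
    return f"[synthetic:{label}] {seed_text}"


if __name__ == "__main__":
    # Hypothetical labelled tweets; the labels are illustrative, not the paper's categories.
    labelled = [
        ("The bike lane on the main road is blocked again", "infrastructure"),
        ("No cycle parking near the station", "infrastructure"),
        ("Potholes all along the cycle path", "infrastructure"),
        ("A car passed me far too close this morning", "safety"),
    ]
    balanced = augment(labelled, placeholder_generate)
    print(Counter(label for _, label in balanced))
```

In practice `placeholder_generate` would be replaced by a prompt to an LLM that uses the human-labelled seed to produce a more diverse variant, and the balanced output would then feed the fine-tuning stage.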

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	Tweet classification, BERT, GPT, LLM, Cycling
Subjects:	Computer Science > Computer networks; Computer Science > World Wide Web
DCU Faculties and Centres:	DCU Faculties and Schools > DCU Business School
Published in:	Advanced Information Systems Engineering. CAiSE 2024. Lecture Notes in Computer Science (LNCS) 14663. Springer, Cham. ISBN https://link.springer.com/chapter/10.1007/978-3-031-61057-8_31#citeas
Publisher:	Springer, Cham
Official URL:	https://link.springer.com/chapter/10.1007/978-3-03...
Copyright Information:	Authors
ID Code:	32855
Deposited On:	02 Jul 2026 10:43 by Tam Nguyen. Last Modified 02 Jul 2026 10:43

Documents

Full text available as:

Identifying Citizen-Related Issues From Social.pdf (PDF, 1MB)
Creative Commons: Attribution 4.0

DORAS | DCU Research Repository
