DORAS | DCU Research Repository

A novel double pruning method for imbalanced data using information entropy and Roulette wheel selection for breast cancer diagnosis

Bacha, Soufiane, Ning, Huansheng, Belarbi, Mostefa, Sarwatt, Doreen Sebastian and Dhelim, Sahraoui (ORCID: 0000-0002-3620-1395) (2025) A novel double pruning method for imbalanced data using information entropy and Roulette wheel selection for breast cancer diagnosis. Knowledge-Based Systems, 330, p. 114403. ISSN 0950-7051

Abstract
Accurate illness diagnosis is vital for effective treatment and patient safety. Conventional machine learning models are built on the assumption of balanced medical data to perform cancer diagnoses. However, class imbalance remains a crucial challenge that adversely affects the classifier’s performance and reliability, while the existing ensemble solutions are still prone to noisy data and tend to overlook overlaps near decision boundaries. This paper proposes RE-SMOTEBoost, a double-pruning version of the basic ensemble SMOTEBoost method, designed to overcome these drawbacks. First, the proposed method focuses on generating synthetic samples in overlapping regions to better capture the decision boundary by employing roulette wheel selection. Second, it integrates an entropy filter to reduce noisy data and borderline cases, thereby improving the quality of the generated samples. Third, we propose a double regularization penalty to control the proximity of synthetic samples to the decision boundary and prevent the creation of new overlapping samples. These enhancements enable higher-quality oversampling samples, yielding a more balanced training dataset. Experimental findings demonstrated that the proposed method outperforms state-of-the-art methods, achieving a 3.22% improvement in accuracy and an 88.8% reduction in variance compared to the best-performing methods. Practically, the proposed model provides a robust solution for medical applications, handling data scarcity and imbalance arising from data collection difficulties and privacy constraints.
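The abstract names two building blocks, information entropy and roulette wheel selection, without giving their exact formulation. The sketch below illustrates only the generic textbook versions of both and is not the RE-SMOTEBoost algorithm itself. The function names, the use of neighbour-label entropy as a selection weight, and the example labels are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def neighborhood_entropy(labels):
    """Shannon entropy (in bits) of the class labels among a sample's neighbours.

    Assumption: a sample whose neighbours carry mixed labels (high entropy)
    is taken to lie in an overlapping region. The paper may define this differently.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def roulette_wheel_select(weights, k, rng):
    """Draw k indices, each with probability proportional to its non-negative weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    picks = []
    for _ in range(k):
        spin = rng.random() * total  # where the wheel stops
        cumulative = 0.0
        chosen = None
        for i, w in enumerate(weights):
            if w <= 0:
                continue  # zero-weight entries occupy no slice of the wheel
            cumulative += w
            chosen = i
            if spin < cumulative:
                break
        picks.append(chosen)
    return picks


# Hypothetical neighbour labels for three minority-class samples.
neighbour_labels = [
    [1, 1, 1, 1],  # surrounded by its own class: entropy 0
    [0, 1, 0, 1],  # evenly mixed neighbourhood: entropy 1
    [1, 1, 1, 0],  # mostly its own class: intermediate entropy
]
weights = [neighborhood_entropy(labels) for labels in neighbour_labels]
seeds = roulette_wheel_select(weights, k=5, rng=random.Random(0))
```

Under these assumptions, samples in mixed neighbourhoods are drawn more often as seeds for synthetic-sample generation, while a sample with a pure neighbourhood (zero entropy) is never drawn.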
Metadata
Item Type: Article (Published)
Refereed: Yes
Uncontrolled Keywords: Imbalanced data, Cancer data, Information entropy, Class overlapping
Subjects: Computer Science > Artificial intelligence; Computer Science > Information technology; Computer Science > Machine learning
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing; DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Publisher: Elsevier
Official URL: https://www.sciencedirect.com/science/article/pii/...
Copyright Information: Authors
ID Code: 32443
Deposited On: 23 Mar 2026 09:41 by Sahraoui Dhelim. Last Modified 23 Mar 2026 09:41
Documents

Full text available as:

A_Novel_Double_Pruning_method_for_Imbalanced_Data_using_Information_Entropy_and_Roulette_Wheel_Selection_for_Breast_Cancer_Diagnosis.pdf: PDF, archive staff only. This file is embargoed until 25 November 2027.
Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0
1MB