Integrating C4.5 and K-Nearest Neighbor Imputation with Relief Feature Selection for Enhancing Breast Cancer Diagnosis

Authors

  • Aji Purwinarko, Information Technology Studies Program, Faculty of Mathematics and Natural Sciences, Universitas Negeri Semarang, Indonesia
  • Kholiq Budiman, Information Technology Studies Program, Faculty of Mathematics and Natural Sciences, Universitas Negeri Semarang, Indonesia
  • Arif Widiyatmoko, Science Education Studies Program, Faculty of Mathematics and Natural Sciences, Universitas Negeri Semarang, Indonesia
  • Fitri Arum Sasi, Biology Studies Program, Faculty of Mathematics and Natural Sciences, Universitas Negeri Semarang, Indonesia
  • Wahyu Hardyanto, Physics Studies Program, Faculty of Mathematics and Natural Sciences, Universitas Negeri Semarang, Indonesia

DOI:

https://doi.org/10.15294/sji.v12i1.21673

Keywords:

Breast cancer classification, C4.5, KNN imputation, Relief feature selection, Machine learning

Abstract

Purpose: Breast cancer remains a significant cause of mortality among women, which makes accurate diagnostic methods essential. Traditional classification models often lose accuracy when the data contain missing values and irrelevant features. This study improves breast cancer classification by combining the C4.5 algorithm with K-Nearest Neighbor (KNN) imputation and Relief feature selection, thereby improving data quality and classification performance.

Methods: The proposed method was evaluated on the Wisconsin Breast Cancer Database (WBCD). KNN imputation filled in missing values, and Relief selected the most relevant features. The C4.5 algorithm was trained using data splits of 70:30, 80:20, and 90:10, and its performance was measured with several metrics, chiefly accuracy, precision, recall, and F1-score.
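The two preprocessing steps named in the Methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names `knn_impute` and `relief_scores`, the value of `k`, the distance definition, the "keep features with positive weight" rule, and the toy data are all assumptions made for illustration, and the C4.5 training step is left out.

```python
import math


def knn_impute(rows, k=3):
    """Replace None entries with the mean of that feature over the k nearest rows."""
    def dist(a, b):
        # Compare only the features that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have feature j observed.
            donors = sorted(
                ((dist(row, other), other[j])
                 for t, other in enumerate(rows)
                 if t != i and other[j] is not None),
                key=lambda pair: pair[0],
            )
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no row has this feature observed
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def relief_scores(X, y):
    """Two-class Relief weights, visiting every instance once (no sampling)."""
    n, d = len(X), len(X[0])
    # Range of each feature, used to normalise differences to [0, 1].
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [X[t] for t in range(n) if t != i and y[t] == y[i]]
        misses = [X[t] for t in range(n) if y[t] != y[i]]
        hit = min(hits, key=lambda r: distance(X[i], r))     # nearest same-class row
        miss = min(misses, key=lambda r: distance(X[i], r))  # nearest other-class row
        for j in range(d):
            # Reward features that differ on the miss and agree on the hit.
            weights[j] += (diff(X[i], miss, j) - diff(X[i], hit, j)) / n
    return weights


# Toy data (hypothetical, not WBCD): feature 0 separates the classes, feature 1 is noise.
raw = [[1, 5], [2, 1], [1, None], [9, 4], [10, 2], [9, 5]]
labels = [0, 0, 0, 1, 1, 1]
X = knn_impute(raw, k=2)
scores = relief_scores(X, labels)
keep = [j for j, w in enumerate(scores) if w > 0]  # assumed selection rule
```

In this sketch the imputed table `X` and the retained column indices `keep` would then be passed to a decision tree learner. The tree itself is omitted here because the abstract does not say which C4.5 implementation or settings were used.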

Result: The proposed method achieved a highest classification accuracy of 98.57%, surpassing several existing models. It outperformed PSO-C4.5 (96.49%), EBL-RBFNN (98.40%), Gaussian Naïve Bayes (97.50%), and t-SNE (98.20%) by 2.08, 0.17, 1.07, and 0.37 percentage points, respectively. These results support its effectiveness in handling missing values and selecting relevant features.
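The reported margins are plain differences between accuracies, so they are percentage points and not relative improvements. A short check using only the figures quoted in the abstract (the variable names are illustrative):

```python
# Accuracies (%) as quoted in the abstract.
proposed = 98.57
baselines = {
    "PSO-C4.5": 96.49,
    "EBL-RBFNN": 98.40,
    "Gaussian Naive Bayes": 97.50,
    "t-SNE": 98.20,
}

# Absolute gain in percentage points, rounded to two decimals.
gains = {name: round(proposed - acc, 2) for name, acc in baselines.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```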

Novelty: Unlike prior studies that addressed missing values and feature selection separately, this research integrates both techniques, enhancing classification accuracy and computational efficiency. The findings suggest that this approach provides a reliable breast cancer diagnosis method. Future work could explore deep learning integration and validation on larger datasets to improve generalizability.

Published

29-05-2025

Article ID

21673

Section

Articles

How to Cite

Integrating C4.5 and K-Nearest Neighbor Imputation with Relief Feature Selection for Enhancing Breast Cancer Diagnosis. (2025). Scientific Journal of Informatics, 12(1), 107-118. https://doi.org/10.15294/sji.v12i1.21673