Knowledge Discovery from Confusion Matrix of Pruned CART in Imbalanced Microarray Data Ovarian Cancer Classification

Ni Kadek Emik Sapitri(1), Umu Sa’adah(2), Nur Shofianah(3)


(1) Department of Mathematics, Universitas Brawijaya, Indonesia
(2) Department of Mathematics, Universitas Brawijaya, Indonesia
(3) Department of Mathematics, Universitas Brawijaya, Indonesia

Abstract

Purpose: The results of microarray data analysis are important in cancer diagnosis, especially for cancers that are asymptomatic in their early stages, such as ovarian cancer. One challenge in analyzing microarray data is class imbalance. Research on cancer classification from microarray data often ignores this challenge and therefore does not use appropriate evaluation metrics, which biases the results towards the majority class. This study uses a popular evaluation metric, accuracy, and a metric suited to imbalanced data, balanced accuracy (BA), to extract information from the confusion matrix about the accuracy and BA values in ovarian cancer classification.

Methods: This study uses Classification and Regression Tree (CART) as the classifier. CART is optimized by pruning. The optimal CART is determined from the results of the CART complexity analysis and the confusion matrix.
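
The pruning step can be illustrated with a minimal sketch, assuming cost-complexity pruning (the abstract does not name the pruning procedure that was used). Each candidate subtree T is scored by R(T) + alpha * (number of leaves), and the subtree with the lowest score is kept. The subtree names, leaf counts, error rates, and alpha values below are hypothetical and are not results from this study.

```python
from typing import NamedTuple


class Subtree(NamedTuple):
    name: str      # hypothetical label for a candidate pruned tree
    leaves: int    # number of terminal nodes (tree complexity)
    error: float   # misclassification rate R(T) of the subtree


def cost_complexity(tree: Subtree, alpha: float) -> float:
    """Cost-complexity score: error plus a penalty per leaf."""
    return tree.error + alpha * tree.leaves


def select_subtree(candidates: list[Subtree], alpha: float) -> Subtree:
    """Return the candidate with the lowest score; ties go to the smaller tree."""
    return min(candidates, key=lambda t: (cost_complexity(t, alpha), t.leaves))


# Hypothetical nested sequence of subtrees, from the full tree down to the root.
candidates = [
    Subtree("full", 12, 0.02),
    Subtree("medium", 5, 0.06),
    Subtree("stump", 2, 0.11),
    Subtree("root", 1, 0.13),
]

for alpha in (0.0, 0.01, 0.05):
    best = select_subtree(candidates, alpha)
    print(f"alpha={alpha}: keep '{best.name}' with {best.leaves} leaves")
```

A larger alpha favors smaller trees. In this sketch the heaviest penalty keeps only the root node, which assigns every sample to one class, so a complexity analysis alone would not reveal that the minority class is being ignored.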

Results: The confusion matrix and CART interpretations in this research show that a CART with low complexity is still able to predict majority-class respondents well. However, even when none of the minority-class data were classified correctly, accuracy remained quite high, at 86.97% and 88.03% in the training and testing stages respectively, while BA was only 50% in both stages.
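
The gap between the two metrics follows directly from their definitions, as the short sketch below shows. It uses hypothetical confusion-matrix counts (87 majority-class and 13 minority-class samples), not the counts from this study. Treating the minority class as the positive class is also an assumption made for illustration.

```python
def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Share of all samples that are classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def balanced_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Mean of the per-class recalls (sensitivity and specificity)."""
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return (sensitivity + specificity) / 2


# Hypothetical case: the classifier predicts the majority class for every sample,
# so all 13 minority samples are missed and all 87 majority samples are correct.
tp, fn, fp, tn = 0, 13, 0, 87

acc = accuracy(tp, fn, fp, tn)
ba = balanced_accuracy(tp, fn, fp, tn)
print(f"accuracy = {acc:.2%}, balanced accuracy = {ba:.2%}")
```

Accuracy here simply equals the majority-class proportion, whereas BA drops to 50% because the recall on the minority class is zero.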

Novelty: It is very important to ensure that the evaluation metrics used match the characteristics of the data being processed. This research illustrates the difference between accuracy and BA. It concludes that classification of an imbalanced dataset without resampling can use BA as the evaluation metric because, based on the results, BA is fairer to both classes.

Keywords

Imbalanced microarray data; Ovarian cancer; Confusion matrix; Evaluation metrics; CART

Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji

This work is licensed under a Creative Commons Attribution 4.0 International License.