Simulation Study of Imbalanced Classification on High-Dimensional Gene Expression Data

Masithoh Yessi Rochayani(1), Umu Sa'adah(2), Ani Budi Astuti(3)


(1) Department of Statistics, Universitas Diponegoro, Indonesia
(2) Department of Mathematics, Universitas Brawijaya, Indonesia
(3) Department of Statistics, Universitas Brawijaya, Indonesia

Abstract

Purpose: Classifying gene expression data aids the study of disease, but it faces two obstacles: class imbalance and high dimensionality. The motivation of this study is to examine the effectiveness of applying undersampling before feature selection on high-dimensional data with imbalanced classes.

Methods: The Least Absolute Shrinkage and Selection Operator (Lasso) handles high-dimensional data modeling by selecting features, and random undersampling (RUS) addresses imbalanced classes. The Classification and Regression Trees (CART) algorithm is used to construct the classification model because it produces an interpretable model. Thirty simulated datasets with varying imbalance ratios are used to test the proposed approaches, Lasso-CART and RUS-Lasso-CART. The simulated data are generated from parameters estimated from real gene expression data.
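The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dataset is synthetic, and all parameter values (sample sizes, the Lasso penalty `C=0.5`, the tree depth) are assumptions chosen for the example.

```python
# Sketch of the RUS-Lasso-CART pipeline: undersample, select genes with
# L1-penalised logistic regression (Lasso), then fit an interpretable tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical high-dimensional, imbalanced data: 200 samples, 500 "genes",
# roughly 10% minority class.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

# Step 1: random undersampling (RUS) -- drop majority samples until balanced.
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
keep = rng.choice(majority, size=minority.size, replace=False)
idx = np.concatenate([minority, keep])
X_rus, y_rus = X[idx], y[idx]

# Step 2: Lasso feature selection -- keep genes with nonzero coefficients.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso.fit(X_rus, y_rus)
selected = np.flatnonzero(lasso.coef_.ravel() != 0)

# Step 3: fit a shallow CART model on the selected genes only.
cart = DecisionTreeClassifier(max_depth=3, random_state=0)
cart.fit(X_rus[:, selected], y_rus)
print(len(selected), "genes selected")
```

The Lasso-CART variant is the same sketch with Step 1 omitted, i.e. fitting on the original `X, y`.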

Results: The simulation study shows that when the minority class accounts for more than 25% of the observations, the Lasso-CART method is appropriate. Meanwhile, RUS-Lasso-CART is effective when the minority class contains at least 20 observations.
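The guidance above can be encoded as a simple decision rule. The thresholds (a 25% minority share; 20 minority observations) are taken from the stated results; the function name and the fallback message are our own illustrative choices.

```python
def recommend_method(n_minority: int, n_total: int) -> str:
    """Suggest a pipeline based on class-imbalance severity,
    following the simulation-study thresholds quoted above."""
    if n_minority / n_total > 0.25:
        return "Lasso-CART"  # mild imbalance: no resampling needed
    if n_minority >= 20:
        return "RUS-Lasso-CART"  # enough minority samples to undersample
    return "neither method recommended"

print(recommend_method(30, 100))  # 30% minority share -> "Lasso-CART"
print(recommend_method(20, 200))  # 10% share but 20 obs -> "RUS-Lasso-CART"
```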

Novelty: The novelty of this simulation study is using the RUS-Lasso-CART hybrid method to address the classification problem of high-dimensional gene expression data with imbalanced classes.

Keywords

High-Dimensional Data; Imbalanced Class; Undersampling; Feature Selection; Gene Expression Data







Scientific Journal of Informatics (SJI)
p-ISSN 2407-7658 | e-ISSN 2460-0040
Published By Department of Computer Science Universitas Negeri Semarang
Website: https://journal.unnes.ac.id/nju/index.php/sji
Email: [email protected]


This work is licensed under a Creative Commons Attribution 4.0 International License.