Prediction of Life Expectancy of Lung Cancer Patients Post Thoracic Surgery using K-Nearest Neighbors and Bat Algorithm

ABSTRACT


Introduction
Cancer is a major cause of death and a significant obstacle to increasing life expectancy in every country in the world. According to World Health Organization (WHO) estimates from 2019, cancer is the first or second leading cause of death before the age of 70 in many countries. One of the deadliest cancers is lung cancer, which accounts for 11.6% of all cancer diagnoses worldwide (GLOBOCAN, 2020). Treatment options for patients with lung cancer include thoracic surgery, radiotherapy, and chemotherapy (Duma et al., 2019). Thoracic surgery, used to treat diseases of the chest, is one of the most rapidly growing surgical specialties, both technically and technologically (Sihoe, 2022).
Early action to reduce mortality after thoracic surgery is possible, and one approach is to collect data about lung cancer patients after thoracic surgery. A dataset with many attributes does not necessarily yield accurate information, so feature selection (attribute selection) is needed. Feature selection is the process of selecting a minimal representative subset of features from the original feature set that meets a given measurement criterion. One algorithm used for feature selection is the bat algorithm, a metaheuristic algorithm inspired by the echolocation behaviour of bats (Chakri et al., 2018).
A machine learning method is then needed to classify the data after optimization with the bat algorithm. One commonly used method is K-Nearest Neighbors (KNN). KNN classifies an unknown sample by measuring its distance, or similarity, to the samples of a known data set (Pawlovsky, 2018). The principle of KNN is that the samples most similar to a query have a high probability of belonging to the same class. In general, the method first finds the k nearest neighbors of a query in the training data set and then assigns the query to the majority class among those neighbors (Zhang et al., 2018).
A disadvantage of KNN is that the number of neighbors k must be set in advance and that, because it is a distance-based learner, its results depend on which attributes or features are selected. It is also computationally costly, because every test sample requires its own distance calculations (Sugiarta et al., 2019).
The BA-KNN algorithm combines the binary bat algorithm with KNN (Sugiarta et al., 2019). The echolocation-inspired search of the bat algorithm performs feature selection on the thoracic surgery dataset, and KNN then classifies the data using the selected features. The resulting classification accuracy is used as the value of the fitness function.

Bat Algorithm
The bat algorithm (BA) is a metaheuristic algorithm inspired by the echolocation behaviour of bats (Chakri et al., 2018). Echolocation is the ability of bats to identify objects in their surroundings using ultrasonic sound. Bats use this ability to avoid obstacles and find food. BA imitates echolocation to form a new and improved metaheuristic algorithm (Sugiarta et al., 2019).
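As a minimal sketch of how a bat moves in the continuous algorithm, the code below applies the frequency, velocity, and position updates of the standard bat algorithm formulation. The text above does not give the update equations, so the function name `update_bat`, the frequency bounds `f_min` and `f_max`, and the omission of loudness and pulse rate are assumptions made for illustration.

```python
import random


def update_bat(position, velocity, best, f_min=0.0, f_max=2.0, rng=random):
    """Move one bat using the standard continuous BA update (illustrative).

    position, velocity, best: lists of floats of equal length.
    f_min, f_max: assumed frequency bounds, not taken from the paper.
    """
    beta = rng.random()                       # uniform draw in [0, 1)
    freq = f_min + (f_max - f_min) * beta     # frequency for this move
    # Velocity changes in proportion to the offset from the best solution.
    new_velocity = [v + (x - b) * freq
                    for v, x, b in zip(velocity, position, best)]
    # Position moves by the updated velocity.
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity
```

A bat that already sits on the best known solution keeps its velocity unchanged, because the offset term is zero. A full implementation would also use loudness and pulse emission rate to decide when to accept new solutions.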

Binary Bat Algorithm
The binary bat algorithm (BBA) is an extension of the bat algorithm. The original bat algorithm works well for continuous-valued problems, in which each bat has a continuous-valued position in the search space. For discrete and combinatorial problems, however, a modified version called the binary bat algorithm is recommended (Gupta et al., 2019). In BBA, artificial bats move around the search space using position and velocity vectors that are updated in a continuous state (Ma & Wang, 2018).
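One common way to turn a continuous velocity into a binary position is a sigmoid transfer function, sketched below. The text does not state which transfer function is used, so the choice of a sigmoid and the name `binarize` are assumptions. In a feature selection setting, each resulting bit would indicate whether the corresponding feature is kept (1) or dropped (0).

```python
import math
import random


def binarize(velocity, rng=random):
    """Map each continuous velocity component to a bit (illustrative).

    A sigmoid squashes the velocity into (0, 1). The bit is set to 1 when
    a uniform random draw falls below that value, so a large positive
    velocity makes a 1 very likely and a large negative velocity makes
    a 0 very likely.
    """
    bits = []
    for v in velocity:
        prob_one = 1.0 / (1.0 + math.exp(-v))
        bits.append(1 if rng.random() < prob_one else 0)
    return bits
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable.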

K-Nearest Neighbors Algorithm
K-Nearest Neighbors (KNN) is a classification method in data mining that classifies data based on labelled training data. KNN belongs to the supervised learning group: a new query is assigned to the category that is most common among its nearest neighbors (Dewi & Dwidasmara, 2020).
The algorithm is based on the shortest distances from the test sample to the training samples. The class that occurs most often among the nearest training samples is taken as the prediction for the test sample. The distance between neighbors is calculated using the Euclidean distance (Cahyanti et al., 2020).
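The procedure described above can be sketched in a few lines: compute the Euclidean distance from the query to every training sample, keep the k closest, and take a majority vote. The function names and the toy data in the usage note are hypothetical and are not taken from the thoracic surgery dataset.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest
    training samples."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For example, with three training points labelled 'survive' near the origin and two labelled 'die' near (5, 5), a query at (5.5, 5) with k = 3 has two 'die' neighbors and one 'survive' neighbor, so it is predicted as 'die'.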

BA-KNN Algorithm
BA-KNN is a combination of BBA and KNN. BBA can improve the accuracy of KNN by finding and selecting better attributes, and the features it selects are then used for KNN classification. In addition, because fewer features are used, the computation time required for classification is shorter (Sugiarta et al., 2019).
The basis of BA-KNN is to use the accuracy of KNN as the fitness function of BBA. BA-KNN is therefore a BBA in which the fitness of a candidate feature subset is the accuracy of KNN classification using the selected features. BA-KNN can be considered a variant of BBA tailored to feature selection (Gupta et al., 2019).
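A minimal sketch of such a fitness function is given below, assuming a candidate solution is a binary mask with one bit per feature and that fitness is the KNN accuracy on a held-out set. The helper names (`select`, `knn_predict`, `fitness`), the zero score for an empty mask, and the toy data in the usage note are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def select(row, mask):
    """Keep only the features whose mask bit is 1."""
    return [value for value, keep in zip(row, mask) if keep]


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training samples."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def fitness(mask, train_X, train_y, test_X, test_y, k=3):
    """KNN accuracy using only the features selected by `mask`."""
    if not any(mask):
        return 0.0  # assumed: an empty feature subset scores zero
    reduced_train = [select(row, mask) for row in train_X]
    hits = sum(
        knn_predict(reduced_train, train_y, select(row, mask), k) == label
        for row, label in zip(test_X, test_y)
    )
    return hits / len(test_y)
```

On a toy problem where the first feature separates the classes and the second is misleading, the mask [1, 0] scores higher than [0, 1], which is the signal the bat search would follow when choosing features.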

Method
The first stage of the research method is to prepare the research object and normalize its data so that the BA-KNN algorithm can be applied. Classification is then carried out using the BA-KNN algorithm. Finally, the accuracy of the KNN model before feature selection is compared with its accuracy after feature selection to determine which gives better results. The overall research method is shown in the system flowchart in figure 1.

Results and Discussion
This section is divided into two parts, results and discussion. The results are a description of the data and findings obtained using the methods and procedures described in the data collection method. The discussion is an explanation of the results that answer research questions more comprehensively.

Results
The application of the BA-KNN algorithm to predict the life expectancy of lung cancer patients after thoracic surgery involves five research stages: data collection, data processing, model evaluation, classification, and system implementation. The results of these stages are described below.

Data Collection
The data used in this study are secondary data obtained from the UCI (University of California, Irvine) Machine Learning Repository. The dataset, from the Wroclaw Thoracic Surgery Center, contains data from 2009 to 2014 on lung cancer patients who underwent thoracic surgery.
This dataset has 16 attributes for each patient, representing the patient's condition before and after thoracic surgery. The data types in the dataset are binary, numeric, and nominal. The dataset has two classes, surviving (survival) and dying within one year (die), with 400 samples in the survival class and 70 in the die class. Table 1 shows the 16 attributes used along with their descriptions and data types.

Data Normalization
Data normalization balances data values by mapping them into a fixed range. In this study, normalization is carried out using min-max normalization, following the steps below.

Finding the minimum and maximum values for each attribute
Normalizing the data requires the minimum and maximum values of each attribute. These values are shown in table 2.

Calculate the value of each attribute
The next step is to calculate the normalized value of each attribute using the min-max normalization equation.
The normalization results for the 4 selected attributes are shown in table 3.
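For reference, min-max normalization maps each value x of an attribute to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below applies this to a single attribute column. The function name, the handling of a constant column, and the sample values in the usage note are assumptions for illustration and are not taken from tables 2 or 3.

```python
def min_max_normalize(column):
    """Scale a list of numbers into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumed convention: a constant column maps to all zeros,
        # which avoids division by zero.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]
```

For example, `min_max_normalize([10, 20, 30])` maps the minimum 10 to 0.0, the midpoint 20 to 0.5, and the maximum 30 to 1.0.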

Splitting Data
The dataset is divided into training data and testing data. This stage uses k-fold cross-validation with k = 5, so that in each fold 0.2 (20%) of the data is used as testing data and 0.8 (80%) as training data.
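The split can be sketched as a generator of train and test index lists. With the 470 samples of this dataset (400 + 70) and k = 5, each fold holds 94 testing samples and 376 training samples. The function name is hypothetical, and the sketch splits the indices in order. Whether the study shuffled or stratified the data is not stated, so neither is done here.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k folds.

    Folds are contiguous blocks of indices. When n_samples is not
    divisible by k, the first folds receive one extra sample each.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Each sample appears in the testing set of exactly one fold and in the training set of the other four.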

Data Mining
The data mining stage consists of two processes. The first is classification of the Thoracic Surgery dataset using the KNN algorithm alone. The second is classification using the bat algorithm for feature selection and the KNN algorithm for classification. Three tests were carried out on the BA-KNN algorithm: a population size test, a convergence test, and a KNN comparison test.

Population test
The population size test aims to determine the best value of the population parameter of BA-KNN for the thoracic surgery dataset. The population parameter is the number of bats in the BA-KNN algorithm. Each bat searches for the best solution, so a larger population means more bats searching for a solution to the problem. The test was carried out by varying the population size over the values 2, 4, 6, 8, and 10. The accuracy and selected features for each population size are shown in table 4.

Table 4. Accuracy and selected features by population size
Population | Accuracy | Selected features
… | … | […, 'Forced_Expiration', 'Zubrod_scale', 'Cough', 'Weakness', 'Size_of_tumor', 'Asthmatic']
8 | 87.23% | ['Diagnosis', 'Forced_Capacity', 'Zubrod_scale', 'Pain', 'Dyspnoea', 'MI_6months', 'PAD']
10 | 87.23% | ['Zubrod_scale', 'Pain', 'Cough', 'Weakness', 'Size_of_tumor', 'diabetes', 'MI_6months']

Based on table 4, the population size test yields a highest accuracy of 87.23%, obtained with population sizes of 2, 6, 8, and 10, while reducing the sixteen features to seven. The seven features selected in one of the highest-accuracy runs were 'Zubrod_scale', 'Pain', 'Cough', 'Weakness', 'Size_of_tumor', 'diabetes', 'MI_6months'.

Convergence Test
The convergence test determines whether BA converges within BA-KNN. BA has converged when it has found a solution whose value no longer changes over several iterations (Sugiarta et al., 2019). In this test, the maximum number of iterations of the BA-KNN algorithm was varied over the values 2, 4, 6, 8, and 10. The accuracy and selected features for each maximum iteration value are shown in table 5.
Based on table 5, BA-KNN achieved 86.17% accuracy with a maximum of 2 iterations and 87.23% accuracy with maximums of 4, 6, 8, and 10 iterations. The highest accuracy, 87.23%, was therefore reached from 4 iterations onward, while reducing the sixteen features to seven. The seven features selected in one of the highest-accuracy runs were 'Forced_Capacity', 'Forced_Expiration', 'Pain', 'Cough', 'Size_of_tumor', 'diabetes', 'PAD'.
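The convergence condition described above, a solution value that stops changing over several iterations, can be checked with a small helper like the one below. The function name, the `patience` window, and the `tolerance` parameter are assumptions for illustration. The study itself varies the maximum iteration count and does not describe an automatic stopping rule.

```python
def has_converged(best_history, patience=3, tolerance=0.0):
    """Return True when the best value has stayed within `tolerance`
    over the last `patience` steps.

    best_history: best fitness recorded so far, one entry per iteration.
    """
    if len(best_history) < patience + 1:
        return False  # not enough iterations to judge
    recent = best_history[-(patience + 1):]
    return max(recent) - min(recent) <= tolerance
```

As an illustration only, treating the five reported accuracies (86.17, 87.23, 87.23, 87.23, 87.23) as a series, the check with `patience=3` returns True, because the last four values are identical.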

KNN Comparative Testing
The KNN comparison test varies the value of the parameter k in both KNN and BA-KNN. Its aim is to determine how much BA contributes to BA-KNN by comparing the execution time and accuracy of BA-KNN with those of KNN without bat algorithm feature selection. The parameter k takes the values 3, 5, 7, 9, and 11. The results for each value of k are shown in table 6.

Discussion
This study applies the bat algorithm as a feature selection method to improve accuracy in predicting patient life expectancy after thoracic surgery. The classification algorithm used is KNN. The dataset is the thoracic surgery dataset obtained from the UCI Machine Learning Repository. The KNN algorithm, the BA-KNN algorithm, and the results of previous studies were compared, with higher accuracy indicating a better model. The comparison shows that the BA-KNN model has the second-highest accuracy, after the J48 + Naïve Bayes algorithm, with an increase in accuracy of 1.92% over the research of Setyadi et al. (2020), 2.12% over the research of Prasetio & Susanti (2019), and 5.32% over the KNN classifier built in this study. The J48 + Naïve Bayes approach achieves the highest accuracy with the help of attribute selection in the WEKA application, which uses a ranker algorithm to order the 16 attributes. BA-KNN instead uses the bat algorithm for feature selection, whose drawback is that the search involves randomness, so the selected features are partly a matter of chance.
The advantage of this study is that applying BA-KNN, with the bat algorithm as the feature selection algorithm, increases accuracy in predicting the life expectancy of patients after thoracic surgery, so it can serve as a reference for further research. The study still has a drawback: using the bat algorithm for feature selection does not guarantee the optimal solution, because the results can vary even with the same parameters. Nevertheless, in this study the bat algorithm proved an effective way to optimize the KNN model in terms of both feature selection and execution time.

Conclusion
This study optimized KNN with the bat algorithm as feature selection to increase accuracy in predicting the life expectancy of patients after thoracic surgery, using the thoracic surgery dataset obtained from the UCI Machine Learning Repository. Across the three tests of the BA-KNN algorithm (population testing, convergence testing, and KNN comparison testing), the best accuracy was 87.23% with an execution time of 0.01209 seconds. This is an increase of 5.32% over the accuracy of KNN without bat algorithm optimization, which was 81.91% with an execution time of 0.37330 seconds.