Optimization of Classification Accuracy Using K-Means and Genetic Algorithm by Integrating C4.5 Algorithm for Diagnosis Breast Cancer Disease

Technological development has caused data to proliferate, and this data must be processed into valid information for daily needs. Data mining is a technique for converting data into useful information, and it has been widely used for prediction tasks in fields such as health and medical science. This study uses the Wisconsin Diagnostic Breast Cancer (WDBC) dataset from the UCI Machine Learning Repository. The dataset has 32 attributes and 569 samples. Its data are continuous and high-dimensional, so the C4.5 algorithm needs long computation time and extensive storage. This study aims to improve the accuracy of C4.5 by combining it with K-Means and the Genetic Algorithm. The accuracy of the C4.5 algorithm is compared before and after applying the combination of K-Means and the Genetic Algorithm for diagnosing breast cancer. The accuracy of C4.5 alone is 91.228%. After optimization with K-Means and the Genetic Algorithm, the accuracy is 94.824%, with an average of 22 selected features. Thus, applying K-Means and the Genetic Algorithm to the C4.5 algorithm improves the accuracy of breast cancer diagnosis by 3.596 percentage points.

Clustering is a process of grouping data into classes or groups so that objects in a group are more similar to each other than to objects in other groups (Karegowda et al., 2011). One clustering algorithm is K-Means. Classification involving high-dimensional data increases the computation time and storage needed at the data processing stage, which in turn affects classification accuracy. High-dimensional data are handled with a dimension reduction method commonly called feature selection (Talita, 2016).
Feature selection is part of data preprocessing and facilitates the processing of high-dimensional data (Wahyuni, 2016). Feature selection methods generally fall into three categories: wrapper, embedded, and filter methods. Wrapper methods require ample computation time, memory space, and an additional algorithm to produce the best subset (Talita, 2016). Filter methods select features quickly based on data characteristics, so they do not necessarily find the best subset. Embedded methods combine the wrapper and filter approaches to find the best combination of feature subsets (Boomert, Sun, & Bischl, 2020).
One wrapper algorithm that can optimize accuracy is the Genetic Algorithm (Zamani, Amaliah, & Munif, 2012). Genetic algorithms search for solutions to problems based on the principles of natural selection and genetics. In feature selection, the Genetic Algorithm serves as a randomized search algorithm for exploring large search spaces (Arifudin, 2012). Its purpose is to choose optimal weight values by maintaining a population of individuals with good fitness values that produce offspring and form a new population (Alalayah, Almasani, & Qaid, 2018).
This study proposes combining K-Means and the Genetic Algorithm to optimize the C4.5 classification algorithm for diagnosing breast cancer.

Methods
This study uses a combination of K-Means and the Genetic Algorithm for feature selection. K-Means is applied to handle the continuous data that commonly occur in classification problems. The Genetic Algorithm is applied to find the best features based on the best fitness value at the maximum generation. Together they are used to improve the classification accuracy of the C4.5 algorithm, which serves as the classifier. The study compares accuracy before and after applying K-Means and the Genetic Algorithm to C4.5. The flowchart of the proposed method is shown in Figure 1.

Data Preprocessing
This study uses a public dataset, Wisconsin Diagnostic Breast Cancer, available in the UCI Machine Learning Repository. The dataset describes ten features, each computed from an image. Three values are recorded per feature: mean, standard error, and worst. This gives 30 attributes overall, plus two further attributes, ID and class. The dataset description can be seen in Table 1.
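The attribute layout described above can be sketched in Python as follows. The ten feature names are those documented for WDBC; the column-name format (for example `radius_mean`) and the helper `wdbc_columns` are illustrative assumptions, not necessarily the naming used in this study.

```python
# The ten real-valued features computed from each image in WDBC.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# The three values recorded per feature: mean, standard error, worst.
STATISTICS = ["mean", "se", "worst"]


def wdbc_columns():
    """Return the 32 attribute names: ID, class, and 10 x 3 feature values.

    The naming scheme is hypothetical and used only for illustration.
    """
    feature_columns = [
        f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES
    ]
    return ["id", "class"] + feature_columns


columns = wdbc_columns()
print(len(columns))       # total number of attributes
print(len(columns) - 2)   # attributes available to the classifier
```

Only the 30 feature attributes are candidates for feature selection; ID and class are excluded from the inputs.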

K-Means
K-Means is a clustering algorithm that divides data into groups. It is an unsupervised learning method whose inputs are the data objects and the number of clusters k. The data are grouped based on the value of each cluster's center point (centroid), which represents that cluster.
K-Means groups the data based on common characteristics. The K-Means stages are as follows:
1. Determine the number of clusters k.
2. Determine the centroids randomly.
3. Calculate the distance between each data object and each centroid using Equation 1.

$$D(x_2, x_1) = \sqrt{\sum_{j=1}^{p} \left(x_{2j} - x_{1j}\right)^2} \qquad (1)$$

Information:
D(x_2, x_1) : Distance between the data object and the cluster center
p : The data dimension
x_1 : Position of the cluster center
x_2 : Position of the data object
4. Group the data based on their distance to the centroids.
5. Determine the new value of each centroid using Equation 2.
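The K-Means stages can be sketched in plain Python as below. This is a minimal illustration, not the implementation used in this study. It assumes that the distance in Equation 1 is the Euclidean distance and that Equation 2, which is not reproduced here, is the usual mean of the cluster members. The function name `kmeans`, the fixed seed, and the sample values are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k data objects at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Steps 3-4: assign each object to its nearest centroid
        # (math.dist is the Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5 (assumed form of Equation 2): centroid = mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster received no members.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the grouping is stable
        centroids = new_centroids
    return labels, centroids


# One continuous attribute, clustered into k = 2 groups (made-up values).
values = [(1.0,), (1.2,), (0.8,), (10.0,), (10.5,), (9.5,)]
labels, centroids = kmeans(values, k=2)
print(labels, centroids)
```

Replacing each continuous value with its cluster label is one way to turn a continuous attribute into a discrete one before classification.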

Genetic Algorithm
A genetic algorithm is an evolutionary method that solves problems through randomized search. Inspired by natural selection, it accumulates variation in one direction, which results in optimization (Ashari, Muslim, & Alamsyah, 2016). The stages of the genetic algorithm in feature selection are as follows:
1. Initialize the individuals, determining what type of data will be represented on each chromosome.
2. Evaluate the fitness value of each chromosome in the population.
3. Selection: select chromosomes as prospective parents based on their fitness values. Chromosomes with good fitness values are retained.
4. Recombination: combine the parent chromosomes to produce new chromosomes, with the aim of obtaining better fitness values than the previous chromosomes.
5. Mutation: change the value of genes on a chromosome.
6. Update: replace the old chromosomes with the new chromosomes based on their fitness values.
7. Stop iterating if the best fitness value or the maximum generation is reached. Otherwise, go back to step 2.
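The stages above can be sketched as a small genetic algorithm over bit masks, where a 1 means the feature is kept. The text does not specify the operators, so binary tournament selection, one-point crossover, bit-flip mutation, and elitism are assumptions made for this illustration, as are all parameter values. In a real run the fitness would presumably be the classification accuracy on the selected features; here a toy fitness (`toy_fitness`, hypothetical) stands in for it.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a feature mask (list of 0/1) that maximizes fitness."""
    rng = random.Random(seed)

    def tournament(population, scores):
        # Binary tournament: the fitter of two random chromosomes wins.
        a, b = rng.sample(range(len(population)), 2)
        return population[a] if scores[a] >= scores[b] else population[b]

    # Step 1: initialize random chromosomes (1 = feature selected).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    # Step 7: stop when the maximum generation is reached.
    for _ in range(generations):
        # Step 2: evaluate fitness.
        scores = [fitness(ch) for ch in population]
        children = [best[:]]  # elitism: carry the best chromosome forward
        while len(children) < pop_size:
            # Step 3: selection of prospective parents.
            p1 = tournament(population, scores)
            p2 = tournament(population, scores)
            # Step 4: recombination (one-point crossover).
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Step 5: mutation (flip each gene with a small probability).
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        # Step 6: update the population and the best chromosome found so far.
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)


# Toy fitness: number of positions that agree with a made-up target mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    return sum(1 for g, t in zip(mask, TARGET) if g == t)


mask, score = genetic_feature_selection(len(TARGET), toy_fitness)
print(mask, score)
```

Because the initial population and the operators are random, different seeds can return different masks, which is why feature-selection results of this kind are usually averaged over several runs.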

C4.5 Algorithm
The C4.5 algorithm is an improvement of the ID3 algorithm developed in 1986 by Ross Quinlan (Kathija, Nisha, & Sathik, 2017). In C4.5, attributes are selected using the gain ratio, which is computed from entropy values (Sunge, 2018). C4.5 builds a decision tree model, choosing as the root the attribute with the highest gain value among the existing attributes (Wibowo, Manongga, & Purnomo, 2020). The stages of the C4.5 algorithm in classifying the dataset are as follows:
1. Prepare training data that have been grouped into specific classes. The training data consist of 80% of the entire dataset.
2. Calculate the entropy value using Equation 3.
5. Repeat step 2 until all records are partitioned.
6. The decision tree partitioning stops when:
a. No attributes remain to be partitioned.
b. There are no records in an empty branch.
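The entropy and gain ratio that drive attribute selection can be sketched as follows. Equation 3 is not reproduced in this text, so the sketch uses the standard textbook definitions of entropy, information gain, and split information, which are assumed to match it. The function names and the four-record example are hypothetical, and the attribute is treated as discrete, for example a cluster label.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(attribute_values, labels):
    """Gain ratio of a discrete attribute with respect to the class labels."""
    total = len(labels)
    # Partition the class labels by attribute value.
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Information gain = entropy before the split - weighted entropy after it.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many small partitions.
    split_info = -sum(len(part) / total * math.log2(len(part) / total)
                      for part in partitions.values())
    return gain / split_info if split_info > 0 else 0.0


classes = ["M", "M", "B", "B"]          # made-up class labels
informative = [0, 0, 1, 1]              # separates the classes perfectly
uninformative = [0, 1, 0, 1]            # tells nothing about the class
print(entropy(classes))
print(gain_ratio(informative, classes))
print(gain_ratio(uninformative, classes))
```

The attribute with the highest gain ratio would be chosen for the split, so in this toy example the first attribute wins.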

Results and Discussion
This study uses the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, a public dataset from the UCI Machine Learning Repository. The dataset has 32 attributes: 1 ID, 1 class, and 30 feature attributes. The proposed algorithms were implemented and tested in the Python programming language using libraries available for Python. The results of this study are the classification accuracies obtained in diagnosing breast cancer.

Results
This study is divided into three experiments. The first tests the C4.5 classification algorithm alone. C4.5 processes the WDBC dataset with its 30 attributes based on the gain ratio and produces an accuracy expressed as a percentage. The accuracy results of the C4.5 classification can be seen in Table 2. At this stage, the C4.5 algorithm produces an accuracy of 91.228%. This result shows that C4.5 can classify the continuous WDBC data well, but that the result can still be improved using preprocessing algorithms.
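As a consistency check on this figure, the short calculation below shows how such an accuracy can arise. The Methods state that 80% of the data are used for training; that the remaining 20% form the test set, and that 104 test samples were classified correctly, are assumptions made here only because they reproduce the reported value. They are not figures stated in this study.

```python
TOTAL_SAMPLES = 569          # size of the WDBC dataset
TEST_FRACTION = 0.2          # assumed: the 20% not used for training

test_size = round(TOTAL_SAMPLES * TEST_FRACTION)
train_size = TOTAL_SAMPLES - test_size


def accuracy_percent(correct, total):
    """Accuracy as a percentage, rounded to three decimals."""
    return round(100 * correct / total, 3)


assumed_correct = 104        # hypothetical count, chosen to match the report
print(train_size, test_size)
print(accuracy_percent(assumed_correct, test_size))
```

With a test set of this size, each additional correct prediction changes the accuracy by a little under one percentage point, which is worth keeping in mind when comparing results.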
The second experiment combines C4.5 with K-Means to process the WDBC dataset. K-Means creates groups, or clusters, on the attributes that have continuous data. In the K-Means process, the chosen number of clusters k affects accuracy. The accuracy results of this combination can be seen in Table 3. The third experiment combines C4.5 with both K-Means and the Genetic Algorithm. Here, the Genetic Algorithm searches for the best features based on the best fitness values. This test was carried out ten times. The combination yields the best features to be selected and the resulting classification accuracy. The accuracy results of this combination can be seen in Table 4.

Discussion
The purpose of this study is to determine the process and the resulting accuracy of combining K-Means and the Genetic Algorithm with the C4.5 algorithm for diagnosing breast cancer. K-Means is applied to the attributes that have a continuous data type. The choice of k affects the accuracy. Experiments were conducted with k = 2, 3, 4, and 5; of these, k = 2 gave the best accuracy in optimizing the C4.5 algorithm.
The next step is selecting the best features with the Genetic Algorithm based on the best fitness values. The accuracy obtained from this combination is 94.824%, which means that the combination increases the accuracy of the C4.5 algorithm by 3.596 percentage points. The comparison of accuracy before and after applying the combination of K-Means and the Genetic Algorithm to C4.5 can be seen in Figure 2. In addition, the comparison of this study with previous research can be seen in Table 5. The advantage of this model is that applying the combination of K-Means and the Genetic Algorithm as feature selection for C4.5 classification increases the accuracy of breast cancer diagnosis. Its weakness is that the accuracy tends to vary between runs, because the Genetic Algorithm is initialized randomly, which affects the fitness values.
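The size of the improvement can be checked with a few lines of Python. The two accuracies are the values reported in this study; the relative improvement is an extra figure derived here for context and is not reported by the authors.

```python
baseline_accuracy = 91.228    # C4.5 alone, in percent
optimized_accuracy = 94.824   # C4.5 with K-Means and Genetic Algorithm, in percent

# Absolute difference, in percentage points.
absolute_gain = round(optimized_accuracy - baseline_accuracy, 3)

# Relative improvement over the baseline, in percent.
relative_gain = round(100 * (optimized_accuracy - baseline_accuracy)
                      / baseline_accuracy, 2)

print(absolute_gain)
print(relative_gain)
```

The absolute difference is the figure quoted in the text, while the relative figure expresses the same change as a share of the baseline accuracy.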

Conclusion
In this study, the C4.5 algorithm was combined with K-Means and the Genetic Algorithm as feature selection for diagnosis on the WDBC dataset obtained from the UCI Machine Learning Repository. K-Means was implemented to handle the continuous data in the attributes of the WDBC dataset, while the Genetic Algorithm was applied to choose the best features based on the best fitness value. The accuracy of the C4.5 algorithm alone is 91.228%. Preprocessing with K-Means, where the best number of clusters is k = 2, combined with the Genetic Algorithm as feature selection, improves this result to an accuracy of 94.824%. Thus, it can be concluded that applying K-Means and the Genetic Algorithm to the C4.5 algorithm improves the accuracy of breast cancer diagnosis by 3.596 percentage points.