Optimization of the C4.5 Algorithm by Using a Genetic Algorithm for the Diagnosis of Life Expectancy for Hepatitis Patients

ABSTRACT

As technology develops rapidly, the amount of data generated is also growing rapidly, including medical data. Such data can help diagnose the life expectancy of people with diseases such as hepatitis by applying data mining methods in the medical field. In this research, the data mining technique used is classification with the C4.5 algorithm, applied to the hepatitis dataset from the UCI Machine Learning Repository. This dataset has 19 attributes, 1 class, and 155 samples. The C4.5 algorithm is optimized using a Genetic Algorithm for feature selection. This study compares the accuracy of the C4.5 algorithm before and after optimization with the Genetic Algorithm. The C4.5 algorithm alone produces a highest accuracy of 96.23%, while the C4.5 algorithm optimized with the Genetic Algorithm reaches a highest accuracy of 98.11%. The number of features selected is 15. Applying the Genetic Algorithm to the C4.5 algorithm is thus shown to improve the accuracy of diagnosing the life expectancy of people with hepatitis by 1.88%.
The decision tree is among the most powerful and widely used techniques for classification and prediction (Perveen et al., 2016). One of the algorithms developed for building decision trees is the C4.5 algorithm, which can produce predictions with low execution time and high accuracy (Muslim et al., 2018). There are several ways to improve its accuracy. One of them is the data preprocessing stage. Preprocessing techniques can enhance data quality and improve accuracy, because data quality determines the predictive performance of a method and the usefulness of the extracted knowledge (Asgarnezhad, Shekofteh, & Boroujeni, 2017).
The data preprocessing stage includes an attribute selection (feature selection) process, which plays an essential role in data mining. Attribute selection is a crucial procedure in pattern recognition that contributes to improved classification model performance (Eid & Abraham, 2018). In medical data, attribute selection helps identify relevant attributes, so that the selected attributes contribute to diagnosing disease and yield more accurate results (Aini, Sari, & Arwan, 2018). Attributes that are not considered important (irrelevant or redundant) are removed. Eliminating unnecessary attributes reduces the computational workload of high-dimensional data and speeds up the calculation of objective functions in the classification process. Attribute selection is an important factor in increasing the accuracy of the classification process (Liu et al., 2011). One of the algorithms that can be used for attribute selection is the Genetic Algorithm. The Genetic Algorithm was chosen because it can reduce the number of data attributes: data that initially has many attributes is reduced to fewer attributes without reducing the data itself (Nugroho, Nhita, & Trantoro, 2016). This study proposes optimizing the C4.5 classification algorithm with Genetic Algorithm feature selection to diagnose the life expectancy of hepatitis patients.

Methods
This study uses a Genetic Algorithm to select useful features in order to optimize the performance of the C4.5 algorithm. The Genetic Algorithm searches for the best attributes based on the best fitness value within a predetermined number of generations. This study compares accuracy before and after applying the Genetic Algorithm to the C4.5 algorithm. The flowchart of the C4.5 algorithm with the Genetic Algorithm is shown in Figure 1.

Data Preprocessing
This study used a public dataset, the hepatitis dataset available from the UCI Machine Learning Repository. The dataset has 20 attributes, including one class attribute, and 155 instances; six attributes are numeric and 14 are nominal. The description of the dataset can be seen in Table 1.

Stage of Preprocessing Nominal Attribute Data to Binary
This stage converts attributes of nominal type into binary attributes with values 0 and 1.
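As a minimal sketch of this conversion, the Python function below maps a two-valued nominal column to 0 and 1. The function name `to_binary`, the example labels "yes"/"no", and the choice to leave missing entries (represented here as `None`) untouched are assumptions made for illustration; the study does not specify them.

```python
def to_binary(values, positive):
    """Map a two-valued nominal column to 0/1.

    `positive` is the label encoded as 1; every other label becomes 0.
    Missing entries (None, an assumed marker) are left as None.
    """
    return [None if v is None else int(v == positive) for v in values]


# Hypothetical nominal column with one missing entry.
steroid = ["yes", "no", None, "yes"]
encoded = to_binary(steroid, positive="yes")
print(encoded)  # [1, 0, None, 1]
```

Leaving missing entries in place keeps this step independent of how missing values are later filled in.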

Stage of Cleansing Data
The dataset used in this study contains missing values. Missing values arise when attribute data are lost for various reasons, such as medical events, cost savings, anomalies, and so on. These missing values therefore need to be handled. In this study, each missing value is replaced with the most frequent value (the mode) of the corresponding attribute.
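Replacing missing values with the most frequent value of an attribute can be sketched as follows. The function name `impute_mode` and the use of `None` as the missing-value marker are assumptions for illustration, not details taken from the study.

```python
from collections import Counter


def impute_mode(values, missing=None):
    """Replace every missing entry with the most frequent observed value."""
    observed = [v for v in values if v != missing]
    if not observed:
        # Nothing to learn the mode from; return the column unchanged.
        return list(values)
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]


# Hypothetical binary attribute with two missing entries.
column = [1, 0, None, 1, None]
print(impute_mode(column))  # [1, 0, 1, 1, 1]
```

When two values are equally frequent, `Counter.most_common` returns the one encountered first, so a tie-breaking rule would need to be chosen explicitly if that matters.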

Stage of Data Normalization
Data normalization is a preprocessing stage in which attribute data are scaled to fit within a smaller specific range, such as [0, 1] or [-1, 0]. In this study, each attribute value x_i is normalized to the range [0, 1], where max(x) is the maximum value of the attribute and min(x) is the minimum value of the attribute, as in Equation 1.
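The description above corresponds to standard min-max normalization, x' = (x_i - min(x)) / (max(x) - min(x)). The sketch below assumes that this is the form of Equation 1; the function name `min_max_normalize` and the handling of constant columns are illustrative choices.

```python
def min_max_normalize(values):
    """Scale a numeric column to the range [0, 1] with min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric attribute values.
print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```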

Genetic Algorithm
The stages of the Genetic Algorithm in attribute selection are as follows.
1. Individual representation: determine what type of data will be examined so that it can be processed into a coding scheme. This coding scheme represents each chromosome to be evaluated.
2. Evaluate the fitness value of each chromosome in the population.
3. Selection: choose chromosomes as parent candidates based on the fitness value of each chromosome. Chromosomes with good fitness values are retained.
4. Recombination: after the chromosomes are selected as parents, recombination, often called cross-over, aims to produce new (offspring) chromosomes with better fitness values than the previous chromosomes.
5. Mutation: change the value of genes on a chromosome.
6. Update: replace the old chromosomes with the new chromosomes based on their fitness values.
If the best fitness value or the maximum generation is reached, stop the iteration. Otherwise, go back to step 2.
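The stages above can be sketched in Python as a small bit-mask Genetic Algorithm. This is an illustrative sketch, not the study's implementation: the function name `select_features`, the tournament selection, single-point cross-over, elitism, all default parameter values, and the toy fitness function are assumptions. In the study, the fitness of a chromosome would presumably come from the C4.5 classification accuracy on the selected attributes.

```python
import random


def select_features(n_features, fitness, pop_size=20, max_generations=30,
                    crossover_prob=0.8, mutation_prob=0.05, target=None, seed=0):
    rng = random.Random(seed)
    # Stage 1: a chromosome is a bit mask (1 = keep the attribute, 0 = drop it).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):
        # Stage 2: evaluate every chromosome in the current population.
        scores = [fitness(c) for c in population]

        def pick_parent():
            # Stage 3: binary tournament; the fitter of two random chromosomes wins.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        offspring = [best[:]]  # elitism: carry the best chromosome over unchanged
        while len(offspring) < pop_size:
            p1, p2 = pick_parent(), pick_parent()
            child = p1[:]
            # Stage 4: single-point cross-over.
            if n_features > 1 and rng.random() < crossover_prob:
                cut = rng.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            # Stage 5: bit-flip mutation.
            child = [1 - g if rng.random() < mutation_prob else g for g in child]
            offspring.append(child)
        # Stage 6: the offspring replace the old population.
        population = offspring
        best = max(population, key=fitness)
        # Stop early once the target fitness is reached (if one was given).
        if target is not None and fitness(best) >= target:
            break
    return best


# Toy fitness (hypothetical): reward three "relevant" attributes, penalize the rest.
RELEVANT = {0, 2, 5}


def toy_fitness(mask):
    kept = {i for i, g in enumerate(mask) if g}
    return len(kept & RELEVANT) - 0.1 * len(kept - RELEVANT)


best_mask = select_features(8, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

Fixing the random seed makes a run reproducible, which is useful when comparing accuracy across experiments.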

C4.5 Algorithm
The C4.5 algorithm is a refinement of the ID3 algorithm developed by Ross Quinlan in 1986 (Kathija, Nisha, & Sathik, 2017). In the C4.5 algorithm, attribute selection is made using the Gain Ratio, which is computed from the Entropy value. The stages of the C4.5 algorithm in classifying the dataset are as follows:
1. Prepare training data that have been grouped into certain classes. Training data consist of 66% of the entire dataset.
5. Repeat step 2 until all records are partitioned.
6. The decision tree partitioning will stop when:
a. There are no attributes left to partition.
b. There are no records in a branch.
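To illustrate how the Gain Ratio is obtained from the Entropy value, the sketch below computes both for a nominal attribute. The function names and the small example data are assumptions for illustration; handling of numeric attributes (which C4.5 splits on thresholds) is omitted.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(attribute_values, labels):
    """Gain ratio of a nominal attribute: information gain / split information."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the partitions induced by the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical example with the two classes used in the hepatitis dataset.
labels = ["live", "live", "die", "die"]
print(gain_ratio([0, 0, 1, 1], labels))  # attribute separates the classes perfectly
print(gain_ratio([0, 1, 0, 1], labels))  # attribute carries no class information
```

The attribute with the highest gain ratio would be chosen as the split at each node of the tree.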

Results and Discussion
This study uses the hepatitis dataset, a public dataset from the UCI Machine Learning Repository. The dataset has 20 attributes (19 predictive attributes and 1 class attribute) and 155 samples. The research was implemented in the PHP programming language with the Laravel framework. The results of this study are the classification accuracies obtained in diagnosing the life expectancy of people with hepatitis.

Results
This research is divided into two experiments. The first tests the C4.5 classification algorithm, which processes the hepatitis dataset based on the gain ratio value. The accuracy results of the C4.5 algorithm can be seen in Table 2. At the C4.5 classification stage, without the feature selection process, the highest accuracy is 96.23%. The second experiment optimizes the C4.5 algorithm with a Genetic Algorithm for feature selection. The parameters used to initialize the Genetic Algorithm process are as follows:
: 19
e. Total Population: 150
The accuracy results obtained are shown in Table 3. The highest accuracy of the C4.5 algorithm after applying the Genetic Algorithm is 98.11%, and the number of selected features is 14.

Discussion
This study aims to determine how optimizing the C4.5 algorithm with Genetic Algorithm feature selection works, and what accuracy it achieves, in diagnosing the life expectancy of people with hepatitis. Feature selection is carried out based on the best fitness value, taking into account Genetic Algorithm parameters such as the maximum number of generations, population size, cross-over probability, and mutation probability. After the preprocessing stage, classification is performed using the C4.5 algorithm. The results show an increase in accuracy. A comparison of the accuracy of the C4.5 algorithm before and after optimization with the Genetic Algorithm is shown in Figure 2.

Figure 2. Increased accuracy
The split between training and testing data was chosen randomly, so the accuracy obtained with the C4.5 algorithm changes from one experiment to the next. The highest accuracy obtained by applying the Genetic Algorithm is 98.11%, with 14 features used. Based on the experiments conducted, using the Genetic Algorithm with the C4.5 classifier increased accuracy in diagnosing the life expectancy of people with hepatitis by 1.88%. The accuracy reported by the system reflects how well it assigns records to the die or live class. The comparison of this study with previous research can be seen in Table 4. The advantage of this model is that applying the Genetic Algorithm as feature selection for C4.5 classification can improve accuracy in diagnosing the life expectancy of people with hepatitis. The drawback of this model is that the accuracy results tend to fluctuate, because they depend on the random initialization used by the Genetic Algorithm when determining fitness values.
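As a rough sanity check on these figures, the sketch below works out what the reported accuracies would mean in terms of test samples. It assumes that 66% of the 155 samples are used for training and the remaining 53 for testing; the test set size is not stated explicitly in the study, so this is an inference rather than a reported value.

```python
n_samples = 155
n_train = int(n_samples * 0.66)   # 102 training samples (assumed rounding down)
n_test = n_samples - n_train      # 53 testing samples


def accuracy_pct(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)


print(n_test)                      # 53
print(accuracy_pct(51, n_test))    # 96.23
print(accuracy_pct(52, n_test))    # 98.11
```

Under that assumption, the two highest accuracies are consistent with 51 and 52 correctly classified test samples, so the reported improvement would correspond to a single additional correct prediction, which fits the observation that results fluctuate between runs.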

Conclusion
In this study, the C4.5 algorithm is combined with a Genetic Algorithm as a feature selection method for diagnosis on the hepatitis dataset obtained from the UCI Machine Learning Repository. The Genetic Algorithm is applied to select the best features based on the best fitness values. This research resulted in an accuracy of 94.15% for the C4.5 algorithm. The accuracy of the C4.5 algorithm can be improved further by preprocessing with the Genetic Algorithm as feature selection, which yields a highest accuracy of 98.11% with 14 selected features. It can be concluded that applying the Genetic Algorithm to the C4.5 algorithm can improve the accuracy of diagnosing hepatitis by 1.88%.