The Improvement of C4.5 Algorithm Accuracy in Predicting Forest Fires Using Discretization and AdaBoost

ABSTRACT

al., 2014). However, negative factors in the dataset, such as noise, missing values, and inconsistent data, significantly affect the method's success. Thus, the dataset is preprocessed using the discretization method to obtain a final data set that can be considered correct and valid for different data mining algorithms (Garcia et al., 2016). Discretization is a method that aims to reduce the number of distinct values of a given continuous variable by dividing its range into a finite set of separate intervals and then associating these intervals with meaningful labels, thereby reducing system memory demands and increasing algorithm efficiency (Dash, Paramguru, & Dash, 2011). Besides preprocessing, the boosting method Adaptive Boosting (AdaBoost) can be combined with other classifier algorithms to improve classification performance (Listiana & Muslim, 2017). Boosting is a machine learning method that converts weak classifiers into stronger ones (Rahim, Paulraj, & Adom, 2013).
This study aims to determine the increase in accuracy of the C4.5 algorithm before and after AdaBoost and discretization in predicting forest fires. Discretization and Adaptive Boosting (AdaBoost) methods are proposed to improve the accuracy of the C4.5 algorithm in predicting forest fires using the dataset obtained from the Machine Learning Repository of UCI.

Methods
This study combined the discretization and AdaBoost methods with the C4.5 algorithm to improve the accuracy of forest fire predictions. The forest fire dataset used in the current study is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Forest+Fires). This dataset comes from the Department of Information Systems, University of Minho, Portugal.
In this study, the authors only used eight attributes: FFMC, DMC, DC, ISI, temp, RH, wind, and rain (Shidik et al., 2014). The forest fire attribute dataset can be seen in Table 1. The discretization method is applied to the forest fire dataset. The dataset is then divided into two data classes, small and large, using rules that refer to normalized values. The normalization rules can be seen in equations 1 and 2 (Yu et al., 2011), and the formula for calculating normalization can be seen in equation 3.
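The normalization and class-labeling step can be sketched in a few lines. This is a minimal illustration assuming equation 3 is standard min-max normalization; the 0.5 threshold and the `label_area` helper are hypothetical choices for illustration, not taken from the paper.

```python
# Hedged sketch: min-max normalization (assumed form of equation 3) and the
# small/large class split on the normalized "area" attribute.
# The 0.5 threshold below is an illustrative assumption.

def min_max_normalize(values):
    """Scale a list of numbers into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def label_area(normalized_area, threshold=0.5):
    """Assign the 'small' or 'large' class from a normalized burned-area value."""
    return "large" if normalized_area >= threshold else "small"

areas = [0.0, 0.9, 6.4, 48.6, 278.5]   # example burned-area values (hectares)
normalized = min_max_normalize(areas)
labels = [label_area(v) for v in normalized]
```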
By using k-fold cross-validation, the dataset is further divided into training data and test data. AdaBoost then reweights the training data over 10 iterations. Furthermore, the C4.5 algorithm is used to perform classification. The final output is a confusion matrix, from which accuracy is calculated. The level of accuracy is taken from the highest accuracy value of the ten k-fold cross-validations. The flowchart of the method used in this study can be seen in Figure 1.

Discretization
Discretization is the process of converting a continuous attribute value into several finite intervals and associating a discrete, numeric value with each interval (Al-Ibrahim, 2011). Discretization aims to reduce the number of distinct values of a given continuous variable by dividing its range into a finite set of separate intervals and then associating these intervals with meaningful labels (Dash et al., 2011).
The process of discretization is to find the number of discrete intervals and then the width, or boundaries, of the cut points over the continuous range of attribute values. In this study, the data were divided into two intervals: in the first interval, the data is labeled 0, and in the second, it is labeled 1.
Discretization can thus be seen as partitioning a continuous-valued attribute into a sequence of discrete intervals, which is equivalent to reducing the number of states of a discrete random variable by combining adjacent values.
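The two-interval discretization described above can be sketched as follows. The equal-width split at the midpoint of the range is an assumption for illustration, since the paper does not name the binning strategy used.

```python
# Hedged sketch of unsupervised discretization into two intervals,
# labeled 0 and 1 as in the study; the equal-width (midpoint) cut is
# an illustrative assumption.

def discretize_two_intervals(values):
    """Map each continuous value to 0 (lower half of the range) or 1 (upper half)."""
    lo, hi = min(values), max(values)
    midpoint = lo + (hi - lo) / 2.0
    return [0 if v < midpoint else 1 for v in values]

temps = [2.2, 8.3, 15.4, 21.3, 33.3]   # example temperature attribute values
bins = discretize_two_intervals(temps)
```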

C4.5 Algorithm
One of the algorithms that can be used to build a decision tree is the C4.5 algorithm. The C4.5 algorithm was introduced by Quinlan as an improvement over the ID3 algorithm. The C4.5 tree model is built by dividing the data recursively until each part consists of data from the same class. The first step of the C4.5 algorithm is to compute the global entropy of the forest fire dataset using equation 4. The attribute with the most significant information gain is then used as the root node, and the dataset is split along the branches of that node using equation 6.
The calculation is then repeated until all attributes have been assigned data classes. The decision tree partitioning process stops when all records in node N belong to the same class, when no attributes remain to partition the records further, or when a branch contains no records.
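The entropy and information-gain calculations that drive the C4.5 splitting step can be sketched as below. This is a minimal illustration; attribute values are assumed to be already discretized, as in the study.

```python
import math
from collections import Counter

# Hedged sketch of the entropy (equation 4) and information-gain computations
# used to choose split attributes in C4.5.

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Global entropy minus the weighted entropy of each attribute branch."""
    total = len(labels)
    branches = {}
    for v, y in zip(attribute_values, labels):
        branches.setdefault(v, []).append(y)
    weighted = sum(len(ys) / total * entropy(ys) for ys in branches.values())
    return entropy(labels) - weighted
```

The attribute maximizing `information_gain` over the full dataset becomes the root node, and the process recurses on each branch.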

AdaBoost
Boosting is a machine learning approach that produces accurate predictions by combining many relatively weak and inaccurate rules (Nurzahputra & Muslim, 2017). AdaBoost sets an initial distribution on the training set and then iterates, using adaptive weights, until the stopping criterion is reached (Kim & Upneja, 2014).
The steps in the AdaBoost algorithm are as follows: first, initialize a uniform weight distribution over the training examples; second, in each iteration, train a weak classifier on the weighted data, compute its weighted error and the corresponding vote weight, and increase the weights of the misclassified examples; finally, combine all weak classifiers into a strong classifier by a weighted majority vote.
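The standard AdaBoost loop can be sketched as below. The paper pairs AdaBoost with C4.5 trees as the weak learner; the one-dimensional threshold stump here is a simplifying assumption so the sketch stays self-contained.

```python
import math

# Hedged sketch of the standard AdaBoost weight-update loop, with a
# threshold "stump" standing in for the C4.5 weak learner used in the study.

def stump_predict(x, threshold, polarity):
    """A trivial weak learner: sign decided by a single threshold."""
    return polarity if x >= threshold else -polarity

def train_adaboost(xs, ys, rounds=10):
    n = len(xs)
    weights = [1.0 / n] * n                      # step 1: uniform distribution
    ensemble = []
    for _ in range(rounds):
        # step 2: pick the stump with the lowest weighted error
        best = None
        for threshold in xs:
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, threshold, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log of zero
        alpha = 0.5 * math.log((1 - err) / err)  # step 3: vote weight
        # step 4: re-weight examples, boosting the misclassified ones
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((alpha, threshold, polarity))
    return ensemble

def predict(ensemble, x):
    """Final strong classifier: sign of the weighted vote of all weak learners."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```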

Results
In this research, before entering the main data mining process, the data is prepared in a preprocessing step. This step is crucial for preparing relevant data so that the data mining process produces high accuracy. Data normalization, data transformation, and discretization are used to perform data preprocessing in this research. The forest fire dataset used is shown in Table 2. Data transformation is performed on the dataset class to make processing easier. The data class is indicated in the area attribute, which is divided into two categories, small and large, referring to the normalized value. These transformation rules are applied to the forest fire dataset, and the results are shown in Table 3.
After the data were transformed, discretization was applied to reduce the number of distinct values of each continuous variable by dividing its range into a finite set of separate intervals and then associating these intervals with meaningful labels. In this study, discretization divides the value of each attribute into two intervals, labeled 0 and 1. Table 4 shows the forest fire dataset after discretization. Data that has gone through the preprocessing stage then enters the data mining process. At this stage, the data mining process is carried out twice. The first process is classification with the C4.5 algorithm using the dataset without discretization. The second process is classification with the C4.5 algorithm combined with AdaBoost using the dataset that has been discretized.
In the first experiment, the forest fire dataset was classified using the C4.5 algorithm without first being discretized, and the results were validated using k-fold cross-validation with k=10. With this method, the dataset was divided into training data and test data. The training data was processed using the C4.5 algorithm to produce a tree model, which was then tested on the test data. The confusion matrix was used to measure algorithm performance. The accuracy results of the C4.5 algorithm across the ten folds of cross-validation are shown in Table 5.
The results of this classification process were later compared with classification using the C4.5 algorithm combined with AdaBoost and discretization. Applying the C4.5 algorithm alone to the forest fire dataset produces a highest accuracy of 84.62% across the ten folds of cross-validation.
In the second experiment, the dataset of forest fires that had been discretized was classified using the C4.5 algorithm and AdaBoost with the validation method, namely k-fold cross-validation with a value of k=10.
The forest fire dataset is divided into training data and test data using cross-validation with random sampling. AdaBoost then reweights the training data, which is processed with the C4.5 algorithm to obtain a tree model. The boosting process runs for iterations i ≤ 10, so that each iteration yields a better tree model than the last. The final tree model is then tested using the test data, and the confusion matrix is used to measure algorithm performance. The accuracy results of the C4.5 algorithm with AdaBoost and discretization across the ten folds of cross-validation are shown in Table 6. Applying the C4.5 algorithm combined with AdaBoost and discretization to the forest fire dataset produces a highest accuracy of 98.04% across the ten folds.

Discussion
Improving the accuracy of the C4.5 algorithm in predicting forest fires using discretization and AdaBoost involves three stages: the first stage is data collection, the second stage is data preprocessing, and the third stage is the data mining process.
Prediction of forest fires using the C4.5 classification algorithm obtained an accuracy of 84.62% from ten k-fold cross-validations, while prediction using the C4.5 algorithm with AdaBoost on the dataset that had gone through the discretization process obtained an accuracy of 98.04%. Based on these two results, applying discretization and AdaBoost to the C4.5 decision tree classification algorithm increases accuracy by 13.42% in the prediction of forest fires.
Given this level of accuracy, the model is shown to predict forest fires on the UCI Machine Learning Repository forest fire dataset. A comparison was made with previous studies that used the same dataset to determine whether this method outperforms existing ones; the comparison of forest fire prediction accuracy is shown in Table 7. In this study, the authors applied discretization and AdaBoost to the C4.5 algorithm and achieved higher accuracy than the previous research conducted by Xie & Peng (2018), in which a decision tree technique applied to the forest fire dataset resulted in an accuracy of 70.04%. This significant increase occurs because the discretization process reduces system memory demands, increases the efficiency of data mining algorithms, and makes the knowledge extracted from discretized datasets more concise and easier to understand and use. It also occurs because AdaBoost retrains on the data to obtain a classification model stronger than the one produced by the C4.5 algorithm alone, so the combined method improves the accuracy of forest fire predictions with the C4.5 algorithm.

Conclusion
Based on the research results, the preprocessing discretization and AdaBoost methods improve the performance of the C4.5 algorithm in predicting forest fires. Classifying the forest fire dataset with the C4.5 algorithm alone obtained an accuracy of 84.62%, while adding discretization and AdaBoost obtained an accuracy of 98.04%. The use of discretization and AdaBoost with the C4.5 algorithm thus increased accuracy by 13.42% compared to using the C4.5 algorithm alone.