Optimization of the C4.5 Algorithm Using Particle Swarm Optimization and Discretization in Predicting the Results of English Premier League Football Matches

ABSTRACT


Introduction
The English Premier League is the top level of the men's English football league system, organized by the football association in England, The Football Association (Prabowo, 2020). It is one of the most followed leagues in the world, as the follower counts of its clubs' Instagram accounts show: in April 2022, Manchester United had 57.7 million followers, Liverpool had 38 million, and Chelsea had 33.9 million. This popularity makes predicting the results of English Premier League matches an attractive problem. Previous research has predicted football match results using Machine Learning (Alfredo & Isa, 2019; Baboota & Kaur, 2019; Prabowo, 2020; Razali et al., 2017; Zhang et al., 2021).
Machine Learning can be used to make predictions on a dataset using data mining classification techniques (Yuliani, 2021). Classification is a technique for predicting data or describing data classes (Alamsyah & Fadila, 2021). In data mining, a popular Decision Tree-based classification algorithm for analyzing statistical data is the C4.5 algorithm (Muslim et al., 2018), which has been used to predict data (Prihanditya & Alamsyah, 2020). Here the algorithm classifies the results of English Premier League matches based on a dataset of match statistics. The possible results are a home win, an away win, and a draw. Previous research reports obstacles in attribute selection and in handling continuous-valued data, so that the algorithm used for classification does not run optimally (Baboota & Kaur, 2019).
In this study, classification is carried out with a Decision Tree-based algorithm, namely the C4.5 algorithm. C4.5 is a Decision Tree algorithm that is able to produce the best prediction accuracy while requiring minimal execution time (Saputra, 2020). This research then preprocesses the data with the Discretization method so that the data are more concise and easier for the system to process. Discretization divides the data into ranges so that the data become more concise (Dash et al., 2011). After preprocessing, feature selection is applied to remove attributes that do not improve classification accuracy (Lestari & Alamsyah, 2020), that is, attributes that have no important influence on the classification results. Feature selection uses Particle Swarm Optimization because it performs better than genetic algorithms, especially in the field of optimization (Muslim et al., 2018). Particle Swarm Optimization is also able to overcome the problem of imbalance in the dataset. Dataset imbalance can be derived from the algorithm used in the classification process (Fanny & Cenggoro, 2018).

The Proposed Method
C4.5 Algorithm
The C4.5 algorithm is a Decision Tree algorithm in data mining: a classification technique whose result is expressed as a decision tree and decision rules (Hermanto & Azhari, 2017). It is a continuation of the ID3 algorithm with several improvements, such as processing continuous data and handling missing values (Irena & Setiawan, 2020). The C4.5 algorithm can handle data with numeric and discrete values (Irawan, 2021).
To build a decision tree, the C4.5 algorithm chooses the attribute to be used as the root, then creates branches until all cases in a branch have the same class (Muzakir & Wulandari, 2016). The attribute used at each step is the one with the highest gain value among all attributes in the dataset (Sulistyo, 2020).
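The attribute choice described above can be illustrated with a minimal Python sketch. It assumes discrete attribute values and uses entropy-based information gain together with the gain ratio that C4.5 is commonly described as using; the function names and the toy data are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the split information of the attribute."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: two binary attributes and the match result (H, D, A).
attributes = {
    "HST": [1, 1, 0, 0, 1, 0],
    "HC":  [1, 0, 1, 0, 1, 0],
}
ftr = ["H", "H", "A", "A", "H", "D"]

# The attribute with the highest score becomes the split at the current node.
best_attribute = max(attributes, key=lambda name: gain_ratio(attributes[name], ftr))
```

In this toy example `HST` separates the home wins perfectly from the other results, so it is chosen over `HC`.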

Discretization
Discretization is used in data preprocessing to convert a continuous-valued attribute into an attribute whose values are intervals represented by discrete numeric values (Mirqotussa'adah et al., 2017). Discretization plays an important role in the data preprocessing stage of data mining, especially in classification (Kapoor et al., 2017).
The purpose of the discretization process is to determine a set of cut points that divide the range into a small number of intervals. Discretization has two main tasks: first, finding the number of discrete intervals; second, finding the width or boundaries of the intervals given the range of values of the continuous-valued attribute (Kapoor et al., 2017).

Particle Swarm Optimization
Particle Swarm Optimization is a search method that has a simple structure and strong operating capabilities and is easy to implement (Sulistyo, 2020). It maintains a set of particles that is updated in every iteration. To reach the optimal solution, each particle moves towards its personal best position and the global best position of the group (Sundaramurthy & Jayavel, 2020).
Particle Swarm Optimization is an optimization technique that can handle problems on data with continuous or discrete values (Sengupta et al., 2018). Unlike other optimization techniques, each particle in Particle Swarm Optimization is associated with a velocity. Particle Swarm Optimization is similar to the Genetic Algorithm, but it does not use evolution operators such as mutation and recombination or crossover (Muslim et al., 2019). The Particle Swarm Optimization process begins by initializing the parameters and the initial particle positions and velocities. It then calculates the fitness values, updates the personal best and global best, and updates the particle positions and velocities. Once all iterations are complete, it returns the optimal solution.
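The loop described above can be sketched in Python as a basic global-best Particle Swarm Optimization. This is an illustration under stated assumptions, not the implementation used in the study: it minimizes a hypothetical continuous test function (`sphere`), starts from zero velocities, and borrows its default parameter values from the feature selection settings reported later in the paper.

```python
import random

def pso(fitness, dim, n_particles=50, n_iter=100, w=0.72, c1=0.7, c2=0.7, seed=0):
    """Minimize `fitness` with a basic global-best PSO (illustrative sketch)."""
    rng = random.Random(seed)
    # Initialize positions at random and velocities at zero (an assumption).
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Hypothetical test objective with its minimum at the origin."""
    return sum(v * v for v in x)

best_pos, best_val = pso(sphere, dim=3)
```

For feature selection, the position would instead encode which attributes are kept and the fitness would be a classification score; that variant is not shown here.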

Method
In this research, football match results are predicted by applying the C4.5 algorithm as the data mining classification method, with discretization used at the data preprocessing stage and Particle Swarm Optimization used for feature selection to obtain the selected attributes. The result of this research is the accuracy achieved by the proposed method. The flowchart of the proposed method can be seen in Figure 1.

Results and Discussion
This section is divided into two parts, results and discussion. The results describe the data and findings obtained using the methods and procedures described in the data collection method. The discussion reviews the results and answers the research questions more comprehensively.

Data Collection
The dataset obtained from the football-data.co.uk site is divided into two types of data, namely football match statistics and bookmaker odds predictions. Only the football match statistics are used; the bookmaker odds prediction data are deleted. The data used have a total of 1,520 instances. The football match statistics dataset can be seen in Table 1. Based on Table 1, the dataset is then processed at the preprocessing stage using the Discretization method, followed by feature selection using Particle Swarm Optimization.

Data Preprocessing
Data preprocessing first converts the FTR attribute, which originally had a polynomial value, to a numeric value. Discretization is not applied to all attributes: HTHG, HTAG, and FTR are excluded. HTHG and HTAG record the number of goals scored by the home and away teams. Discretizing these attributes would affect the classification results and reduce the accuracy obtained. The discretization results can be seen in Table 2.
Discretization is applied to the attributes HS, AS, HST, AST, HF, AF, HC, and HY. Each of these attributes initially has a continuous value and is divided into two intervals with equal-frequency division, so the discretized values become 0 and 1.
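Equal-frequency division into two intervals amounts to cutting each attribute near its median. The following Python sketch shows one way to do this; the helper name and the sample values are hypothetical, and the tool actually used in the study is not stated in this section.

```python
def equal_frequency_bins(values, n_bins=2):
    """Return (codes, cuts): a bin index 0..n_bins-1 per value, and the cut points."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points are the values found at the k/n_bins positions of the sorted data.
    cuts = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    # A value's bin index is the number of cut points it reaches or exceeds.
    codes = [sum(v >= c for c in cuts) for v in values]
    return codes, cuts

# Hypothetical home-shots (HS) values, not taken from the dataset.
hs = [8, 12, 15, 9, 20, 11, 14, 17]
codes, cuts = equal_frequency_bins(hs)
```

Here the single cut point falls at 14, so four values are coded 0 and four are coded 1. With many tied values the two bins can be unequal in size. A library discretizer with a quantile strategy, such as scikit-learn's `KBinsDiscretizer(n_bins=2, strategy="quantile")`, would be an alternative, although the source does not name it.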

Feature Selection
The results of feature selection using Particle Swarm Optimization are used in the data mining classification process. An attribute that produces a value of 0 is not used in the data mining process, while an attribute that produces a value of 1 is used. The results of the feature selection process can be seen in Table 3. They were obtained with the following Particle Swarm Optimization parameters: C1 = 0.7, C2 = 0.7, w = 0.72, number of particles = 50, and number of iterations = 100. The selected attributes are HTHG, HTAG, AS, HST, AST, HC, and HY. The attributes HS, HF, and AF are not used because they have less influence in the data mining process.
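The 0/1 selection result can be applied to the data as a simple mask. The sketch below is illustrative: the mask values follow the selection reported in the text, while the attribute order, the helper name, and the sample row are assumptions.

```python
# 1 = attribute kept, 0 = attribute dropped, as reported for the PSO selection.
# The ordering of the attributes here is an assumption.
mask = {"HTHG": 1, "HTAG": 1, "HS": 0, "AS": 1, "HST": 1,
        "AST": 1, "HF": 0, "AF": 0, "HC": 1, "HY": 1}

def select_features(row, mask):
    """Keep only the attributes whose mask value is 1."""
    return {name: value for name, value in row.items() if mask.get(name) == 1}

# Hypothetical preprocessed match row (values are illustrative only).
row = {"HTHG": 1, "HTAG": 0, "HS": 1, "AS": 0, "HST": 1,
       "AST": 0, "HF": 1, "AF": 0, "HC": 1, "HY": 0}
selected = select_features(row, mask)
```

The resulting row keeps the seven selected attributes and drops HS, HF, and AF.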

Data Splitting
The data are split with a ratio of 90:10, because this ratio was found to be the best data split (Prabowo, 2020). The more training data, the better the model learned by the system. The results of the data split can be seen in Table 4 and Table 5. Based on Table 4 and Table 5, the training data comprise 90% of the 1,520 instances, that is, 1,368 rows, while the testing data comprise the remaining 10%, that is, 152 rows.
The data mining process is carried out in three testing processes. The first process is prediction of the results of English Premier League matches using the C4.5 algorithm without preprocessing and feature selection. The results were evaluated using a confusion matrix. Table 6 shows the classification results calculated using the confusion matrix.

Accuracy is calculated from the confusion matrix as the share of correct predictions:

$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \times 100\% \tag{1}$$

Based on Equation 1, the accuracy computed from the confusion matrix in Table 6, for the C4.5 algorithm without data preprocessing and feature selection, is 57.24%. This accuracy is considered too low, so the next testing process is carried out.
The second process is prediction of English Premier League match results using the C4.5 algorithm with Discretization preprocessing and without feature selection. The discretized dataset is passed through the data mining process and the results are evaluated using a confusion matrix. Table 7 shows the confusion matrix for the second testing process. Based on Equation 1, the accuracy computed from the confusion matrix in Table 7, for the C4.5 algorithm with Discretization preprocessing and without feature selection, is 65.13%. This accuracy still needs to be improved, so further testing is carried out.
The third testing process is prediction of English Premier League match results using the C4.5 algorithm with Discretization preprocessing and Particle Swarm Optimization feature selection. The discretized dataset with the 7 selected attributes is used in the data mining process, and the results are evaluated using a confusion matrix. Table 8 shows the confusion matrix for the third test. Based on Equation 1, the accuracy computed from the confusion matrix in Table 8, for the C4.5 algorithm with Discretization preprocessing and Particle Swarm Optimization feature selection, is 71.05%. This result shows that Discretization and Particle Swarm Optimization improve the accuracy of the C4.5 algorithm in predicting the results of English Premier League matches.
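The accuracy calculation used in the three tests can be sketched as follows, assuming the usual multi-class definition (diagonal of the confusion matrix divided by the total count). The matrix below is hypothetical and does not reproduce the values in Tables 6 to 8.

```python
def accuracy(confusion):
    """Percentage of correct predictions: diagonal sum divided by the total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

# Hypothetical 3x3 matrix (rows: actual H, D, A; columns: predicted H, D, A).
# These counts are illustrative and are NOT the values from Tables 6 to 8.
example = [[50, 5, 10],
           [15, 8, 12],
           [10, 4, 38]]
example_accuracy = accuracy(example)

# Back-calculated check (an inference, not read from the tables): with 152 test
# rows, the reported accuracies match 87, 99, and 108 correct predictions.
reported = [round(100.0 * k / 152, 2) for k in (87, 99, 108)]
```

The back-calculated counts only show that the three reported percentages are consistent with a test set of 152 rows.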

Discussion
In this research, the accuracy of the C4.5 algorithm was compared before and after the application of preprocessing and feature selection. The Discretization method is used at the data preprocessing stage and Particle Swarm Optimization is used for feature selection in predicting the results of English Premier League football matches. Applying the proposed method increases the accuracy of the C4.5 algorithm. The comparison of accuracy obtained with the proposed method can be seen in Table 9. Based on the application of the proposed method, the accuracy of the C4.5 algorithm with Discretization preprocessing and Particle Swarm Optimization feature selection increased by 13.81 percentage points, from 57.24% without preprocessing and feature selection to 71.05% with both. The increase in accuracy arises because the discretization process makes the data more concise and easier for the system to process; the discretized values are either 0 or 1. In addition, the feature selection process retains the attributes that have the most influence on the classification results. The attributes selected by Particle Swarm Optimization are HTHG, HTAG, AS, HST, AST, HC, and HY.
The application of Discretization preprocessing and Particle Swarm Optimization feature selection to the C4.5 algorithm obtained a higher accuracy than the previous research conducted by Prabowo in 2020, where the K-Nearest Neighbor and Naïve Bayes Classifier methods were used on the same dataset as this research. That study reported accuracies of 62.7% for K-Nearest Neighbor and 63.5% for the Naïve Bayes Classifier. In that study, K-Nearest Neighbor and the Naïve Bayes Classifier were not suitable when encountering irrelevant attributes, that is, attributes that did not have an important influence on the classification results (e.g. the HS, HF, and AF attributes). This study therefore carries out a feature selection process to obtain the selected attributes. The K-Nearest Neighbor and Naïve Bayes Classifier methods also require large memory usage; this study addresses the problem with Discretization preprocessing, which reduces memory usage by reducing the values in the dataset to two values, 0 and 1.
The research conducted by Baboota & Kaur in 2019 using the Gradient Boost method resulted in an accuracy of 59%. The Gradient Boost method takes a long time to create a classification model, because it must complete one tree before building the next tree that reduces its errors. The research conducted by Alfredo & Isa in 2019 used the Wrapper Model for feature selection, followed by classification; Random Forest produced an accuracy of 68.55% and Extreme Gradient Boosting 67.89%. Research using Random Forest and Extreme Gradient Boosting has weaknesses in building classification models: learning cannot be repeated and runs slowly. In this study, the C4.5 algorithm is used because it has better performance and runs optimally.
The comparison with the accuracy of previous studies can be seen in Table 10. Based on Table 10, the proposed method obtains better accuracy than previous studies. It does so because its data preprocessing uses Discretization and its feature selection uses Particle Swarm Optimization, whereas the previous studies did not apply this combination of preprocessing and feature selection.

Conclusion
Based on the results and discussion of applying the Discretization and Particle Swarm Optimization methods to the C4.5 algorithm to predict the results of English Premier League football matches, with datasets obtained from football-data.co.uk, the proposed method can produce better accuracy. Data processing starts with preprocessing, applying the Discretization method with n_bins = 2 and an equal-frequency strategy. The discretized data then go through feature selection using Particle Swarm Optimization, which yields 7 selected attributes. Finally, the discretized data with the selected features are split into 90% training data and 10% testing data and used in the C4.5 algorithm classification process.
The highest accuracy obtained in predicting English Premier League matches is 57.24% for the C4.5 algorithm without Particle Swarm Optimization and Discretization, 65.13% for the C4.5 algorithm with Discretization, and 71.05% for the C4.5 algorithm with Discretization and Particle Swarm Optimization. Applying Discretization and Particle Swarm Optimization therefore increases the accuracy of the C4.5 algorithm by 13.81 percentage points.

Figure 1. Flowchart of proposed method
Based on Figure 1, before classification is carried out, the dataset of English Premier League match statistics is first processed at the preprocessing stage by applying Discretization. The discretized data are then used for the Particle Swarm Optimization feature selection process. The flowchart of Particle Swarm Optimization can be seen

Table 1. Football match statistics dataset

Table 3. Result of Feature Selection Particle Swarm Optimization

Table 4. Training Data

Table 5. Testing Data

Table 8. Confusion Matrix using C4.5 Algorithm with Discretization and Particle Swarm Optimization

Table 9. Accuracy Comparison

Table 10. Comparison of the Accuracy by Previous Studies