Toddler Nutritional Status Classification Using C4.5 and Particle Swarm Optimization

Purpose: This research aims to build the most optimal classification model in the form of a decision tree. Optimal here means the combination of parameters that produces the highest accuracy compared with other parameter combinations. The best model is then used to predict the nutritional status class of new data. Methods/Study design/approach: The dataset used is from Nutritional Status Monitoring in 2017 in Riau Province, Indonesia. The Knowledge Discovery in Database (KDD) stages were carried out on the dataset to build several classification models in the form of decision trees. The decision tree with the highest accuracy is then selected to predict the class of new data. Predictions for new data (unclassified data) are made in a web-based system. Result/Findings: Particle Swarm Optimization (PSO) is used to find the optimal parameters. Before PSO is used, there are 213 parameters in the dataset that can be used for classification. However, using that many parameters is time-consuming. After PSO is used, the optimal parameters found are a combination of 4 parameters, which produces the most optimal decision tree. The 4 chosen parameters are gender, age (in months), height, and the way the height is measured (standing or lying down). The most optimal decision tree has an accuracy of 94.49%. From this decision tree, a web-based system was built to predict the class of new data (unclassified data). Novelty/Originality/Value: PSO is a method that helps select the most optimal parameters, in other words the parameters that produce the highest classification accuracy. The selected combination of parameters has also been confirmed by the nutritionist. The prediction system has been declared feasible for use by nutritionists through the User Acceptance Test (UAT).


INTRODUCTION
Nutritional status describes the condition of an individual as a result of daily nutritional intake [1]. It can help monitor and optimize an individual's growth and development [2]. The toddler age (also known as the golden age) is an important period for optimizing growth and development [3]. Toddler nutritional status is evaluated based on three indicators, namely weight-for-age (W/A), weight-for-height (W/H), and height-for-age (H/A) [4]. Based on H/A, the nutritional status of toddlers can be categorized as very short, short, normal, or tall [5]. The combination of the very short and short categories is known as stunting. The results of Nutritional Status Monitoring in 2017 show that the H/A nutritional problem (stunting) is more prevalent than the W/A problem (underweight) and the W/H problem (wasting), as shown in Figure 1 and Figure 2 [4].
Stunting is a condition in which toddlers are too short for their age due to inadequate nutrition and repeated bouts of infection during the first 1000 days of life [6]. There are several causes of this condition, including poor parenting and a lack of nutritious food due to the mother's lack of knowledge. Stunting affects intelligence and physical development, increases susceptibility to disease, and puts future productivity at risk. A broader impact of stunting is that it inhibits economic growth and increases poverty [7]. One of the health development goals for reducing stunting (also called the Stunting Intervention Program) is to provide public nutrition education through the Public Nutrition Improvement Program. However, this program has rarely been carried out, so the Stunting Intervention Program has not been effective [7]. One of the actions recommended by [8] to deal with stunting is to carry out preventive activities. These include improving the identification, measurement, and understanding of stunting. This shows the importance of delivering information about stunting to the public, so that parents can ensure their toddlers are not stunted.
Based on these problems, the author builds an application that helps people find out the H/A nutritional status of their toddlers and get information about stunting. With this application, the public does not need to wait for the Public Nutrition Improvement Program to be held. In building the application, the author applies the C4.5 algorithm [9] and Particle Swarm Optimization [10] to the 2017 data of toddlers in Riau Province to classify the H/A nutritional status. The results of the algorithm are applied in the web-based application, so that people can find out the H/A nutritional status of their toddlers and get information about stunting.
The C4.5 algorithm is known to exceed the Learning Vector Quantization (LVQ) algorithm in average accuracy and to have a faster processing time than the K-Nearest Neighbors (K-NN) method in classifying student abilities [11]. Other studies using the C4.5 algorithm have also shown good performance in classifying the eligibility of prospective new debtors [12] and have exceeded Naïve Bayes in predicting the future behavior of customers [13]. The C4.5 algorithm can also be optimized with PSO to select parameters/features [14]-[18]. This feature selection is done to avoid unrelated features that can reduce the performance of the classifier model. The C4.5 algorithm optimized with PSO can exceed the accuracy of Support Vector Machine (SVM), Self-Organizing Map, Back Propagation Neural Network (BPNN), and plain C4.5 in classifying cancer [19]. Another study shows that the average accuracy of C4.5 combined with PSO in classifying 5 medical data sets can outperform the average accuracy of Logistic Regression, SVM, BPNN, and C4.5 [20]. Previous research on the classification of nutritional status includes the classification of the W/A nutritional status of toddlers using LVQ, with 80% accuracy (Euclidean distance) and 20% accuracy (Manhattan distance) [21]. Another study that classifies nutritional status using Naïve Bayes reached 93.2% accuracy [22].

METHODS
The research methods of this study are shown in Figure 3.

Data Collection
At this stage, data collection is conducted. The data collected is toddler data from the results of Nutritional Status Monitoring in 2017 in Riau Province, Indonesia. The data obtained consist of 3,961 records and 213 parameters.

Analysis
At this stage, the author analyzes the toddler data through data selection, pre-processing, transformation, and data mining [23]. The best accuracy (from the optimum parameters or decision tree) is found in the data mining step.
1) Data selection. This is the step to select the data and parameters to be used. In this study, 3,961 records and 19 parameters were selected. The selected parameters are expected to be related to the H/A nutritional status of toddlers. This step was done in discussion with the nutritionist.
2) Pre-processing. The data should contain only numbers. However, some cells contain the value "#NULL!". The action taken at this step is to change the "#NULL!" value to "99". d) Outlier. Records containing outliers are deleted; there are 15 such records. After this deletion, 3,629 records remain. At the end of the pre-processing step, a total of 332 records have been deleted, so the data loss is about 8.4% (332 of 3,961).
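The placeholder replacement and the data-loss figure described above can be sketched as follows. This is a minimal illustration in Python rather than the Matlab workflow used in the study; the row layout and the helper names `replace_null_marker` and `data_loss_percent` are assumptions made for the example.

```python
NULL_MARKER = "#NULL!"  # non-numeric marker found in the raw data
NULL_CODE = "99"        # numeric code substituted for the marker


def replace_null_marker(rows):
    """Return a copy of rows with every "#NULL!" cell replaced by "99"."""
    return [[NULL_CODE if cell == NULL_MARKER else cell for cell in row]
            for row in rows]


def data_loss_percent(total_before, total_after):
    """Percentage of records removed during pre-processing."""
    return 100.0 * (total_before - total_after) / total_before


# Toy rows, made up for illustration (not taken from the dataset).
sample = [["1", "24", "#NULL!"], ["2", "#NULL!", "85.5"]]
cleaned = replace_null_marker(sample)

# Record counts reported in the text: 3,961 before and 3,629 after.
loss = data_loss_percent(3961, 3629)
print(cleaned)
print(f"{loss:.2f}% of records removed")
```

Replacing the marker with a numeric code keeps every column numeric, which is what the text requires before the data mining step.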
3) Transformation. This is the step to change the data into the format needed for data mining. In this research, the data mining step forms a decision tree based on the parameters, so the data should contain only the necessary parameters (those related to the target class). The ID parameter is not necessary for the classification process (the data mining step), as it is only needed in the pre-processing step (removing duplicates). Thus, the ID parameter can be deleted. After this deletion, 18 parameters remain.
4) Data mining. This is the step to apply the C4.5 algorithm and PSO to the data using Matlab. The flow of the C4.5 algorithm and PSO in Matlab is described by the following flowchart steps [5], [9], [10].
1) Initialize the PSO parameters. The maximum iteration, w, c1, c2, vmin, and vmax parameters are set based on previous research that uses C4.5+PSO in a similar domain, namely medical problems [20]. The number of particles is set equal to the number of parameters (excluding the target or class parameter). The values r1 and r2 are obtained from the built-in random function in Matlab.
2) Initialize the particle positions and velocities randomly. The particle position is a 17-bit binary number in which each bit represents one parameter in the data (the 18th parameter is the target class, so only 17 parameters are left). If a bit is 1, the corresponding parameter is used in the classification process (the calculation of the particle's fitness value).
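3) Calculate the fitness value of each particle using the C4.5 algorithm. The parameters used in this process (to form the decision tree) are determined by the particle's position. The C4.5 algorithm uses gain and gain ratio to form decision trees [11]. For a data set $S$ split by parameter $A$ into subsets $S_1, \dots, S_k$, these are $Gain(S, A) = Entropy(S) - \sum_{i=1}^{k} \frac{|S_i|}{|S|} Entropy(S_i)$ and $GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)$, where $SplitInfo(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$. The particle's fitness value is the accuracy of the decision tree produced by the C4.5 algorithm. The accuracy is calculated from the confusion matrix of the tree's predictions [11].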
Select the particle best (pb) and global best (gb) from the particles' fitness values. Pb is the best fitness reached by each particle, while gb is the best fitness among its neighbors.
4) Update the velocity (v) and position (x) of each particle based on pb and gb, using the following formulas [19]:
$v_{id}^{new} = w \, v_{id}^{old} + c_1 r_1 (pb_{id} - x_{id}^{old}) + c_2 r_2 (gb_{d} - x_{id}^{old})$
$S(v_{id}^{new}) = \frac{1}{1 + e^{-v_{id}^{new}}}$
$x_{id}^{new} = 1$ if $ran[0,1] < S(v_{id}^{new})$, otherwise $x_{id}^{new} = 0$
Explanation:
$v_{id}^{new}$ = new velocity of particle
$i$ = i-th particle
$d$ = d-th dimension
$w$ = inertia weight
$v_{id}^{old}$ = current velocity of particle
$c_1$ = learning rate of the particle
$c_2$ = learning rate between particles
$r_1, r_2$ = random numbers between 0 and 1
$pb$ = the position with the best fitness reached by each particle
$gb$ = the position of the particle with the best fitness among its neighbors
$x_{id}^{old}$ = current position of particle
$x_{id}^{new}$ = new position of particle
$ran[0,1]$ = random number between 0 and 1
5) Repeat steps 3-4 until the maximum iteration is reached or 100% accuracy is obtained (gb is 100), whichever comes first.
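The initialization and update steps above can be sketched as a single binary PSO step. This is a minimal Python illustration, not the Matlab code used in the study: the parameter values are hypothetical placeholders (the study takes its values from [20]), and clamping the velocity to [vmin, vmax] is an assumption about how those two bounds are applied.

```python
import math
import random


def sigmoid(v):
    """S(v) = 1 / (1 + e^(-v)): maps a velocity to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def init_particle(n_bits, vmin, vmax, rng):
    """Random binary position and random velocity for one particle."""
    position = [rng.randint(0, 1) for _ in range(n_bits)]
    velocity = [rng.uniform(vmin, vmax) for _ in range(n_bits)]
    return position, velocity


def update_particle(x, v, pb, gb, w, c1, c2, vmin, vmax, rng):
    """One binary PSO update of velocity and position for one particle."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vel = w * v[d] + c1 * r1 * (pb[d] - x[d]) + c2 * r2 * (gb[d] - x[d])
        vel = max(vmin, min(vmax, vel))  # assumed use of vmin and vmax
        new_v.append(vel)
        # A bit becomes 1 with probability S(vel), otherwise 0.
        new_x.append(1 if rng.random() < sigmoid(vel) else 0)
    return new_x, new_v


# Hypothetical settings for illustration only.
rng = random.Random(0)
N_BITS = 17  # one bit per candidate parameter, as in the text
W, C1, C2, VMIN, VMAX = 0.9, 2.0, 2.0, -4.0, 4.0

x, v = init_particle(N_BITS, VMIN, VMAX, rng)
pb, gb = list(x), [1] * N_BITS  # placeholder best positions
x, v = update_particle(x, v, pb, gb, W, C1, C2, VMIN, VMAX, rng)
print(x)
```

In a full run, the fitness of each new position would be the accuracy of the C4.5 tree built from the selected parameters, and pb and gb would be refreshed from those fitness values before the next update.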
The result of the data mining step is the fastest particle to achieve the best accuracy. The decision tree of that particle (the best decision tree) is applied in The Application of Toddler Nutritional Status. The flow of the application is analyzed using the flowchart in Figure 5.
1) The user selects the "check nutritional status" menu on the main page.
2) The application displays a form to check the H/A nutritional status of toddlers.
3) The user inputs the toddler's data into the form.
4) The application processes the toddler data using the decision tree.
5) The application displays the result (the H/A nutritional status of the toddler).
6) The user selects the "stunting info" menu.
7) The application displays a page about stunting, such as the definition of stunting, its causes, effects, and recommendations (one of them is a balanced food menu for toddlers).
8) The user selects the "balanced food menu" menu.
9) The application displays a page about a balanced food menu recommended by a nutritionist.

Design
At this stage, the author designs the database, menu structure, and interface of the application. The database design is used as a reference when creating the data storage of the application. The menu structure and interface are designed so that the application is user-friendly and useful.

Implementation
At this stage, the application is developed based on the preceding analysis and design. The best decision tree is interpreted by converting the decision tree details from Matlab into if-else rules in the source code of the web-based application. These if-else rules are used to predict the H/A nutritional status of toddlers.
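The conversion of a decision tree into if-else rules can be sketched as follows. This Python example is only an illustration of the idea: the nested-dict tree layout, the feature names, the thresholds, and the class labels "A" and "B" are all made up and have no clinical meaning; they are not the tree produced in this study.

```python
def tree_to_rules(node, indent=0):
    """Render a nested-dict decision tree as indented if/else lines."""
    pad = "    " * indent
    if "label" in node:  # leaf node: emit the predicted class
        return [f"{pad}return {node['label']!r}"]
    lines = [f"{pad}if {node['feature']} <= {node['threshold']}:"]
    lines += tree_to_rules(node["left"], indent + 1)
    lines.append(f"{pad}else:")
    lines += tree_to_rules(node["right"], indent + 1)
    return lines


def predict(node, record):
    """Follow the same branches the generated rules would follow."""
    while "label" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


# Toy tree with invented features, thresholds, and labels.
toy_tree = {
    "feature": "feature_1", "threshold": 24,
    "left": {"feature": "feature_2", "threshold": 75.0,
             "left": {"label": "A"}, "right": {"label": "B"}},
    "right": {"label": "B"},
}

rules = tree_to_rules(toy_tree)
print("\n".join(rules))
print(predict(toy_tree, {"feature_1": 18, "feature_2": 80.0}))
```

Each internal node becomes one if/else pair and each leaf becomes one return statement, so the generated rules reproduce the tree's predictions exactly.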

Testing
The testing data make up 10% of the data, or 363 records. A confusion matrix is used to test the accuracy of the C4.5 algorithm and PSO. The web-based application is tested with two methods, namely black box testing and the User Acceptance Test (UAT).
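The test-set size and the accuracy calculation from a confusion matrix can be sketched in Python as follows. The matrix below is hypothetical: its individual counts are invented, and only its total (363) and its diagonal sum (343) are chosen so that the result matches the 94.49% reported in this paper. The count of 343 correct predictions is inferred from that percentage, not reported in the text.

```python
def accuracy_percent(confusion):
    """Accuracy = correctly classified records / all records, in percent."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


n_clean = 3629                  # records left after pre-processing
n_test = round(0.10 * n_clean)  # 10% testing share of the 90:10 split

# Hypothetical 4-class matrix (rows: actual class, columns: predicted).
toy_confusion = [
    [20, 3, 0, 0],
    [4, 50, 5, 0],
    [0, 6, 270, 1],
    [0, 0, 1, 3],
]

acc = accuracy_percent(toy_confusion)
print(n_test, f"{acc:.2f}%")
```

The diagonal holds the records whose predicted class equals the actual class, so dividing its sum by the total number of testing records gives the accuracy.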

RESULT AND DISCUSSION
The result of the data mining step (C4.5 and PSO) after the 100th iteration is summarized as follows. Based on this result, gb (the largest pb, or the best accuracy) is 94.49%. The fastest particle to achieve the best accuracy is the 7th particle, which reached this accuracy after just 15 iterations. Other particles (the 6th particle, the 13th particle, etc.) reached the same accuracy later than the 15th iteration. The 7th particle obtains 94.49% accuracy when the particle position is 00001100000101100 (5 parameters used). These parameters are:
1) B4R4 is gender.
2) UMUR_BULAN is age (in months).
3) B27 is the history of being treated for malnutrition (yes or no).
4) F02B is height.
5) F02C is the way to measure height (standing or lying down).
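Reading the selected parameters off the particle position can be sketched in a few lines of Python. The helper name `selected_indices` is an assumption, and the mapping from bit index to parameter name depends on the column order of the dataset, which is not given here, so only the indices are shown.

```python
def selected_indices(position):
    """1-based indices of the bits set to 1 in a binary particle position."""
    return [i for i, bit in enumerate(position, start=1) if bit == "1"]


best_position = "00001100000101100"  # position of the 7th particle
chosen = selected_indices(best_position)
print(len(best_position), len(chosen), chosen)
```

The position has 17 bits, one per candidate parameter, and exactly 5 of them are set, which agrees with the 5 parameters listed above.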
In forming the decision tree (using the C4.5 algorithm), the B27 parameter was never chosen as a tree branch. Only 4 parameters are used, namely B4R4, UMUR_BULAN, F02B, and F02C. These parameters are the most important and influential in determining the H/A nutritional status of toddlers, as PSO works as a feature selector that avoids unrelated parameters that can reduce the classifier performance.
Based on the interview with the nutritionist, the influence of the gender parameter is that boys are taller than girls. The influence of the age parameter is that height increases with age. The influence of the height parameter is that the closer the height is to the optimal value, the more likely the H/A nutritional status falls in the normal category. The influence of the way the height is measured is that a height measured while standing is 0.7 cm shorter than one measured while lying down.
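Why a selected parameter may still never appear in the tree can be illustrated with the gain ratio criterion of C4.5: a parameter whose split carries no information about the class has a gain ratio of zero and is not chosen as a branch. The following Python sketch handles categorical parameters only (C4.5 also handles continuous ones through thresholds, which is omitted here), and the toy values and labels are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of a categorical split divided by its split info."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / n * math.log2(len(g) / n)
                      for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


# Toy data, invented for illustration.
labels = ["short", "short", "normal", "normal"]
informative = ["a", "a", "b", "b"]    # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # says nothing about the class

print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```

At every node the tree builder picks the parameter with the highest gain ratio, so a parameter that never ranks first at any node, as with B27 here, does not appear in the final tree.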
The following is the detailed calculation of the best accuracy reached using the confusion matrix.
From the results of UAT testing by five nutritionists (Figure 6), it can be concluded that the application features are in accordance with the science of nutrition. These features are Nutritional Status Checking (QN2), Stunting Info (QN4), and Balanced Food Menu Info (QN5). The application is also considered easier for predicting toddler nutritional status (stunting) than the manual method, which is calculating the Z-Score by hand (QN3). The application also has an interface that is easy to understand (QN1), its features are quite in accordance with the needs (QN6), and overall it is classified as good by the nutritionists (QN7).
From the results of UAT testing by five members of the general public (Figure 7), it can be concluded that the public can obtain information about stunting (QP4) and about a balanced nutritious food menu (QP5) from the application. The general public can also check the toddler nutritional status (stunting) (QP2) more easily than with the manual method, which is visiting a health center (QP3). The application also has an interface that is easy to understand (QP1), its features are in accordance with the needs (QP6), and overall it is classified as good by the general public (QP7).

CONCLUSION
Based on the research that has been done, several conclusions can be drawn. The accuracy of the C4.5 algorithm and PSO in classifying the H/A nutritional status of toddlers is 94.49% with a 90:10 ratio and 4 parameters. Of the 19 parameters selected, there are 4 parameters that reach 94.49% accuracy, namely the gender, the age (in months), the height, and the way the height is measured. These parameters are the most important and influential in determining the H/A nutritional status of toddlers, as PSO works as a feature selector that avoids unrelated parameters that can reduce the classifier performance. The influence of the gender parameter is that boys are taller than girls. The influence of the age parameter is that height increases with age. The influence of the height parameter is that the closer the height is to the optimal value, the more likely the H/A nutritional status falls in the normal category. The influence of the way the height is measured is that a height measured while standing is 0.7 cm shorter than one measured while lying down. The Application of Toddler Nutritional Status can predict the H/A nutritional status of toddlers according to the decision tree produced by the C4.5 algorithm and PSO. The application can also provide information about stunting and recommendations for a balanced nutritious food menu for toddlers. Based on the black box testing, the application runs as expected and in accordance with the analysis and design. From the User Acceptance Test (UAT) with the nutritionists, the application is in accordance with the science of nutrition and is considered easier for predicting the H/A nutritional status of toddlers than the manual method (Z-Score). The application also has an interface that is easy to understand, its features are quite in accordance with the needs, and it is classified as good.
From the User Acceptance Test (UAT) with the general public, the application can predict the H/A nutritional status of toddlers more easily than the manual method (visiting a health center), and it can also provide information about stunting. The application also has an interface that is easy to understand, its features are in accordance with the needs, and it is classified as good.