The Utilization of Naive Bayes and C.45 in Predicting the Timeliness of Students' Graduation

One measure of a college's success is a high rate of on-time graduation every year. The timeliness of students' graduation can be influenced by several factors. This study aims to determine the profiles of students who graduate on time and late under the graduation predicates set by the institution, and to identify the factors influencing students' graduation. The study uses the Naive Bayes Classifier (NBC) to determine the graduation pattern and a C4.5 decision tree to determine the influencing factors. The NBC calculation in RapidMiner produced profiles of students who graduated on time and late with the predicates less satisfactory, satisfactory, very satisfactory, and cum laude. In the decision tree calculation, the highest gain values were obtained for the IPK3, IPS1, and IPK2 attributes. This research should be developed further by increasing the number of attributes and the amount of data, and a system should be built to predict the timeliness of students' graduation from the patterns produced, so that it can help universities raise their graduation rates every year.


INTRODUCTION
Students are central to evaluating how successfully a tertiary institution implements its study programs. An institution can improve its quality in various ways, including increasing the quota of new students, increasing students' academic and non-academic achievement, and increasing the graduation rate each year. Ngudi Waluyo University, for example, has a low graduation rate in its Pharmacy study program. The ratio of new students to graduating students is very high because many students do not graduate on time. Table 1 compares the number of graduates with the number of new students admitted over the last four years in the Pharmacy study program. Students' graduation is a very influential part of evaluating academic activities in a college. According to [1], an indicator of a college's success is a high graduation rate every year.
Some previous studies that examined students' graduation rates are listed in Table 2.

Table 2. Three previous studies
[Only one row survives: uses the C4.5 method; population: students in 2012; the highest information gain value is found in the parents' occupation attribute, which is used as the root, followed by the area of origin and the type of school of origin.]

Many factors contribute to high or low graduation rates, and these are a problem for colleges. In this research, data from the Pharmacy study program at Ngudi Waluyo University for the years 2012, 2013, 2014, and 2015 are analyzed. The analysis measures graduation by predicting whether students graduate on time or late, based on the cumulative achievement index in the second, third, and fourth semesters. Graduation is classified by predicate (satisfactory, very satisfactory, and cum laude) together with several other attributes: NIM, name, gender, mathematics course score, PMB (school enrollment) test score, place of origin and birth, previous school, and parents' or guardians' occupation. The C4.5 and Naive Bayes algorithms are used not only to compare the accuracy of the two algorithms but also to answer two questions: what kind of student profile graduates on time, and what factors influence the timeliness of students' graduation? The aim is to provide information to the institution so that it can increase its efforts to encourage students to graduate sooner. Graduating on time also benefits the students themselves.

Fountain of Informatics Journal

Previous Research
Many previous studies have analyzed graduation prediction using various methods; those relevant to this study are listed in Table 2. Nurliana [1] found that the NBC method achieved the best accuracy, up to 76.67%, with the most significant attributes being school of origin, the achievement index in the first to third years, and the cumulative achievement index (GPA). That study recommends looking for algorithms other than NBC that have good accuracy and combining the most significant attributes to determine the right class [1].
Romadhona [2] found that the highest information gain value belongs to the fourth-semester achievement index (IPS-4), with a value of 0.340, which makes this attribute eligible as the root. That study recommends increasing the number of training records in subsequent studies to obtain better accuracy [2]. Subsequent research by Indah Puji Astuti [3] found that the highest information gain value belongs to the parents' occupation attribute, which is used as the root, followed by the attributes of region and type of school of origin. In that study, the C4.5 algorithm has an accuracy of 82%. It recommends looking for other factors in determining students' graduation in addition to students' personal data, for example academic factors, family economic conditions, and psychological factors [3].
The present research differs from the studies above in its attributes, which extend those of the previous studies, and in its place of study. In this study, the graduation attribute is extended with the graduation predicates less satisfactory, quite satisfactory, satisfactory, very satisfactory, and cum laude. The data are processed with Naïve Bayes and the C4.5 decision tree. The two algorithms are combined according to their characteristics: Naïve Bayes predicts future outcomes from students' graduation patterns, while the C4.5 decision tree identifies the most significant attributes in determining graduation.

Prediction of Timeliness of Students' Graduation
This study's framework is based on an academic phenomenon at Ngudi Waluyo University: an imbalance between the number of students entering and graduating, particularly in the Pharmacy study program. An appropriate solution is needed to bring the academic process into alignment. The relevant inputs are the PMB and first-semester mathematics exam scores and the lecture results of the first four semesters, supplemented with the attributes of gender, year of birth, type of school of origin, place of origin, and parents' work. The attributes used are described as follows:

The Attribute of Gender
Gender, male (L) or female (P), is used in determining the level of graduation. According to [4], gender is one of the variables that can be used to determine students' level of graduation. Learning achievement is also directly influenced by gender, with women achieving more than men [4].

The Attribute of Mathematics Value
The mathematics grade in the Pharmacy study program is used to predict whether students graduate on time. First-semester mathematics is the basis of the courses in the Pharmacy study program. The grade scale is A, B, C, D, and E.

The Attribute of School Enrollment Test
The student entrance examination score is also used to predict whether students graduate. The score is obtained when prospective students enroll in the college; the test material is the Academic Potential Test. Scores range from 50 to 90. According to [5], entrance exam results can affect the success of students' studies at a college.

The Attributes of Semester Achievement Index 1-4 and GPA 2-GPA 4
According to [1], the semester achievement index and GPA carry the most weight in determining students' graduation; in the NBC algorithm, the achievement index has the highest value in data processing. The higher a student's semester achievement index and GPA, the higher the student's chance of graduating.

The Attribute of the Place of Birth
Students' place of birth is used in determining graduation. It indicates the region students come from: Java, Lombok, Sumatra, Bali, Kalimantan, Maluku, Riau, or East Timor. Regional origin is also used in [6], whose results show that it is a determining variable in predicting graduation.

The Attribute of Students' Age
The age attribute is the student's age at the time of registration.

The Attribute of Type of Origin School
Schools of origin are grouped into three types: high school, vocational high school, and MAN (Madrasah Aliyah Negeri). According to [1], the type of school has the most significant influence in increasing the accuracy of NBC.

The Attribute of Parents' Work
The parents' work attribute describes the economic level of students, that is, whether their families have a steady income. Its values are teacher, private employee, civil servant, self-employed, fisherman, and farmer.

The Attributes of Graduating on time or not
Graduation is divided into two classes: graduating on time, where undergraduate students complete their studies in at most the seventh or eighth semester, and graduating late, where students take more than eight semesters (Academic Guide).
Graduates are also categorized by predicate: unsatisfactory, quite satisfactory, satisfactory, very satisfactory, and cum laude. The graduation predicates are categorized in Table 3.
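As a minimal sketch of how the timeliness rule could be encoded, the Python functions below treat completion within eight semesters as on time. The function names and the two-letter label format (a timeliness letter followed by a predicate letter, as in the label OC) are illustrative assumptions, not the authors' implementation.

```python
def timeliness(semesters_taken: int, limit: int = 8) -> str:
    """Return 'on time' if the study was completed within `limit` semesters."""
    if semesters_taken < 1:
        raise ValueError("semesters_taken must be a positive integer")
    return "on time" if semesters_taken <= limit else "late"


def target_label(semesters_taken: int, predicate_code: str) -> str:
    """Combine timeliness (O = on time, L = late) with a predicate letter."""
    prefix = "O" if timeliness(semesters_taken) == "on time" else "L"
    return prefix + predicate_code


# A student who finishes in 8 semesters with predicate code C is labelled OC.
example_label = target_label(8, "C")
```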

Naive Bayes Classifier (NBC)
NBC, also called Bayesian classification, is a statistical classification algorithm based on Bayes' theorem that predicts the probability of class membership. The main feature of NBC is a very strong assumption that conditions or events are independent [8]. NBC has been shown to have high accuracy and speed when applied to large databases [9]. Data mining is the process of finding meaningful relationships and patterns in large sets of stored data using statistical and mathematical techniques [7]. Data mining has several models, including classification models such as NBC. Bayes' theorem is as follows:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

where:
X = data with an unknown class
H = hypothesis that data X belongs to a specific class
P(H | X) = probability of hypothesis H given condition X (posterior probability)
P(H) = probability of hypothesis H (prior probability)
P(X | H) = probability of X given the conditions in hypothesis H
P(X) = probability of X
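To make Bayes' theorem concrete, the following Python sketch computes the posterior P(H | X) as P(X | H) P(H) / P(X) for two hypotheses, with P(X | H) formed as a product of per-attribute probabilities under the independence assumption. The priors use the class sizes reported in this study (147 on time, 81 late) as a two-class simplification; the per-attribute probabilities and the function names are hypothetical.

```python
from math import prod


def naive_likelihood(attribute_probs):
    """P(X | H) under the naive assumption: product of per-attribute P(x_k | H)."""
    return prod(attribute_probs)


def posterior(priors, likelihoods):
    """Apply Bayes' theorem to every hypothesis H and return P(H | X)."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # P(X), by the law of total probability
    if evidence == 0:
        raise ValueError("P(X) is zero; no hypothesis explains the data")
    return {h: value / evidence for h, value in joint.items()}


# Priors from the class sizes in this study (two-class simplification).
priors = {"on time": 147 / 228, "late": 81 / 228}
# Hypothetical per-attribute conditional probabilities for one student X.
likelihoods = {
    "on time": naive_likelihood([0.6, 0.5]),
    "late": naive_likelihood([0.2, 0.4]),
}
result = posterior(priors, likelihoods)
predicted = max(result, key=result.get)  # class with the largest posterior
```

Because P(X) is the same for every hypothesis, the predicted class could equally be chosen by comparing the unnormalized products P(X | H) P(H).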

Algorithm C4.5
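The C4.5 algorithm is a classification algorithm that uses a decision tree model [3]. The attribute that becomes the root of the tree is the one with the highest gain value among the available attributes. Entropy and gain are calculated with the following equations:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{j=1}^{m} \frac{|S_j|}{|S|}\,\mathrm{Entropy}(S_j)$$

where S is the set of cases, n is the number of classes, p_i is the number of cases in class i divided by the total number of cases, A is an attribute, m is the number of values of A, and S_j is the subset of S that takes the j-th value of A.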
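The entropy and gain calculations can be sketched in a few lines of Python. This is an illustrative sketch using the standard information gain definition, in which p_i is the share of records in class i; the function names and the dictionary-based record layout are assumptions. The example entropy uses the study's class sizes (147 on time, 81 late) as a two-class simplification of the actual target.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over the classes present."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


# Entropy of a two-class target with 147 on-time and 81 late graduates.
labels = ["on time"] * 147 + ["late"] * 81
h = entropy(labels)  # roughly 0.939 bits
```

An attribute that separates the classes perfectly has a gain equal to the entropy of the target, while an attribute unrelated to the target has a gain of zero.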

Research Phase
This research has several stages, which are illustrated in Figure 1.

Preprocessing Phase
The student data obtained were divided into two categories: personal data and academic data. Personal data consisted of students' biodata and previous educational background, while academic data consisted of grades and achievement indices during the learning process. The attributes of the student data are listed in Table 4.

Transformation Phase
Some attributes of students' data had empty values, so they needed to be removed and the value of the attributes was made simpler to facilitate calculations in data mining. Table 5 and Table 6 are the attributes and their values after the transformation stage.
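A transformation of this kind might look like the Python sketch below, which drops records with empty values and maps a numeric score to a coarser category. It assumes that incomplete records, rather than whole attributes, are removed; the bin edges, category names, and function names are hypothetical and are not the values in Table 5 or Table 6.

```python
def drop_incomplete(records, required):
    """Keep only records in which every required attribute is present and non-empty."""
    return [
        record for record in records
        if all(record.get(key) not in (None, "") for key in required)
    ]


def bin_score(score, edges=(60, 70, 80)):
    """Map a numeric score to a coarse category (edges are hypothetical)."""
    categories = ("low", "medium", "high", "very high")
    for edge, category in zip(edges, categories):
        if score < edge:
            return category
    return categories[-1]


raw = [
    {"gender": "P", "pmb": 75},
    {"gender": "", "pmb": 62},  # empty value, so this record is removed
]
clean = drop_incomplete(raw, ["gender", "pmb"])
simplified = [{**record, "pmb": bin_score(record["pmb"])} for record in clean]
```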

Data Mining
This study used a classification model, so the data already had a target class. The target class was graduating on time or late with the predicates less satisfactory, satisfactory, very satisfactory, and cum laude, stored in the graduation target attribute. The algorithms used were Naive Bayes and C4.5, and the more accurate of the two was sought. The tool used was RapidMiner version 9.5. The model design in RapidMiner is shown in Figure 2. Data X, whose class was not yet known, are shown in Table 7.
How the Naive Bayes algorithm worked [10]:
1) First, read the training data in Table 6.
2) Second, calculate the mean and standard deviation of the predictor attributes in each class. The result of this stage is shown in Table 8.
3) Third, count the number of cases in each class. The result of this stage is shown in Table 9.
Table 10.
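The steps above can be sketched for numeric attributes as follows. The per-class mean, standard deviation, and case count follow the steps listed; the final scoring with a Gaussian density is the standard Gaussian Naive Bayes approach and is an assumption here, since the later steps are not described. The data values and function names are hypothetical, and the sketch assumes every class has at least two rows with non-zero spread.

```python
from collections import defaultdict
from math import exp, pi, sqrt
from statistics import mean, stdev


def fit(rows, labels):
    """Steps 2-3: per-class mean/std of each attribute, plus the class share."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        columns = zip(*members)  # one tuple of values per attribute
        stats = [(mean(col), stdev(col)) for col in columns]
        model[label] = {"prior": len(members) / len(rows), "stats": stats}
    return model


def gaussian(x, mu, sigma):
    """Normal density used as P(x | class) for a numeric attribute."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def predict(model, row):
    """Return the class with the largest prior times product of densities."""
    scores = {}
    for label, params in model.items():
        score = params["prior"]
        for x, (mu, sigma) in zip(row, params["stats"]):
            score *= gaussian(x, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical (IPS1, IPK2) pairs for six graduates.
rows = [(3.6, 3.5), (3.4, 3.7), (3.8, 3.6), (2.4, 2.6), (2.7, 2.3), (2.5, 2.8)]
labels = ["on time"] * 3 + ["late"] * 3
model = fit(rows, labels)
```

With many attributes the product of densities can underflow, so practical implementations usually sum log-probabilities instead.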

Algorithm C4.5
How the C4.5 decision tree algorithm worked [11]:
1) First, read the training data in Table 6.
2) Second, calculate the total entropy of the timeliness label.
3) Third, compute the information gain of each attribute and choose the largest value. The entropy and gain calculation results are shown in Table 11.
4) Fourth, GPA 3, with a gain value of 1.69, had the largest gain, so this attribute became the root node.
5) Fifth, create branches below the root node in order of high to low gain values and prune attributes with low values.
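The selection of the root node can be sketched as a ranking of attributes by information gain. The sketch below is a simplification: it ranks attributes once over the whole training set and drops those below a hypothetical `min_gain` threshold, whereas a full C4.5 implementation recomputes the criterion inside each branch and normally uses the gain ratio. The records, threshold, and function names are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, attribute, target):
    """Information gain obtained by splitting rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def rank_attributes(rows, attributes, target, min_gain=0.1):
    """Sort attributes from high to low gain and drop those below min_gain."""
    scored = [(name, gain(rows, name, target)) for name in attributes]
    kept = [pair for pair in scored if pair[1] >= min_gain]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical discretized records; the first ranked attribute is the root.
rows = [
    {"IPK3": "high", "gender": "P", "label": "on time"},
    {"IPK3": "high", "gender": "L", "label": "on time"},
    {"IPK3": "low", "gender": "P", "label": "late"},
    {"IPK3": "low", "gender": "L", "label": "late"},
]
ranked = rank_attributes(rows, ["gender", "IPK3"], "label")
root = ranked[0][0]
```

In this toy data IPK3 separates the two classes completely, so it is kept as the root, while gender carries no information and is pruned.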

Evaluation
After the data mining process was completed, a model/pattern for predicting the timeliness of students' graduation was obtained from the two algorithms, NBC and C4.5. The Naive Bayes calculation produces graduation patterns based on the attributes used, while the C4.5 decision tree identifies the most significant attributes in determining students' graduation.

RESULT AND DISCUSSION
The data used in this study covered 147 graduates who graduated on time and 81 who graduated late. The training data comprised 228 graduates, and five graduates were used as testing data to determine the accuracy of the NBC and C4.5 algorithms. Fifteen attributes were used as parameters, of which 14 were predictors and 1 was the target.

Table 12. NBC calculation results
True  LB  OC  LA  OD  LC  LD  OB
LB    29  13   4   0  12   0   2
OC     2  84   0   4  …

The NBC experiment results in Table 12 show that the label on-time C (graduated on time with a predicate of very satisfactory) had a higher count than the other labels. Table 13 shows the students' profiles based on the NBC calculation.
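Accuracy and class precision can be read off a confusion matrix such as Table 12. The sketch below assumes that rows are predicted labels and columns are true labels; that orientation is an assumption about Table 12. The two-class matrix is hypothetical, and only the LB row uses values from Table 12.

```python
def accuracy(matrix):
    """Share of correct predictions: diagonal total divided by grand total."""
    correct = sum(row.get(label, 0) for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def class_precision(matrix, label):
    """Share of the predictions of `label` that were actually `label`."""
    row = matrix[label]
    return row.get(label, 0) / sum(row.values())


# Hypothetical matrix: matrix[predicted][true] = number of cases.
toy_matrix = {
    "on time": {"on time": 8, "late": 2},
    "late": {"on time": 1, "late": 9},
}
toy_accuracy = accuracy(toy_matrix)  # 17 correct out of 20

# The LB row of Table 12, read as predictions of LB split by true label.
lb_row = {"LB": {"LB": 29, "OC": 13, "LA": 4, "OD": 0, "LC": 12, "LD": 0, "OB": 2}}
lb_precision = class_precision(lb_row, "LB")  # 29 out of 60
```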

Naive Bayes
The results in Table 14 are rendered as a tree, as shown in Figure 3.

Figure 3. Result Decision Tree
In the decision tree in Figure 3, the selected attributes were IPK3, IPK2, and IPS1, while the other attributes were directly trimmed from the decision tree. It could be seen that with the amount of data and types of data that already existed, only a few attributes were needed to get the output class from the dataset.
In the decision tree calculation in Table 14, the root node was selected based on the highest gain value. In the RapidMiner calculation, the highest gain value was obtained for the IPK3 attribute, followed by the IPS1 and IPK2 attributes, leading to the status of graduating on time or late with the title A, B, C, or D. IPK3 is the cumulative achievement index (GPA) of students in the third semester, IPS1 is the semester achievement index of students in the first semester, and IPK2 is the cumulative achievement index of students in the second semester. Based on the decision tree in Figure 3, the attributes that appear in the tree were IPK3, IPS1, and IPK2; the other attributes (gender, mathematics grade, PMB score, origin, school, and parents' work) were pruned because they were not selected as attributes in the decision tree. The three attributes with the highest gain in the decision tree calculation were the factors that affected the timeliness of students' graduation. Meanwhile, the Naïve Bayes calculation of graduation status for the testing data in Table 14 showed that the students graduated on time with a very satisfactory predicate.
The Naïve Bayes calculation in RapidMiner produced the students' profiles shown in Table 14. These profiles can be used as a pattern to determine whether students will graduate on time or late with the title A, B, C, or D. In this study, using the two algorithms, Naïve Bayes and decision tree, gives more complete results than previous studies. Naïve Bayes is used for future predictions from the patterns generated from graduates, and the C4.5 decision tree is used to determine the attributes that play the largest role in students' graduation.

CONCLUSION
Whether students graduate on time or late can be predicted, and the factors influencing it can be identified. Using the Naive Bayes and decision tree algorithms, this study finds that the most influential factors in graduation are the cumulative achievement index in the second and third semesters and the semester achievement index in the first semester. Graduation patterns can be identified from the time students enter the university until their second year. With this pattern, the study program can prepare students to graduate on time with the specified graduation predicates. With the attributes that play a role in graduation, the study program can also improve students' competence early so that graduation rates become higher each year. The results of this study should be developed further by increasing the number of attributes and the amount of data, and a system should be built to predict the timeliness of students' graduation from the patterns produced, in order to help the university raise its graduation rate each year.