Identification of Tuberculosis Patient Characteristics Using K-Means Clustering

In Indonesia, tuberculosis remains one of the major unresolved health problems. Indonesia ranks second in the world by number of tuberculosis cases. The purpose of this research is to study how K-means clustering can be applied to tuberculosis patient treatment data in order to identify the characteristics of tuberculosis patients. The K-means clustering results were validated by gene shaving and the silhouette coefficient, which together indicate the optimum number of clusters. K-means clustering divided the tuberculosis patients into four groups based on their characteristics. The groups were distinguished by category of disease (pulmonary TB, extrapulmonary TB, or both), patient age, and treatment outcome.


INTRODUCTION
Tuberculosis is an infectious disease caused by Mycobacterium tuberculosis. The WHO estimates that each year there are 8.7 million new cases of tuberculosis and 1.4 million deaths from the disease. In Indonesia, tuberculosis remains one of the major unresolved health problems. Indonesia ranks second in the world by number of tuberculosis cases; India, Indonesia, and China are the countries with the most cases [1]. According to data from Persahabatan Hospital, the hospital receives about 1500 new patients per year. Tuberculosis is controlled through the DOTS strategy (Directly Observed Treatment, Short-course), which is carried out at a clinic or hospital over 6-9 months [2].
Clustering has been applied to patient data to group the variables of an Electronic Health Record (EHR), comprising 24 clinical laboratory test values and 60 clinical concepts from clinical records, using single-linkage agglomerative clustering [3]. Clustering has also been applied to tuberculosis patient data in Ethiopia based on the patients' spatial data. The clustering results served as input for planning a more effective national tuberculosis control program by identifying groups and targeting interventions [4]. K-means clustering has also been applied to identify subgroups of patients based on their responses to the Patient-Physician Discordance Scales (PPDS), health status, and clinical visits [5].
The purpose of this research is to study how K-means clustering can be applied to data from the 6-9 month treatment of tuberculosis patients. The clustering is expected to identify subgroups of patients based on patient characteristics and the results of their examinations during treatment.

a. Research Data
The data used are the tuberculosis administration reports of Persahabatan Hospital, East Jakarta. The patient data consist of progress reports on the implementation of the DOTS strategy over a period of 6-9 months. This study used the data of 235 patients. The data consist of 11 variables: three sputum examination results (at the beginning of treatment, in the second month, and at the end of treatment), three weight measurements (at the beginning of treatment, in the second month, and at the end of treatment), sex, supervision of medication intake, age, type of tuberculosis (pulmonary or extrapulmonary), and category of the intensive phase.

b. Research Methods
Each tuberculosis patient came to the hospital for the examinations required by the DOTS strategy, at which the sputum examination result and weight were recorded. Some patient records are incomplete in several parameters, either because the parameter was not measured at the time of the examination or because the officer neglected to record it on the patient's treatment card. The method used to address this problem is shown in Figure 1. After the examination data of the tuberculosis patients are entered into the system, linear interpolation is applied to ensure that every parameter has a value. This follows Hripcsak, who interpolated incomplete clinical laboratory test results and concepts from patient medical records [6].
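The interpolation step can be illustrated with a minimal Python sketch. It assumes that a missing measurement is stored as None within a patient's ordered series (for example the three weight measurements) and that gaps at either end are filled with the nearest known value; neither detail is stated in the study. The function name and the sample weights are hypothetical.

```python
from typing import List, Optional


def interpolate_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None gaps by linear interpolation between the nearest known values.

    Assumption: gaps before the first or after the last known value are
    filled with that nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)  # nothing to interpolate from
    filled = list(values)
    for i in range(len(values)):
        if filled[i] is not None:
            continue
        # Nearest originally known neighbours on each side of the gap.
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


# Hypothetical weights (kg) at the start, second month, and end of treatment.
weights = [50.0, None, 54.0]
print(interpolate_linear(weights))  # [50.0, 52.0, 54.0]
```

Interpolating only from originally known values, rather than from already filled ones, keeps the result independent of the order in which gaps are visited.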
After the data are interpolated, K-means clustering is applied. It consists of three steps: 1. Determine the centroid (midpoint) of each cluster at random. 2. Compute the distance from each object to each centroid. The K-means algorithm repeats its steps until the clusters are stable (no object changes cluster) [7].

Figure 1. Research Method
The dissimilarity, or distance, between an object and a centroid is calculated using the Euclidean distance between A and B:
$$d(A, B) = \sqrt{\sum_{i} (A_i - B_i)^2} \qquad (1)$$
where A and B are the variable values between which the distance is calculated, the sum runs over the variables, and d is the distance from each object to the centroid.
3. Assign each object to the cluster at minimum distance. The performance of the K-means clustering algorithm was tested with gene shaving. There are three kinds of cluster variance: the within variance ($V_W$), the between variance ($V_B$), and the total variance ($V_T$) [8]. In the within variance, a is the number of attributes in the data (number of columns), n is the number of data (number of rows), x is a data value, and $m_j$ is the mean of the data for attribute j. $V_W$ measures the spread of the data within a cluster (internal homogeneity). The smaller the value of $V_W$, the better the cluster, because it shows that the cluster members are coherent.
In the between variance, a is the number of attributes in the data (number of columns), $m_j$ is the mean of the data for attribute j, and m is the value of the cluster centroid. $V_B$ measures the spread of the data between clusters (external homogeneity). The larger the value of $V_B$, the better the cluster that is formed.
In the total variance, a is the number of attributes in the data (number of columns), n is the number of data (number of rows), x is a data value, and m is the value of the cluster centroid.
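A form of the three variances consistent with the symbol definitions above can be written as follows. This is a sketch that assumes the usual gene-shaving formulation, in which m is the overall mean of the cluster (the text calls it the cluster centroid) and $x_{ij}$ is the value of attribute j in row i; the index notation is introduced here for illustration.

```latex
V_W = \frac{1}{a} \sum_{j=1}^{a} \left[ \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - m_j)^2 \right],
\qquad
V_B = \frac{1}{a} \sum_{j=1}^{a} (m_j - m)^2,
\qquad
V_T = \frac{1}{n a} \sum_{i=1}^{n} \sum_{j=1}^{a} (x_{ij} - m)^2
```

Under these assumed definitions the total variance decomposes as $V_T = V_W + V_B$, because the deviations of each attribute's values from their own mean $m_j$ sum to zero.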
Once the three types of cluster variance have been obtained, the variance ratio $R^2$ of the between variance to the within variance can be calculated. The value of $R^2$ indicates how coherent the members of a cluster are. A larger $R^2$ indicates a better result; it grows as $V_B$ becomes larger and $V_W$ becomes smaller. The values of $R^2$ are stored in a K x K matrix, sized according to the number of clusters, and the average of all $R^2$ values is stored for the next calculation. The next step is to calculate the gap. The largest cluster gap value indicates the optimum number of clusters k.
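The variance calculation can be sketched in Python for a single cluster. The sketch assumes that the cluster is a list of rows with equal length, that m is the overall mean of the cluster, and that the ratio is taken literally as $V_B / V_W$; gene shaving is often stated instead with the percent form $100 \cdot V_B / V_T$, and the study may have used either. The function names are hypothetical, and the gap calculation is not covered.

```python
from typing import List, Tuple


def cluster_variances(rows: List[List[float]]) -> Tuple[float, float, float]:
    """Return (V_W, V_B, V_T) for one cluster of n rows and a attributes."""
    n, a = len(rows), len(rows[0])
    # m_j: mean of each attribute (column) over the rows of the cluster.
    col_means = [sum(r[j] for r in rows) / n for j in range(a)]
    # m: overall mean of the cluster (assumed meaning of the "centroid" value).
    overall = sum(col_means) / a
    v_w = sum((r[j] - col_means[j]) ** 2 for r in rows for j in range(a)) / (n * a)
    v_b = sum((mj - overall) ** 2 for mj in col_means) / a
    v_t = sum((x - overall) ** 2 for r in rows for x in r) / (n * a)
    return v_w, v_b, v_t


def variance_ratio(v_w: float, v_b: float) -> float:
    """Between-to-within ratio; undefined when V_W is zero."""
    return v_b / v_w


# Tiny made-up cluster with two rows and two attributes.
v_w, v_b, v_t = cluster_variances([[1.0, 5.0], [3.0, 7.0]])
print(v_w, v_b, v_t)            # 1.0 4.0 5.0
print(variance_ratio(v_w, v_b))  # 4.0
```

In this example $V_T = V_W + V_B$, which is a quick sanity check on any implementation of the three variances.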
In addition to gene shaving, the silhouette coefficient can be used to validate the clustering results. The silhouette is a popular technique for determining the optimum value of k in K-means clustering. The silhouette formula is as follows:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where a(i) is the average distance between point i and all other points in its own cluster A, and b(i) is the average distance between point i and the points in the cluster closest to cluster A, namely cluster B [9].
The silhouette coefficient can be interpreted using the chart of interval interpretations [10] given in Table 1.
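The silhouette coefficient can be sketched in plain Python. The sketch assumes the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)) with Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster scores 0; the function name and the sample points are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def silhouette_values(points: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> List[float]:
    """Silhouette value of every point; requires at least two clusters."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention for a single-member cluster
            continue
        # a(i): mean distance to the other members of the point's own cluster.
        a_i = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b_i = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a_i, b_i)
        scores.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return scores


# Two well-separated pairs of made-up points.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
vals = silhouette_values(pts, [0, 0, 1, 1])
print(round(sum(vals) / len(vals), 2))  # about 0.9
```

The mean of these values over all points is the quantity compared across different values of k.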

RESULT AND DISCUSSION
Experiments were carried out in four scenarios with different numbers of clusters (k), namely k = 2, 3, 5, and 7, each with 20 iterations so that the K-means clustering converged. The experiments used the MATLAB function kmeans, with the replicates parameter set to the number of iterations and the 'distance' parameter set to 'sqEuclidean', which specifies the squared Euclidean distance for calculating the distance to the centroid.
The K-means clustering results for k = 2, 3, 5, and 7 are plotted in Figure 2. Members of different clusters are shown in different colors, while the same color indicates data assigned to the same cluster.
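The clustering procedure can be sketched in plain Python rather than with the MATLAB function used in the study. The sketch assumes that initial centroids are drawn at random from the data points, that squared Euclidean distance is used, and that the replicate with the lowest total within-cluster squared distance is kept; the function names, the seed, and the sample points are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Result = Tuple[List[int], List[List[float]], float]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans_once(points: Sequence[Sequence[float]], k: int,
                rng: random.Random, max_iter: int = 100) -> Result:
    # Step 1: choose k distinct data points at random as initial centroids.
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Steps 2 and 3: compute distances and assign to the nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # stable: no object changed cluster
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, cost


def kmeans(points: Sequence[Sequence[float]], k: int,
           replicates: int = 20, seed: int = 0) -> Result:
    """Run several random restarts and keep the lowest-cost result."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(replicates))
    return min(runs, key=lambda run: run[2])


# Made-up two-dimensional points forming two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, cost = kmeans(data, k=2)
print(labels, round(cost, 3))
```

Repeating the run from several random starts reduces the chance that a poor initial choice of centroids determines the final clusters.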

Figure 2. Plot of the K-means clustering results
At every stage of the experiments, the within variance, between variance, and total variance were calculated for the K-means clustering results for k = 2, 3, 5, and 7. These three types of cluster variance were used to calculate the cluster variance ratio. The within variance of the K-means clustering results for k = 2, 3, 5, and 7 can be seen in Table 2. The within variance indicates the variance inside a single cluster: the smaller the value, the more coherent, compact, and similar the members of the cluster. Besides the within variance, the experiments also calculated the between variance, which measures the distance between one cluster and another. Table 3 shows the between variance of the clusters at k = 4, and Table 4 shows the total variance of the clusters at k = 4. After the within, between, and total variance values were obtained, the next step was to compute the variance ratio by applying formula 5. The variance ratio at k = 4 can be seen in Table 5. The variance ratio matrix shown in Table 5 was then averaged for the calculation of the cluster gap. At k = 4, the average variance ratio is 81.88423, from which the gap of the clustering results at k = 4 can be calculated. The results of the gap calculation can be seen in Table 6.
In addition to validating the K-means clustering results with gene shaving, the clustering results were also visualized with a silhouette plot. The silhouette measures how well the members are grouped in a cluster, as expressed by the silhouette coefficient: the higher the silhouette coefficient, the better the clusters formed.
The K-means algorithm was run with k = 2, 3, 5, and 7 on the tuberculosis patient treatment data, and the results were validated with gene shaving and the silhouette, yielding the maximum cluster gap and the mean silhouette coefficient used to determine the optimum k. The maximum cluster gap and the mean silhouette of each clustering can be seen in Table 7. From the maximum cluster gap values and mean silhouette coefficients in Table 6, k = 4 is the optimum value. For k = 4, the maximum cluster gap is 11.3214, the largest among the values of k tested. Similarly, the mean silhouette value for k = 4 is 0.5497, the highest among them. This mean silhouette coefficient falls in type 2, which means that a reasonable structure has been found, that is, the clustering is reasonable.
The K-means clustering results for k = 4 divide the tuberculosis patients into groups based on specific characteristics. The four clusters generated in this study are shown in Table 7. The identification of patient characteristics shows that the K-means clustering algorithm can be applied to tuberculosis patient treatment data. The clusters reveal patterns among the patient groups in several variables, such as the category of tuberculosis (pulmonary TB, extrapulmonary TB, or both), patient age, and treatment outcome.

CONCLUSION
Applying K-means clustering to tuberculosis patient treatment data shows that k = 4 is the optimum number of clusters. This was validated by gene shaving and the silhouette coefficient. Gene shaving produces the cluster gap value, computed from the within variance, between variance, total variance, and variance ratio. The silhouette produces the mean silhouette value. The optimum k is the one at which the cluster gap is maximal and the mean silhouette value is highest. The results of this study show that the K-means clustering algorithm can divide tuberculosis patients into groups based on their characteristics, namely category of disease (pulmonary TB, extrapulmonary TB, or both), patient age, and treatment outcome. These groups can serve as a consideration in decision-making on tuberculosis treatment by taking the characteristics of the patient into account.