Village Potential Mapping: Comprehensive Cluster Analysis of Continuous and Categorical Variables with Missing Values and Outliers Dataset in Bogor, West Java, Indonesia

Authors

  • Nafisa Berliana Indah Pratiwi, IPB University
  • Indahwati, IPB University
  • Anwar Fitrianto, IPB University

DOI:

https://doi.org/10.15294/sji.v11i2.3903

Keywords:

Clustering, Mixed dataset, Missing value, Outliers, Village mapping

Abstract

Purpose: This research addresses the need to map village conditions, identify village potential, evaluate the effectiveness of development capability, and address the rural-urban development gap using clustering algorithms. The study employs the village development index (IPD) indicators obtained from the village potential dataset, which comprise various numerical and categorical indicators, to capture both tangible and intangible aspects of village potential. IPD data collection is subject to challenges such as missing data and outliers. The study aims to evaluate how effectively clustering algorithms, with integrated and separate imputation processes, handle these data issues, and to track the development of villages in Bogor Regency, West Java, Indonesia, based on the village potential (PODES) dataset.

Methods: Three clustering algorithms are compared: k-prototype, simple k-medoids, and Clustering of Mixed Numerical and Categorical Data with Missing Values (k-CMM). For the first two algorithms, imputation is carried out as a separate pre-processing step, while k-CMM has an integrated imputation process. Both imputation approaches are tree-based. Cluster evaluation is based on internal and external criteria. Clusters resulting from k-prototype and simple k-medoids are selected using internal validity indices and then compared with k-CMM using external validity indices for several numbers of clusters (k = 3, 4, 5).
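To make the k-prototype step concrete, the following is a minimal sketch of the dissimilarity that k-prototype is generally described as using for mixed data: squared Euclidean distance on the numerical attributes plus a weight times the number of mismatching categorical attributes. The function names, the example attributes, and the weight gamma = 0.5 are illustrative assumptions; the abstract does not state the settings used in the study.

```python
from typing import List, Sequence, Tuple


def kprototype_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float,
) -> float:
    """Squared Euclidean distance on the numerical part plus gamma times
    the count of mismatching categorical attributes (simple matching)."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    prototypes: List[Tuple[Sequence[float], Sequence[str]]],
    gamma: float,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity."""
    distances = [
        kprototype_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical prototypes: (scaled numerical values, categorical values).
prototypes = [
    ([0.0, 0.0], ["yes", "paved"]),
    ([1.0, 1.0], ["no", "dirt"]),
]
# A hypothetical village record is assigned to its closest prototype.
assigned = nearest_prototype([0.9, 1.0], ["no", "dirt"], prototypes, gamma=0.5)
```

The weight gamma controls how much a categorical mismatch counts relative to numerical distance, so its value would need to be tuned to the scale of the numerical indicators.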

Result: Data exploration shows that the IPD dataset of Bogor Regency, West Java, Indonesia contains approximately 5% outliers and six missing values among the selected variables. Tree-based imputation is applied separately for k-prototype and simple k-medoids, and jointly with clustering in k-CMM. The elbow and gap statistic methods indicate that the optimum number of clusters is k = 3. The internal validity indices computed for k-prototype and simple k-medoids likewise indicate that three clusters (k = 3) are optimal. Trials with several numbers of clusters (k = 3, 4, 5) for the three algorithms, evaluated with external validity indices, show that k-prototype with k = 3 performs best and is more stable than the other two algorithms on an IPD dataset containing many outliers.
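As an illustration of how two clusterings of the same villages can be compared, the sketch below computes the Rand index, one common pair-counting index of agreement between partitions. The abstract does not name the external validity indices used, so the choice of the Rand index, the function name, and the example labels are assumptions made only for illustration.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Fraction of observation pairs on which two partitions agree, i.e.
    both place the pair in the same cluster or both separate it."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label sequences must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two observations are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical cluster labels from two algorithms for the same six villages.
labels_first = [0, 0, 1, 1, 2, 2]
labels_second = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
agreement = rand_index(labels_first, labels_second)
```

The index depends only on which observations are grouped together, not on the label names, which is what makes it usable for comparing the output of different algorithms.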

Novelty: This research addresses issues commonly found in mixed datasets, including outliers and missing values, and how to treat them before and during cluster analysis. An improved Gower distance is applied in the medoid-based clustering algorithm, and k-CMM is the first algorithm to integrate the imputation process with cluster analysis, which makes its performance in cluster analysis worth exploring.
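For context on the distance mentioned above, the following is a minimal sketch of the standard Gower distance for mixed records, not the improved variant used in the study, whose details the abstract does not give. Numerical differences are scaled by the attribute range, categorical attributes contribute 0 or 1, and attributes with a missing value (represented here as None, an assumption) are skipped. The function name and example values are hypothetical.

```python
from typing import Optional, Sequence


def gower_distance(
    x: Sequence[object],
    y: Sequence[object],
    is_numeric: Sequence[bool],
    ranges: Sequence[Optional[float]],
) -> float:
    """Average per-attribute dissimilarity over the attributes that are
    observed in both records."""
    total = 0.0
    compared = 0
    for a, b, numeric, value_range in zip(x, y, is_numeric, ranges):
        if a is None or b is None:
            continue  # skip attributes missing in either record
        if numeric:
            # Range-scaled absolute difference; a zero range contributes 0.
            total += abs(a - b) / value_range if value_range > 0 else 0.0
        else:
            total += 0.0 if a == b else 1.0
        compared += 1
    if compared == 0:
        raise ValueError("no attribute is observed in both records")
    return total / compared


# Hypothetical records: one numerical, one categorical, one numerical attribute.
village_a = [10.0, "paved", 3.0]
village_b = [20.0, "dirt", None]  # third attribute is missing
distance = gower_distance(
    village_a,
    village_b,
    is_numeric=[True, False, True],
    ranges=[40.0, None, 10.0],  # ranges are only used for numerical attributes
)
```

Because each numerical difference is divided by the attribute range, a single extreme value can stretch the range and compress all other differences, which is one reason outlier-heavy data may call for a modified distance.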

Article ID

3903

Published

21-05-2024

Section

Articles

How to Cite

Village Potential Mapping: Comprehensive Cluster Analysis of Continuous and Categorical Variables with Missing Values and Outliers Dataset in Bogor, West Java, Indonesia. (2024). Scientific Journal of Informatics, 11(2), 353-366. https://doi.org/10.15294/sji.v11i2.3903