Customer Segmentation Using the Integration of the Recency Frequency Monetary Model and the K-Means Cluster Algorithm

Purpose: This research aims to segment customers of retail companies by integrating the Recency Frequency Monetary (RFM) model with the K-Means cluster algorithm, optimized by the Elbow method. Methods: This study combines several methods. The RFM model was chosen because it is one of the established methods for segmenting customers. The K-Means cluster algorithm was chosen because it is easy to interpret and implement, converges quickly, and adapts well, although it is sensitive to the initial partitioning and to the chosen number of clusters. The RFM model and the K-Means method are combined to classify each category of customers and to determine their level of loyalty. The Elbow method improves the performance of the K-Means algorithm by addressing this weakness: it helps to choose the optimal k value for clustering. Result: This research produces a customer segmentation of 3 clusters with a Sum of Square Error (SSE) value of 25,829.39 and a Calinski-Harabasz Index (CHI) value of 36,625.89. These values correspond to the largest SSE difference after k = 2 and the highest CHI value, so 3 is taken as the optimal number of clusters. Novelty: The integrated RFM model and K-Means cluster algorithm, optimized by the Elbow method, can be used as a method for customer segmentation.


INTRODUCTION
The continuing development of Information Technology (IT) goes hand in hand with the development of human civilization, including in the business sector [1]. In business, competition requires companies to make the best possible use of their existing capabilities in order to compete with other companies [2]. Companies must therefore understand customer characteristics and treat them as an important consideration [3].
Changes in people's economic conditions contribute to increasingly fierce competition in the business industry [4]. As competition grows, retail companies are required to shift from a purely product-oriented focus to strategies that are also customer-oriented [5]. Being customer-oriented requires knowing the characteristics of each customer [6]. Customer segmentation is needed to group customers with the same characteristics: a customer segment is a group of customers who share the same characteristics [7]. One of the methods used to segment customers is the Recency Frequency Monetary (RFM) model. The RFM model is used to distinguish important customers within big data based on three variables, namely recency, frequency, and monetary [8]. Once the customer segments are known, customer characteristics can be analyzed per segment, so that the company can plan how to retain existing customers with the most appropriate service strategy for each one [9].
Data mining is used to assist in decision making [10]. Methods used include the Recurrent Neural Network (RNN) [11], [12], the naïve Bayesian classifier algorithm [13], and the K-Nearest Neighbor (KNN) algorithm [14]. Clustering is a data mining technique that divides a data set into groups such that the data within one group are more similar to each other than to the data in other groups [7]. K-Means is the most popular and widely studied clustering method for minimizing clustering error [15]. The K-Means approach uses a greedy strategy to generate new partitions by assigning each pattern to the nearest cluster center and then recalculating the cluster centers [16]. K-Means groups the given data into a specified number of clusters (k clusters). The idea is to define k center values (k centroids), one for each cluster. The centers must be placed carefully, because different locations lead to different results [17].
A combination of the RFM model and the K-Means algorithm is used to group each category of customers and to determine their loyalty level [18]. The Elbow method improves the performance of the K-Means algorithm by addressing its weakness, namely by helping to choose the optimal k value for clustering [19]. K-Means performs better when the Elbow method is added to select the number of clusters k [20].

METHODS
This study proposes integrating the RFM model with the K-Means cluster algorithm, optimized by the Elbow method, to perform customer segmentation for retail companies, as shown in Figure 1.

Data Collection
In the data collection process, the required dataset is gathered. The dataset was obtained from Retail Shopping Sale Data on Kaggle. It is a collection of transactional e-commerce data from Turkey covering the period from February 1, 2020 to February 1, 2021. The dataset consists of 4 attributes and 62,295 rows. The attributes of the Retail Shopping Sale Data are described in Table 1.

Recency Frequency Monetary Modeling
Recency is the difference between the time of the customer's last transaction and the current time, or the time when the transaction data was published. Frequency is how often the customer makes transactions within a certain period. Monetary is the nominal amount of money the customer spends in transactions; the greater the amount, the greater the value of M [21].
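The three definitions above can be illustrated with a minimal sketch. The row layout `(customer_id, purchase_date, amount)` and the function name `rfm_table` are hypothetical, since the dataset attributes are only listed in Table 1. The sketch also assumes that recency is measured in days against the end of the observation period and that monetary is the sum of transaction amounts.

```python
from collections import defaultdict
from datetime import date

def rfm_table(transactions, reference_date):
    """Aggregate (customer_id, purchase_date, amount) rows into RFM values.

    Assumed row layout; the real dataset columns may differ.
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, purchase_date, amount in transactions:
        previous = last_purchase.get(customer_id)
        if previous is None or purchase_date > previous:
            last_purchase[customer_id] = purchase_date
        frequency[customer_id] += 1      # F: number of transactions
        monetary[customer_id] += amount  # M: total amount spent (assumption)
    return {
        customer_id: {
            # R: days between the last purchase and the reference date
            "recency": (reference_date - last_purchase[customer_id]).days,
            "frequency": frequency[customer_id],
            "monetary": monetary[customer_id],
        }
        for customer_id in last_purchase
    }

# Toy rows for illustration only (not taken from the study's dataset).
transactions = [
    ("A", date(2020, 2, 10), 120.0),
    ("A", date(2021, 1, 20), 80.0),
    ("B", date(2020, 6, 5), 300.0),
]
rfm = rfm_table(transactions, reference_date=date(2021, 2, 1))
print(rfm["A"])
```

In this toy input, customer A bought recently and twice, while customer B bought once and long ago, which is the kind of contrast the RFM variables are meant to capture.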

Data Standardization
In this process, the dataset is standardized using the Standard Scaler. After the Retail Shopping Sale Data is transformed with the RFM model, the values span a wide range, and clustering on such data may not succeed, so the data are standardized using Equation (1) [22].
$$z = \frac{x - u}{s} \tag{1}$$
where x = the sample value, z = the standardized value, u = the average sample value, and s = the standard deviation of the training data.
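As a minimal sketch of Equation (1), the function below standardizes one RFM column using only the Python standard library. The function name `standardize` and the sample values are hypothetical. The population standard deviation is assumed here, which matches the convention of scikit-learn's `StandardScaler`.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Apply z = (x - u) / s to every value in a column."""
    u = fmean(values)   # average sample value
    s = pstdev(values)  # population standard deviation (assumption)
    if s == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - u) / s for x in values]

# Hypothetical recency values in days, for illustration only.
recency = [12, 241, 30, 90]
z = standardize(recency)
print(z)
```

After this step each column has mean 0 and standard deviation 1, so recency, frequency, and monetary contribute on a comparable scale to the distances used in clustering.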

Elbow Method
The clustering process requires an initial k value, the number of clusters, as input. The Elbow method is applied to help determine this k value for the K-Means clustering algorithm. K-Means is run over a range of candidate k values, the SSE of each run is plotted, and the selected k is the point where the graph bends at an angle (the elbow). The resulting k is then used as input to the clustering stage with the K-Means algorithm. The workflow of the Elbow method is shown in Figure 2.
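Reading the elbow from a chart is a visual judgment. One common way to approximate it numerically, shown here as a sketch rather than as the rule used in this study, is to pick the k where the SSE curve bends most sharply, measured by the second difference of the SSE values. The function name `elbow_k` and the SSE values are made up for illustration, and the candidate k values are assumed to be consecutive integers.

```python
def elbow_k(sse_by_k):
    """Return the k at which the SSE curve bends most sharply.

    The bend at k is (SSE drop going into k) - (SSE drop going out of k).
    Assumes the keys are consecutive integer k values.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_in = sse_by_k[prev_k] - sse_by_k[k]
        drop_out = sse_by_k[k] - sse_by_k[next_k]
        bend = drop_in - drop_out
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up SSE values for illustration only (not results from the study).
sse_by_k = {1: 1000.0, 2: 600.0, 3: 250.0, 4: 200.0, 5: 170.0, 6: 150.0}
print(elbow_k(sse_by_k))  # prints 3
```

With these values the SSE still falls steeply from k = 2 to k = 3 and flattens afterwards, so the heuristic selects k = 3.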

K-Means Cluster Algorithm
The K-Means algorithm is a clustering algorithm that groups data into a specified number of groups. It does not use class labels (unsupervised learning). The K-Means clustering process partitions the input data into separate groups, and each cluster has a central point (centroid) that represents it [23]. The K-Means algorithm is used to perform clustering in this study. The input to the K-Means process is the standardized result of the RFM model and the k value obtained by the Elbow method. The flowchart of the K-Means cluster algorithm is shown in Figure 3 [24].

RESULTS AND DISCUSSION
The number of clusters is not chosen as the k with the largest SSE difference overall, but as the k with the largest SSE difference after k = 2, which is also the point that forms the elbow on the Elbow chart. In this study, k = 3 was used to cluster the data, and the result was then evaluated using the Calinski-Harabasz Index (CHI), as shown in Figure 5. This study records the SSE value for each number of clusters and evaluates it using the Calinski-Harabasz Index. The comparison of the Elbow method and the Calinski-Harabasz Index is shown in Table 2. After evaluation with the Calinski-Harabasz Index, the number of clusters produced by the Elbow method is confirmed to be the most optimal.
The results of customer segmentation are analyzed to determine the number of members of each cluster that is formed. Figures 6 and 7 visualize the number of members of each cluster and the percentage of members in each cluster. Table 3 shows the average RFM of each cluster.
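The clustering and evaluation steps described above can be sketched as follows. This is a plain standard-library implementation of K-Means with random initial centroids, together with SSE and the Calinski-Harabasz Index, written for illustration only. All function names and the toy points are hypothetical, and the code is not the implementation used in this study.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's K-Means with random initial centroids drawn from the data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_point(members)
    return labels, centroids

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def calinski_harabasz(points, labels, centroids):
    """CHI = (between-cluster dispersion / (k - 1)) / (SSE / (n - k))."""
    n, k = len(points), len(centroids)
    overall = mean_point(points)
    between = sum(
        labels.count(c) * sq_dist(centroids[c], overall) for c in range(k)
    )
    within = sse(points, labels, centroids)
    return (between / (k - 1)) / (within / (n - k))

# Toy standardized (R, F, M) points, for illustration only.
points = [
    (-1.0, -1.0, -1.0), (-1.1, -0.9, -1.0), (-0.9, -1.1, -1.0),
    (0.0, 0.1, 0.0), (0.1, 0.0, 0.0), (-0.1, -0.1, 0.0),
    (1.0, 1.0, 1.2), (1.1, 0.9, 1.0), (0.9, 1.1, 1.1),
]
labels, centroids = kmeans(points, k=3)
print("cluster sizes:", [labels.count(c) for c in range(3)])
print("cluster means:", centroids)
print("SSE:", sse(points, labels, centroids))
print("CHI:", calinski_harabasz(points, labels, centroids))
```

The centroids returned here are the per-cluster averages of the standardized R, F, and M values, which is the kind of summary a table of average RFM per cluster reports. Because the initial centroids are random, a different `seed` can lead to a different partition.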
Based on the Elbow method and the cluster performance test using the Calinski-Harabasz Index, the number of clusters is set to 3. This number of clusters separates out the group of customers who are still active and who tend to have Recency, Frequency, and Monetary values superior to those of the other groups. These characteristics divide the customers into 3 segments. Among them, cluster 1 is considered capable of contributing very large profits in the future, so the company needs to retain the customers in this cluster.
The proposed method resulted in 3 customer segments, which the calculated SSE and CHI values show to be more accurate and optimal than 8 customer segments. One limitation of this study is that the centroids are still initialized with the random centroid method in K-Means, so the method for choosing initial centroids needs further development. In addition, not all clustering algorithms were tested in this study.

CONCLUSION
The integration of the RFM model and the K-Means cluster algorithm, optimized by the Elbow method, can be used for customer segmentation in retail companies. The Elbow method was applied to determine the optimal number of clusters k, which is the number of clusters with the largest difference in SSE values, namely 25,829.39. The result of the Elbow method was evaluated using the Calinski-Harabasz Index, which gave a value of 36,625.89 and identified the same optimal number of clusters. Thus, it can be concluded that the Elbow method can be used as an optimization to determine the optimal number of clusters k in clustering.