Topic Modeling on WhatsApp User Reviews Using Latent Dirichlet Allocation

Purpose: Topic modeling is a practical technique for identifying topics in text data. This study aims to identify the topics discussed in WhatsApp user reviews using Latent Dirichlet Allocation (LDA) and to describe the characteristics of each topic. Method: We used 1,710 WhatsApp user reviews written on Google Play from 7 to 13 August 2020. The research was conducted with a qualitative method consisting of five stages: problem identification, data retrieval, preprocessing, modeling, and analysis. The modeling stage consists of building a Document-Term Matrix (DTM), determining the number of iterations and topics, and building the model. We use perplexity as the indicator for determining the number of iterations and topics; a lower perplexity value indicates better model performance. The analysis stage includes observing the top terms and top documents in order to label and describe the characteristics of each topic. Result: Topic modeling produces word-topic and document-topic assignments. The word-topic assignment identifies the words with high probability in each topic (top terms). The document-topic assignment identifies the documents with high probability in each topic (top documents). The most frequently discussed topics were voice and video calls (104 reviews), photo and video quality (100 reviews), call quality (86 reviews), and voice messages (75 reviews). Novelty: This research generates a topic model for user reviews of the WhatsApp application using Latent Dirichlet Allocation. The number of iterations used in modeling was determined by observing the perplexity value, rather than being assigned arbitrarily.


INTRODUCTION
Since the internet came into widespread use in the 1990s, human activities have generated large amounts of data [1]. The data contain varied information, and their volume increases rapidly in a relatively short time. This phenomenon is often referred to as big data [2]. The problem of how to make use of such data gave rise to data mining, which Han & Kamber [3] describe as the process of finding knowledge in large-scale data. Data mining applied to text is called text mining [4]. Text mining is defined as the process of obtaining information from an unstructured text database [5], and it provides various solutions for processing, organizing, and analyzing large amounts of unstructured text [6].
Learning algorithms in text mining are divided into supervised and unsupervised learning. In supervised learning, each record in the data set has a label. In unsupervised learning, by contrast, the records in the data set have no labels. One family of unsupervised learning algorithms is clustering. In a collection of documents, clustering allows faster retrieval of information because the documents are examined automatically [7]. One popular clustering method is topic modeling. Topic modeling works by dividing the input data into several groups based on the similarity of the terms contained in each document. A researcher or analyst may not want to work with all of the data, but only with the part that concerns a particular topic. By dividing the corpus into smaller sub-corpora by topic, data analysis can be carried out in a more focused and efficient manner as needed. In addition, the data to be processed become smaller, so the algorithm can work faster.
There have been many studies on topic modeling covering various techniques and problems. Esmaeili et al. [8] apply topic modeling to study the topics of review texts based on certain aspect groups using Variational Aspect-based Latent Topic Allocation (VALTA). Other techniques, such as Natural Language Processing (NLP), are also used to extract information in the form of a summary of the topics discussed in a collection of health product review texts [9]. Besides VALTA and NLP, a frequently used technique is Latent Dirichlet Allocation (LDA). In general, LDA can identify the topics that are often discussed in text data [10]. In VALTA and NLP, the number of topics is determined based on several conditions involving the subjective opinion of the researchers. This can lead to inconsistent results because human intuition and understanding differ. In LDA, the number of topics is determined from a value called perplexity. Once this value is obtained, a researcher only needs to pick the candidate number of topics with the lowest perplexity value [11]. The results are therefore consistent even when the analysis is carried out by different people. In addition to this advantage, LDA has also been shown to be able to analyze text from large documents [12].
*Corresponding author. Email addresses: iqbalkharisudin@mail.unnes.ac.id (Kharisudin), hera.masrian@students.unnes.ac.id (Masri'an). DOI: 10.15294/sji.v9i1.34941
LDA has been widely applied in various studies. Annisa et al. [13] use LDA to study the information in hotel visitor reviews; in that study, topic modeling identified eight topics along with the words that make up those topics. In another study using LDA, Sutherland et al. [14] successfully described topics about the quality of accommodation services in tourism based on customer reviews. LDA can also be applied to find trending topics in a collection of customer complaints [15] and to identify topics in a collection of online news titles [16]. In politics, LDA can be used to analyze political issues such as a presidential election through Twitter posts [17]. In the health care sector, topic modeling with LDA can recommend the best doctors to treat certain diseases based on reviews or experiences from previous patients [18]. LDA can also be applied in software engineering, for example to detect malicious Android applications based on their descriptions [19].
Smartphone application developers want to satisfy the users of their products and services. Improvements and innovations are therefore made continuously to meet user demands, because the quality of a product or service affects customer satisfaction [20]. To collect complaints and feedback, service providers give users the opportunity to write personal reviews of the services or products they have used. These reviews can serve as material for analysis that yields useful information. One of the most popular applications is WhatsApp, which is now used by more than two billion people in 180 countries. With such a large user base, the developer can gather user criticism and suggestions to improve service quality. According to Google, there have been more than one hundred million user reviews, and this number is growing. Such reviews can be a useful source of information if studied. However, one hundred million reviews cannot feasibly be checked manually. Moreover, the reviews cover many different topics. With so many reviews, it is easier to group them by topic first. Text mining is therefore needed to identify and analyze the topics of the reviews more quickly.
Based on this background, the purpose of this research is to develop and find a suitable topic model for WhatsApp user reviews. This study uses Latent Dirichlet Allocation (LDA), a topic modeling technique based on a probabilistic generative model for discrete data sets such as text, in which each item in the collection does not belong to a single topic but has a probability of membership in several topics [11]. Topic modeling can help identify which topics appear in a collection of texts [21], so that the aspects of the WhatsApp service that are often mentioned or complained about can be seen. Because the application is so prevalent, the choice of WhatsApp as the object of research is expected to offer useful considerations to other companies that are developing new products. Before modeling, the number of iterations and the number of topics are determined for the model to be built. After the topics of each document are obtained, the terms that appear in each topic are analyzed. This analysis is carried out to obtain topic descriptions that can be understood more easily. The results of topic modeling with LDA are expected to provide a brief overview of the topics that WhatsApp users often discuss in their reviews.

METHODS
This research was conducted using a qualitative method consisting of five stages (as shown in Figure 1), namely problem identification, data collection, preprocessing, modeling, and analysis of results. At the problem identification stage, the data to be used and the specifications of the appropriate tools for data collection are examined. The method used to obtain the data is web scraping. This method was chosen because it can handle the problem of retrieving data from online sources [22]. Web scraping requires a stable internet connection but works faster and more efficiently than manual copy-and-paste [23]. R and RStudio were used for data retrieval and processing in this study.
The preprocessing stage aims to organize and prepare the data to match the application libraries used. The data used in this study are the texts of WhatsApp user reviews in English, accessed via Google Chrome. The scraping yielded 11,960 reviews, each including a username, date of review, rating (stars), and review text. To keep the range of review topics from being too broad, the 1,710 reviews written from 7 to 13 August 2020 were selected. Names, stars, and dates of review are not used in this study. The data obtained from the scraping process are saved in .csv format. To simplify organizing the data, the format was changed to .xls. Emojis and emoticons are automatically converted to UTF-8 format. The LDA algorithm requires input in the form of a Document-Term Matrix (DTM) with Term Frequency (TF) weighting, which is the default weighting for clustering algorithms. The data from the previous stage are converted into a corpus and processed into a DTM using the tm package. The model calculations in this study mainly use functions from the topicmodels and quanteda packages, supported by the tidyverse package for plotting graphs and wordcloud2 for visualization.
Before the main algorithm is run, the number of topics and the number of iterations must be determined and validated. In this way, an optimum model can be produced that is considered good enough to represent the existing topics [22]. The process of determining the candidate number of iterations and topics begins by dividing the DTM into two parts, namely training and testing data. A model is then built using the training data and applied to the testing data [24].
LDA is an unsupervised learning technique based on a probabilistic generative model [25]. This generative process defines a joint probability distribution over random variables, both observed and hidden.
The joint distribution is used to calculate the conditional distribution of the hidden variables given the observed variables. This conditional distribution is often called the posterior distribution. The computational problem in inferring the topic structure of a document is how to calculate the posterior distribution [26].
Figure 2. LDA generative process
The graphical model in Figure 2 [27] represents the generative process performed by the LDA algorithm. A shaded circle is an observed variable, while an unshaded circle is a hidden (latent) variable. The parameter $\alpha$ determines the distribution of topics in the corpus: the larger its value, the wider the mix of topics the corpus contains. The parameter $\delta$ determines the distribution of words in a topic: a high value indicates a wider mix of words in the word distribution ($\beta$). The variable $\theta$ represents the distribution of topics in each document: the larger the value, the more topics appear in the document. The variable $z$ denotes the topic assigned to each word of a document (a collection of words), while the variable $w$ represents a term. The outer box represents the repeated process of determining the topic proportions of each document [12]. The parameters $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled once in the process of generating the corpus. The variable $\theta_d$ resides at the document level and is sampled once per document. The variables $z_{dn}$ and $w_{dn}$ are at the term level and are sampled once for each word in each document [11]. Grün & Hornik [28] describe the generative process of LDA as follows.
1) Choose the term distribution for each topic, $\beta_k \sim \mathrm{Dirichlet}(\delta)$. Each topic is defined as a probability distribution over the $V$ words in the vocabulary. Technically, for each topic $k \in \{1, 2, \ldots, K\}$, this distribution is drawn from a Dirichlet distribution with prior parameter $\delta$.
2) Choose the topic proportions for each document, $\theta_d \sim \mathrm{Dirichlet}(\alpha)$. Each document in the corpus is expressed as a probability distribution over topics ($\theta_d$). This distribution is drawn from a Dirichlet distribution with prior parameter $\alpha$.
3) For each word $w_i$ of document $d$:
a) Choose a topic $z_i$ from the topic distribution of the document ($\theta_d$), $z_i \sim \mathrm{Multinomial}(\theta_d)$.
b) Choose a word $w_i$ from the word distribution ($\beta$), with $z_i$ as taken from the previous step, $p(w_i \mid z_i, \beta)$.
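The three steps above can be sketched as a small simulation. This is only an illustrative sketch in Python (the study itself worked in R); the function names, the toy vocabulary, and the parameter names `alpha` and `delta` for the two Dirichlet priors are assumptions made for this example, not part of the original study.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha, delta, seed=0):
    """Generate a toy corpus by following the three steps of the LDA process."""
    rng = random.Random(seed)
    # Step 1: one term distribution per topic, drawn from Dirichlet(delta).
    term_dists = [sample_dirichlet([delta] * len(vocab), rng)
                  for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Step 2: topic proportions of this document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Step 3a: choose a topic for this word position.
            z = rng.choices(range(n_topics), weights=theta, k=1)[0]
            # Step 3b: choose a word from the term distribution of that topic.
            words.append(rng.choices(vocab, weights=term_dists[z], k=1)[0])
        corpus.append(words)
    return term_dists, corpus


# Hypothetical toy vocabulary, used only for illustration.
term_dists, corpus = generate_corpus(
    n_docs=3, doc_len=6, n_topics=2,
    vocab=["call", "video", "photo", "status"], alpha=0.5, delta=0.5)
for doc in corpus:
    print(" ".join(doc))
```

Small prior values such as 0.5 tend to produce documents dominated by one topic and topics dominated by a few words, which is the behaviour the description of the priors suggests. Fitting LDA runs this process in reverse: the words are observed and the topics are inferred.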
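According to Blei et al. [11], the generative process above can be expressed by several equations. Given the parameters $\alpha$ and $\beta$, where $\beta$ holds the word proportions of the topics, the joint distribution of the topic proportions $\theta$, the set of topics $\mathbf{z}$, and the set of words $\mathbf{w}$ in a document is written in equation (1),
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (1)$$
where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. The value $p(\theta \mid \alpha)$ in equation (2) is a Dirichlet distribution, which gives the probability density of a random variable $\theta$ with components $i \in \{1, 2, \ldots, K\}$ and parameters $\alpha_i > 0$, where $\Gamma(\cdot)$ is the Gamma function.
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1} \qquad (2)$$
By integrating equation (1) over $\theta$ and summing over each $z_n$, we get the marginal distribution of a single document in equation (3).
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta \qquad (3)$$
Then, by taking the product of the marginal probabilities of the single documents from equation (3), the probability of a corpus $D$ is obtained in equation (4),
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d \qquad (4)$$
where $K$ indicates the number of topics, $M$ is the number of documents, and $N_d$ represents the number of terms in document $d$ [11].
Computationally, conclusions about the topic structure are drawn by calculating the posterior distribution in equation (5) [11], or equation (6) [29].
$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)} \qquad (5)$$
$$P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}, \mathbf{z})} \qquad (6)$$
The LDA algorithm assumes that words are generated by topics (through fixed conditional distributions) and that these topics are infinitely exchangeable within a document. Based on de Finetti's theorem, the probability of a sequence of words and topics is obtained through equation (7).
$$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta \qquad (7)$$
De Finetti's theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and the variables were then independent and identically distributed given that parameter. An infinite sequence of random variables is said to be infinitely exchangeable if every finite subsequence is exchangeable. A finite set of random variables $\{z_1, z_2, \ldots, z_N\}$ is exchangeable if its joint distribution is invariant to permutation. If $\pi$ is a permutation of the integers from 1 to $N$, then $p(z_1, z_2, \ldots, z_N) = p(z_{\pi(1)}, \ldots, z_{\pi(N)})$ [11].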
To find topics in the corpus $D = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M)$, where each word $w$ belongs to some document $\mathbf{w}_d$, an estimate must be obtained that gives a high probability to the words that appear. One way to get this estimate is to maximize $p(\mathbf{w} \mid \alpha, \beta)$ in equation (3) using the Expectation-Maximization (EM) algorithm [30] to find the maximum likelihood estimates of $\alpha$ and $\beta$ in equation (8) [28].
$$\ell(\alpha, \beta) = \log p(\mathbf{w} \mid \alpha, \beta) = \log \int \left\{ \sum_{\mathbf{z}} \left[ \prod_{n=1}^{N} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta) \right] \right\} p(\theta \mid \alpha)\, d\theta \qquad (8)$$
Based on the explanation of Griffiths & Steyvers [29], the calculation in equation (5) involves evaluating a probability distribution. Although $P(\mathbf{w} \mid \mathbf{z})$ can be calculated for each $\mathbf{z}$, evaluating it over all possible assignments would require too large a state space. A sampling method is therefore needed: Markov chain Monte Carlo, which samples from a Markov chain that converges to the target distribution. Each state of the Markov chain is an assignment of values to the sampled variables. With Gibbs sampling, the next state is obtained by sequentially sampling every variable from its distribution conditioned on the current values of all other variables. In the LDA model using Gibbs sampling, the posterior distribution is obtained by equation (9),
$$p(z_i = k \mid \mathbf{w}, \mathbf{z}_{-i}) \propto \frac{n_{-i,k}^{(j)} + \delta}{n_{-i,k}^{(\cdot)} + V\delta} \cdot \frac{n_{-i,k}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + K\alpha} \qquad (9)$$
where $n_{-i,k}^{(j)}$ is the number of times term $j$ (the term at position $i$) is assigned to topic $k$, $n_{-i,k}^{(d_i)}$ is the number of words in document $d_i$ assigned to topic $k$, a dot denotes summation over that index, and the subscript $-i$ means that the current position $i$ is excluded from the counts.
Next, the number of iterations is determined through model-fitting experiments with several different values chosen at random. For each model and each value, perplexity is calculated, recorded, and plotted. The trend of the values is then observed visually from the plot. The number of iterations chosen is the earliest point at which a stable trend begins to appear [12]. After the number of iterations is determined, LDA models are built again to determine the candidate number of topics. These models use the selected number of iterations and are fitted for several candidate numbers of topics within a certain range. From these models, the perplexity value of each candidate number of topics is obtained.
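To make the Gibbs sampling step referenced by equation (9) more concrete, here is a minimal collapsed Gibbs sampler in Python. It is a sketch under assumptions rather than the implementation used in the study (which relied on R packages): symmetric priors named `alpha` and `delta`, documents given as lists of integer term ids, and all function and variable names are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha, delta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of term ids."""
    rng = random.Random(seed)
    topic_term = [[0] * vocab_size for _ in range(n_topics)]  # term counts per topic
    topic_total = [0] * n_topics                              # words per topic
    doc_topic = [[0] * n_topics for _ in docs]                # topic counts per document
    assignments = []

    # Start from a random topic for every word position.
    for d, doc in enumerate(docs):
        z_doc = []
        for term in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            topic_term[k][term] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                # Remove the current word from the counts.
                k_old = assignments[d][i]
                topic_term[k_old][term] -= 1
                topic_total[k_old] -= 1
                doc_topic[d][k_old] -= 1
                # Unnormalized conditional probability of each topic. The
                # document-length denominator is the same for every topic,
                # so it is left out.
                weights = [
                    (topic_term[k][term] + delta)
                    / (topic_total[k] + vocab_size * delta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = k_new
                topic_term[k_new][term] += 1
                topic_total[k_new] += 1
                doc_topic[d][k_new] += 1
    return assignments, topic_term, doc_topic


docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]  # toy term ids
assignments, topic_term, doc_topic = gibbs_lda(
    docs, n_topics=2, vocab_size=4, alpha=0.1, delta=0.1, n_iter=50)
```

One pass over every word position corresponds to one iteration, which is the quantity the study tunes by watching perplexity. Normalizing the smoothed rows of `topic_term` and `doc_topic` would give estimates of the word-topic and document-topic distributions.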
Perplexity is a standard measure in topic modeling, which measures how well a model built from training documents predicts unseen documents [12]. The approach is to train latent variable models on a text corpus and compare their generalization performance; the aim is density estimation, in the hope of achieving a high likelihood on held-out test data. Perplexity is monotonically decreasing in the likelihood of the test data and, from an algebraic point of view, is equivalent to the inverse of the geometric mean per-word likelihood [11]. If $d$ is a document, $m$ is the number of words in the document, $w_i$ is the $i$-th word in the document, and $n$ is the number of words in the calculated sequence, then the probability of the sequence of words appearing together is written in equation (10).
$$P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) \qquad (10)$$
Equation (10), referred to as the $n$-gram model, is used as the basis for calculating the likelihood in equation (8) and also in the formulation of equation (3). The model in equation (10) is a statistical language model used to determine the probability of a word sequence appearing together, estimated from a training document [31]. For example, the probability $P(w_i \mid w_{i-2}, w_{i-1})$ in the trigram (3-gram) case is obtained with the formula in equation (11),
$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})} \qquad (11)$$
where $C(\cdot)$ is the number of times the given word sequence occurs in the training document.
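The trigram case described above can be illustrated with a count-based estimate. This Python sketch assumes the plain maximum likelihood estimate (trigram count divided by the count of its two-word context) with no smoothing; the function name and the toy token sequence are made up for the example.

```python
from collections import Counter


def trigram_probability(tokens, w_prev2, w_prev1, w):
    """Maximum likelihood estimate of P(w | w_prev2, w_prev1) from counts."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count a two-word context only where a third word follows it.
    contexts = Counter(zip(tokens, tokens[1:-1]))
    if contexts[(w_prev2, w_prev1)] == 0:
        return 0.0  # unseen context; this sketch applies no smoothing
    return trigrams[(w_prev2, w_prev1, w)] / contexts[(w_prev2, w_prev1)]


tokens = "delete photo delete message delete photo angry".split()
p = trigram_probability(tokens, "delete", "photo", "angry")
print(p)  # the context "delete photo" occurs twice, once followed by "angry"
```

In practice unseen contexts are handled with smoothing, which is outside the scope of this sketch.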
To measure the performance of a language model (model fitting), the information-theoretic quantity entropy is used [31]. The entropy value is related to the perplexity value. $H$ is the estimated entropy, which describes how well the model fits the testing data. The lower the value of $H$, the better the resulting model is considered to be. The perplexity of a document under the $n$-gram language model approach is defined as equation (12) [31].
$$\mathrm{perplexity} = 2^{H} \qquad (12)$$
With some conditions and assumptions in the LDA algorithm, for a test set consisting of a corpus $D_{\mathrm{test}}$ with $M$ documents, the perplexity of the collection of documents is obtained by modifying the above equation into equation (13),
$$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\} \qquad (13)$$
where $\mathbf{w}_d$ denotes the $d$-th document in the corpus $D_{\mathrm{test}}$, $N_d$ is the number of words in document $d$, and $\log p(\mathbf{w}_d)$ is the log-likelihood $\ell(\alpha, \beta)$ as in equation (8), evaluated for document $d$ [11].
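As a small worked example of the corpus-level perplexity, the following Python sketch computes it from per-document log-likelihoods and document lengths. It assumes natural logarithms and that the log-likelihoods have already been produced by some fitted model; the function name and the toy numbers are illustrative only.

```python
import math


def corpus_perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of document log-likelihoods) / (total number of words))."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)


# Toy check: a model that gives every word probability 1/4, as if it were
# guessing uniformly among four terms, has a perplexity of about 4.
doc_lengths = [3, 5]
doc_log_likelihoods = [n * math.log(0.25) for n in doc_lengths]
print(corpus_perplexity(doc_log_likelihoods, doc_lengths))
```

The toy check shows why lower is better: perplexity behaves like the effective number of equally likely words the model is choosing between, so a model that predicts the test words more confidently scores lower.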
The model is validated using 10-fold cross-validation to choose which candidate produces a good model, namely the model with the lowest perplexity value [21]. In 10-fold cross-validation, the training data are split into 10 parts, and each part is used once as the validation set while the model is fitted on the remaining parts (Figure 3 illustrates 5-fold cross-validation). This is done to reduce the risk of the model overfitting the training data (fitting it too closely), so that it performs better when applied to new data [7].
The process continues by fitting the LDA model to the entire DTM, using the number of iterations and the number of topics selected in the previous process and the Gibbs sampling method. At this stage, the posterior distribution for the data is also calculated. From the posterior distribution, the word probability distribution of each topic (matrix $\phi$) and the proportion of topics in each document (matrix $\theta$) are obtained. The computational inference process is carried out to determine the topic membership of each document. The modeling results are then analyzed further to reveal the words that make up each topic. By considering the probability distribution of the words (matrix $\phi$), each topic can be given a description such as a topic name or label. In addition, the membership of each document in the resulting topics is identified, so that each document can be labeled according to the probability distribution of the topics that emerge (matrix $\theta$). Other descriptions can also be added to explain the processed results. This analysis yields recommendations and insights related to the modeling results.
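The fold construction behind k-fold cross-validation can be sketched as follows. This Python example only builds the training and validation index sets; fitting a model and computing perplexity on each split is left out. The function name, the interleaved fold assignment, and the seed are assumptions for illustration.

```python
import random


def k_fold_splits(n_docs, k, seed=0):
    """Shuffle document indices and return k (training, validation) pairs."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        # Every fold except fold i forms the training set.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_splits(n_docs=20, k=5)
for training, validation in splits:
    print(len(training), len(validation))
```

Each document lands in exactly one validation fold, so every candidate model is scored once on every document without having been trained on it.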

RESULT AND DISCUSSION
After the data have been obtained, the next step is to organize them through the preprocessing stage. This stage includes removing punctuation, removing numbers, converting to lowercase, lemmatization or stemming, removing stopwords, and stripping white space. An example of the result of each step is shown in Table 1.
Table 1
Remove punctuation: "It s a good app but one problem is that why can t we delete some photo for everyone I mean we can delete messages for everyone but not some document or photo U 0001F914 U 0001F47F I am really angry with this app "
Remove number: "It s a good app but one problem is that why can t we delete some photo for everyone I mean we can delete messages for everyone but not some document or photo U F U FF I am really angry with this app "
Lowercase: "it s a good app but one problem is that why can t we delete some photo for everyone i mean we can delete messages for everyone but not some document or photo u f u ff i am really angry with this app "
Lemmatization: "it s a good app but one problem be that why can t we delete some photo for everyone i mean we can delete message for everyone but not some document or photo u f u ff i be really angry with this app"
Remove stopword: " delete photo delete message document photo angry "
Strip white space: "delete photo delete message document photo angry"
After the remove punctuation and remove number steps, every punctuation mark has been replaced with a space and every number has been removed. The next step converts each uppercase character to lowercase. Lemmatization or stemming changes each word into its base form (lemma). In the remove stopword step, words that carry no specific thematic meaning are removed and replaced with spaces. Unnecessary spaces are then removed in the strip white space step.
A total of 1,710 documents that have passed the preprocessing stage (done with RStudio) are converted into a DTM. The resulting DTM is a 1,710 × 2,038 matrix with TF weighting.
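The preprocessing steps illustrated in Table 1 can be sketched in a few lines of Python. This is not the R pipeline used in the study: the stopword list and the lemma lookup below are tiny hypothetical stand-ins, chosen only so that the example review from Table 1 reproduces the final row of the table.

```python
import re

# Hypothetical, deliberately tiny resources for this one example.
LEMMAS = {"is": "be", "am": "be", "messages": "message"}
STOPWORDS = {
    "it", "s", "a", "good", "app", "but", "one", "problem", "be", "that",
    "why", "can", "t", "we", "some", "for", "everyone", "i", "mean", "not",
    "or", "u", "f", "ff", "really", "with", "this",
}


def preprocess(text):
    text = re.sub(r"[^\w\s]|_", " ", text)   # remove punctuation
    # Digits are deleted here so that "0001F47F" becomes "FF" as in Table 1.
    text = re.sub(r"\d", "", text)           # remove numbers
    text = text.lower()                      # lowercase
    tokens = [LEMMAS.get(t, t) for t in text.split()]    # lemmatization (lookup)
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    return " ".join(tokens)                  # joining also strips extra white space


review = ("It s a good app but one problem is that why can t we delete some "
          "photo for everyone I mean we can delete messages for everyone but "
          "not some document or photo U 0001F914 U 0001F47F I am really angry "
          "with this app ")
print(preprocess(review))  # delete photo delete message document photo angry
```

A real pipeline would use a full stopword list and a proper lemmatizer; the lookup tables here only mimic their effect on this single review.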
This weighting gives the weight of a term in a document as the number of occurrences of the term in that document. The description can be seen in Figure 5 (left). Each term registered in the DTM is unique, that is, no two columns hold the same term. The sparsity value lies between 0 and 1 and is displayed in the DTM as a percentage. The closer the value is to 1 (100%), the more terms appear in only a few documents (one document, or even none). Some of these terms are words with typing errors. Because such terms carry no useful meaning, they are removed with Remove Sparse Terms. The description of the reduced DTM can be seen in Figure 5 (right). The 1,260th document contains only the word "somethings", so it is no longer listed in the reduced DTM. As a result, the sparsity becomes 99%, with 1,145 terms removed, leaving 893 terms and 1,709 documents in the DTM.
The DTM obtained from the previous step is then divided into two parts of equal proportion (50%): 854 documents as training data and the remaining 855 documents as testing data. The candidate numbers of topics chosen for the model calculations are 15, 20, 25, 30, and 35. For each candidate, the perplexity of the model is calculated for numbers of iterations ranging from 1 to 1,500, so that the trend can be observed clearly over a wide range. The perplexity calculation is done by running the LDA algorithm in the topicmodels package. The training data are used as input for building the model, and model fitting is then carried out with the testing data as input. With the device used, this process takes 15 hours and 28 minutes (928 minutes). The number of iterations is determined by observing the perplexity value. To make the trend easier to observe, a smooth line is drawn through the perplexity values for each number of topics in Figure 6.
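The construction of a term-frequency DTM, its sparsity, and the removal of sparse terms can be illustrated with a toy example. This Python sketch only imitates the idea of the Remove Sparse Terms step; the threshold semantics (keep a term if the share of documents lacking it is below `max_sparsity`), the dropping of documents left empty, and all names are assumptions for illustration.

```python
from collections import Counter


def build_dtm(docs):
    """Return the sorted vocabulary and a document-by-term count matrix."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    column = {term: j for j, term in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term, count in Counter(doc.split()).items():
            dtm[i][column[term]] = count   # term frequency weighting
    return vocab, dtm


def sparsity(dtm):
    """Share of zero cells in the matrix, between 0 and 1."""
    cells = len(dtm) * len(dtm[0])
    return sum(1 for row in dtm for value in row if value == 0) / cells


def remove_sparse_terms(vocab, dtm, max_sparsity):
    """Drop terms missing from too many documents, then drop empty documents."""
    n_docs = len(dtm)
    keep = [j for j in range(len(vocab))
            if sum(1 for row in dtm if row[j] == 0) / n_docs < max_sparsity]
    new_vocab = [vocab[j] for j in keep]
    kept_docs = [i for i, row in enumerate(dtm) if any(row[j] for j in keep)]
    new_dtm = [[dtm[i][j] for j in keep] for i in kept_docs]
    return new_vocab, new_dtm, kept_docs


docs = ["delete photo delete message", "video call quality", "call drop", "somethings"]
vocab, dtm = build_dtm(docs)
new_vocab, new_dtm, kept_docs = remove_sparse_terms(vocab, dtm, max_sparsity=0.7)
print(sparsity(dtm), new_vocab, kept_docs)
```

In the toy corpus, the document that contains only a rare term disappears once that term is removed, which mirrors how a document can drop out of the reduced DTM.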
In Figure 6, the five candidate numbers of topics are assigned different colors (15 = black, 20 = green, 25 = cyan, 30 = red, 35 = blue). Each point represents the perplexity value of a model for the corresponding candidate number of topics. From these points, a smooth regression line is drawn in the color of each model. The perplexity value is higher on the left of the graph, decreases toward the right (as the number of iterations grows), and from a certain point onward the values settle into a stable pattern. The number of iterations selected is the earliest value at which a stable trend begins to appear across all candidate numbers of topics. From Figure 6 it can be seen that the lines begin to level off before 500 iterations. From the graph in Figure 6 (left), zoomed in Figure 6 (right), the number of iterations chosen is 300.
After the number of iterations is selected, the next step is to determine the number of topics. This stage uses the same two parts of the DTM, each 50% of the data, that served as training and testing data in the previous step. The calculation with 10-fold cross-validation starts by splitting and randomizing the training set into 10 parts, followed by fitting on the validation set. The number of iterations is set to 300. The candidate numbers of topics ($K$) included in the calculation range from 2 to 100. This process produces a total of 990 records (99 candidates × 10 folds), whose values are visualized as graphs. Figure 7 shows a plot of the perplexity values obtained for each fold. Each candidate number of topics has 10 points, which represent the perplexity value of each fold. The blue line in the figure is a smooth regression line with respect to all points. After the calculation results are obtained, they are plotted again to observe the perplexity value. From this plot, 31 topics were selected.
After the number of iterations and the number of topics are determined, the next step is to build the final model using these parameters. The modeling uses the FitLdaModel function in the textmineR package. The model was run with 300 iterations and 31 topics (as selected in the previous steps). Building this model produces two results: the word-topic assignment ($\phi$; phi) and the document-topic assignment ($\theta$; theta). (1) The word-topic assignment can be viewed as a matrix with the number of terms as the row size and the number of topics as the column size. Here, the word-topic assignment has size 893 × 31: 893 terms and 31 topics. The probabilities of all words in each column sum to 1. Each $(i, j)$ entry of the word-topic assignment states the conditional probability of the $i$-th word being included in the $j$-th topic. A word has a different probability in each topic.
The greatest probability value in each row (for a word) shows the topic to which the word is best suited. Meanwhile, the values in one column indicate the probabilities of the words that make up the topic in that column. A word with a greater probability in a column is more informative in explaining the latent characteristics of the topic than words with a smaller probability. By sorting the probability values of the words in each column (via Excel), the top terms with the greatest probability are obtained, which describe the characteristics of the related topic. (2) The document-topic assignment is similar to the word-topic assignment, except that the rows now correspond to document numbers instead of words, so the document-topic assignment matrix has size 1,709 × 31: 1,709 documents and 31 topics. Each entry in the $d$-th row of the document-topic assignment represents the conditional probability of each topic appearing in document $d$. Meanwhile, each entry in the $k$-th column describes the probability that a document belongs to topic $k$. The larger this proportion for a document, the more suitable the document is for topic $k$. By taking the maximum probability of a document in its row (using Excel), a topic assignment for each document is obtained, namely the topic for which the document is most suitable. The documents with the highest probability in each topic are also collected as top documents.
Next, the characteristics of each topic are described. One of the investigations carried out is to re-examine the top terms and top documents as a basis for an intuitive description and interpretation of each topic. The labels of the 31 topics are briefly listed in Table 2. Interesting insights and recommendations are obtained from the table.
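The two look-ups described above, sorting a topic column to find its top terms and taking the row maximum to assign a document to a topic, were done in Excel in the study. The Python sketch below shows the same idea on made-up numbers; the matrix layout (terms in rows and topics in columns for `phi`, documents in rows for `theta`) follows the description in the text, and the function names are hypothetical.

```python
def top_terms(phi, vocab, topic, n=3):
    """Return the n terms with the highest probability in one topic column."""
    column = [(phi[i][topic], vocab[i]) for i in range(len(vocab))]
    return [term for _, term in sorted(column, reverse=True)[:n]]


def assign_topics(theta):
    """For each document row, return the index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


# Made-up matrices: 4 terms x 2 topics, and 3 documents x 2 topics.
vocab = ["call", "video", "photo", "status"]
phi = [[0.5, 0.1],
       [0.3, 0.1],
       [0.1, 0.5],
       [0.1, 0.3]]
theta = [[0.8, 0.2],
         [0.3, 0.7],
         [0.55, 0.45]]

print(top_terms(phi, vocab, topic=0, n=2))  # ['call', 'video']
print(top_terms(phi, vocab, topic=1, n=2))  # ['photo', 'status']
print(assign_topics(theta))                 # [0, 1, 0]
```

The third document shows the caveat of a hard assignment: it goes to topic 0 even though topic 1 is almost as probable.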
Topics 5 and 6 can be used by the developer to assess the performance of the story feature. The reviews in topics 3, 16, 9, 1, and 11 can be used to evaluate how the calling and messaging features perform. The most frequently discussed topics were voice and video calls (104 reviews), photos and videos (100 reviews), call quality (86 reviews), and voice messages (75 reviews). The maximum topic probability in a document is used to determine the topic assignment of that document. There is still a chance that the document also belongs to another topic. Although it does not fully capture the contents of the document, the topic with the maximum probability can be considered the most likely topic given those contents. The top terms and top documents of each topic are used to label the topics. This is done to obtain an overview and a brief description of each topic, which helps developers obtain information and recommendations more easily.

CONCLUSION
In this research, topic modeling has been carried out on user reviews of the WhatsApp application using Latent Dirichlet Allocation (LDA). The stages include: (1) identifying the problem and collecting the necessary data; (2) preprocessing; (3) determining the number of iterations by examining the perplexity value and selecting the number of topics with cross-validation. Determining the number of iterations by perplexity is a method that has not been widely used; most studies set the number of iterations arbitrarily; (4) running the model with the selected number of iterations and number of topics. This stage produces the word-topic and document-topic assignments; (5) observing the top terms and top documents of each topic to obtain topic labels that describe each topic intuitively. The top terms are used to determine the characteristics of a topic, and these findings are then matched with the top documents of the topic. Broadly speaking, the topics covered in the data are calls, multimedia features, messaging, storage, and security issues. In this study, the determination of the number of topics is preceded by testing the number of iterations through perplexity calculation; the selected number of iterations is then used in determining the number of topics through cross-validation. In other studies using LDA, the methods for determining the number of topics are quite diverse. Future research could therefore compare the methods for determining the number of topics in LDA to see which method yields the most interpretable topics. In addition, the accuracy of the resulting topic model could be measured.