CO and PM 10 Prediction Model based on Air Quality Index Considering Meteorological Factors in DKI Jakarta Using LSTM

Purpose: This study aimed to build CO and PM10 prediction models for DKI Jakarta using Long Short-Term Memory (LSTM), with and without meteorological variables (wind speed, solar radiation, air humidity, and air temperature), to see how far these variables affect the model. Methods: The method chosen in this study is the LSTM recurrent neural network, one of the best-performing algorithms for time-series prediction. The LSTM models were used to compare performance between modeling with and without meteorological factors. Result: The results show that meteorological predictors have no effect on the CO prediction model but do influence the PM10 prediction model. For PM10, the model with meteorological predictors produces a smaller Root Mean Square Error (RMSE) and a stronger correlation coefficient than the model without them. Novelty: This paper compares CO and PM10 prediction models under two scenarios, modeling with and without meteorological factors. The comparative analysis found that the meteorological variables do not affect the CO index at the 5 air quality monitoring stations in DKI Jakarta, which suggests that the level of CO pollutants tends to be influenced by factors other than meteorological ones.


INTRODUCTION
Indonesia ranked 9th out of 106 countries for the worst air quality in the world in 2020, while Jakarta ranked as the 3rd most polluted city in Indonesia [1]. This makes air quality in Jakarta a problem that deserves serious attention. Currently, the official air quality standard in Indonesia is ISPU (Air Pollutant Standard Index), which is calculated from seven parameters, namely PM10, PM2.5, NO2, SO2, CO, O3, and HC [2].
Of all these parameters, CO and PM10 have negative impacts even at relatively low index ranges. In the range of 51-100, CO can cause changes in blood chemistry that are not yet detectable, while PM10 causes decreased visibility. In the range of 101-199, CO causes an increase in cardiovascular disease in smokers with heart disease, while PM10 causes a decrease in visibility and ubiquitous fouling. In the range of 200-299, CO causes cardiovascular increases in nonsmokers with heart disease, and some noticeable weakness appears. Meanwhile, PM10 increases the sensitivity of patients with asthma and bronchitis [3].
One way to overcome the problem of air pollution in DKI Jakarta is to make temporal predictions of air quality using data from previous times. Creating a predictive model for each pollutant to predict the daily air quality index can create warnings for air quality. The prediction model can warn if the air quality has reached a certain level that harms health. It can also be used to control emissions, for example, to propose emission reductions, operational plans, or emergency response actions based on the results of existing predictions.
Several previous studies have predicted air quality. The study [7] predicted the average level of hazardous substances in DKI Jakarta based on the air pollutant standard index using the Long Short-Term Memory (LSTM) method and obtained a Mean Absolute Percentage Error (MAPE) of 12.28%. However, that study built its prediction model without considering meteorological factors.
LSTM is an effective neural network model to predict time series [8]. The LSTM architecture is a particular type of RNN introduced by [9] to avoid long-term dependency problems in common RNNs [10]. Several studies have shown better LSTM performance than other methods in predicting time series. In addition, LSTM-RNN is suitable for making predictions on non-linear and non-stationary data [11], [12].
The study [13] predicted air quality from six meteorological factors, namely wind speed, cloud volume, air pressure, temperature, relative humidity, and precipitation, using a Backpropagation (BP) Neural Network, LSTM, and a Gated Recurrent Unit (GRU). The meteorological parameters in that study were chosen by feature selection using transfer entropy. All three algorithms produced good models with a small RMSE, and LSTM produced better accuracy than the BP neural network and GRU. The study [14] compared univariate and multivariate predictions using ARIMA and LSTM to predict the number of cases of HFMD (Hand, Foot, and Mouth Disease). Its results indicate that the multivariate prediction model using LSTM performs better than the other models.
Based on the results obtained from previous studies, this study aims to create a PM10 and CO prediction model based on the air pollutant standard index in DKI Jakarta by comparing the prediction model with and without the meteorological predictors. The comparison aims to see how meteorological factors affect the model's performance in predicting CO and PM 10 based on the air pollutant standard index.
The meteorological variables used in this study consist of air temperature, relative humidity, solar radiation, and wind speed. Due to the differences in ISPU distribution patterns and pollutant patterns between regions, a prediction model will be made for each air quality monitoring station in DKI Jakarta, namely DKI 1 (Bundaran HI), DKI 2 (Kelapa Gading), DKI 3 (Jagakarsa), DKI 4 (Lubang Buaya), and DKI 5 (Kebon Jeruk).

Materials
The study area is DKI Jakarta, one of the areas with the highest air pollution levels in Indonesia. The data used are the CO and PM10 indices based on ISPU and meteorological data from January 1, 2017, to March 31, 2021. The pollutant data are daily CO and PM10 values from each air monitoring station in DKI Jakarta.
CO and PM10 data of each monitoring station are sourced from the DKI Jakarta Environment Service, which was downloaded from (https://data.jakarta.go.id/). Meanwhile, meteorological data from 3 Meteorology, Climatology, and Geophysical Agency stations in Jakarta, namely Kemayoran Station, Halim Perdana Kusuma Station, and Kemayoran Station, were downloaded from (https://dataonline.bmkg.go.id/).

Stage of Research
This research consists of several steps, namely data collection, data preprocessing, data partitioning, model building, and model evaluation. The stages of the research can be seen in Figure 1.

Data Collection
The air pollution data of DKI Jakarta are provided by "Dinas Lingkungan Hidup" of DKI Jakarta Province at the website https://data.jakarta.go.id/. The meteorological data were downloaded from https://dataonline.bmkg.go.id/.

Data Preprocessing
Data preprocessing in this study consisted of checking for missing values, integrating the meteorological data with the pollutant data, and finally normalizing the data. Normalization rescales the attribute values so that they fall within a specific range. This is important because it prevents attributes with a wide range of values from dominating those with a small range. In this study, all variables were normalized to the range 0 to 1 using min-max normalization. The min-max normalization formula [15] can be seen in Equation (1).

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where: x = original value, x_min and x_max = minimum and maximum values of the variable, x' = normalized value
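As an illustration of Equation (1), the following minimal Python sketch rescales a series to the range 0 to 1 and maps it back to the original scale. The function names and the sample index values are hypothetical and are not taken from the study's code or data.

```python
def min_max_normalize(values):
    """Scale a sequence of numbers to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no spread; map every value to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def min_max_denormalize(scaled, lo, hi):
    """Map normalized values back to the original scale."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical daily PM10 index values, for illustration only.
pm10_index = [40, 55, 70, 100, 85]
scaled = min_max_normalize(pm10_index)
restored = min_max_denormalize(scaled, min(pm10_index), max(pm10_index))
```

The inverse mapping is needed when predictions made on the normalized scale are reported as index values. A common precaution, assumed here rather than stated in the source, is to compute the minimum and maximum from the training data only.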

Data Partitioning
The data that has gone through the preprocessing stage is divided into training and testing data. The proportions used are 80% and 20%, where 80% of the initial data is training data and 20% is testing data. The training data will be used to create the model. Meanwhile, the testing data will be used to evaluate the model.
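The 80/20 partition can be sketched as follows. The source does not state whether the split is chronological; this sketch assumes it is, since shuffling a time series would leak future values into training. The function name and sample data are hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered sequence into training and testing parts without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Ten hypothetical daily observations, ordered by date.
daily_index = list(range(1, 11))
train, test = chronological_split(daily_index)
```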

Modelling Scenarios
The modeling process is carried out using two scenarios. The first scenario uses meteorological variables as predictors, and the second does not.
1. Prediction model with meteorological predictors: meteorological variables are entered as inputs to predict CO and PM10 at the 5 air quality monitoring stations in DKI Jakarta. The meteorological variables consist of average temperature, average humidity, solar radiation, and wind speed.
2. Prediction model without meteorological predictors: CO and PM10 are predicted without meteorological variables. This is a univariate modeling scenario, with predictions based only on the CO and PM10 data from the previous time step.
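The two scenarios differ only in how each input sample is assembled. The sketch below builds input and target pairs for both cases with a lag of one day. It assumes that the meteorological values are taken from the same previous day as the pollutant index, which the source does not specify. All names and numbers are hypothetical.

```python
def make_samples(target, predictors=None, lag=1):
    """Pair each day's index with input features taken `lag` days earlier.

    target: pollutant index values in date order.
    predictors: optional per-day tuples of meteorological values, same length as target.
    """
    inputs, outputs = [], []
    for t in range(lag, len(target)):
        features = [target[t - lag]]  # univariate case: previous index only
        if predictors is not None:
            # multivariate case: meteorological values first, then the previous index
            features = list(predictors[t - lag]) + features
        inputs.append(features)
        outputs.append(target[t])
    return inputs, outputs


# Hypothetical CO index and (temperature, humidity, solar radiation, wind speed) per day.
co_index = [30, 32, 35, 31]
meteo = [(28.1, 75, 5.2, 2.0), (27.9, 80, 4.1, 1.5), (28.4, 78, 6.0, 2.2), (29.0, 70, 6.5, 1.8)]

x_uni, y_uni = make_samples(co_index)
x_multi, y_multi = make_samples(co_index, meteo)
```

Each univariate sample has one feature and each multivariate sample has five. This matches the input node counts described for the two scenarios.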

Prediction Modeling using LSTM
The LSTM modeling stage is carried out to obtain a model that can predict CO and PM10 at the 5 air quality monitoring stations in DKI Jakarta. The modelling predicts the ISPU at each station both with the meteorological factors that affect air quality and without them. Initial hyperparameter values are chosen randomly at the modelling stage, and grid search is then used to select the best LSTM architecture from several candidate values for each hyperparameter. The hyperparameters initialized in this study are the number of nodes in the input and output layers, the optimizer, the activation function, the learning rate, and the decay. The number of neurons in the LSTM layer to be tested in the grid search is determined using Equation (2) [16].

$$N_h = \frac{N_s}{\alpha \, (N_i + N_o)} \qquad (2)$$

where: N_h = number of neurons in the LSTM layer, N_i = number of input neurons, N_o = number of output neurons, N_s = number of samples in the training data, α = degree of freedom, between 2 and 10

The activation function used in this study is tanh, which maps its input to the range -1 to 1. The tanh activation function can be seen in Equation (3).

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (3)$$

where: x = input data
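To make the neuron-count rule and the tanh activation concrete, the sketch below enumerates candidate LSTM layer sizes for α from 2 to 10 and evaluates tanh directly from its exponential form. It assumes the commonly cited form of the rule, N_s / (α (N_i + N_o)), and assumes the result is rounded down. The sample count of 1240 is a hypothetical figure, not a value reported in the study.

```python
import math


def hidden_neuron_candidates(n_samples, n_inputs, n_outputs, alphas=range(2, 11)):
    """Candidate hidden-layer sizes N_s / (alpha * (N_i + N_o)), rounded down, at least 1."""
    return {
        alpha: max(1, n_samples // (alpha * (n_inputs + n_outputs)))
        for alpha in alphas
    }


def tanh_manual(x):
    """tanh from its exponential definition; math.tanh is safer for large |x|."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


# Hypothetical setup: 1240 training samples, 5 inputs, 1 output.
candidates = hidden_neuron_candidates(n_samples=1240, n_inputs=5, n_outputs=1)
activation = tanh_manual(0.5)
```

Smaller values of α give larger layers, so the resulting set of sizes can serve as the grid of neuron counts explored during tuning.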

Model Evaluation
The model evaluation stage consists of testing the model and analyzing the results on the testing data. Root Mean Square Error (RMSE) and correlation were used to evaluate the model's performance. RMSE measures the level of accuracy of a prediction model [17]. It is easy to implement and has been frequently used in studies related to forecasting [18]. The RMSE can be calculated using Equation (5).

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\tilde{y}_i - y_i\right)^2} \qquad (5)$$

where: ỹ_i = predicted value, y_i = actual value, n = number of data points

The correlation coefficient is calculated using a built-in correlation function in Python. Its value can be calculated using Equation (6).

$$r = \frac{S_{xy}}{S_x \, S_y} \qquad (6)$$

where: S_xy = covariance of the actual and predicted values, S_x = standard deviation of the actual values, S_y = standard deviation of the predicted values
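Both evaluation metrics can be written in a few lines of plain Python. This is a minimal sketch for illustration, not the built-in function used in the study. The sample actual and predicted values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def pearson_r(actual, predicted):
    """Correlation coefficient: covariance divided by the product of standard deviations."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / (n - 1)
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / (n - 1))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / (n - 1))
    return cov / (sd_a * sd_p)


# Hypothetical index values, for illustration only.
actual = [50, 60, 70, 80]
predicted = [52, 58, 73, 79]
error = rmse(actual, predicted)
r = pearson_r(actual, predicted)
```

A lower RMSE and a correlation closer to 1 both indicate a better fit. The two can disagree, because RMSE is sensitive to a few large errors while correlation measures only how closely the predictions track the shape of the actual series.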

RESULT AND DISCUSSION
The LSTM [8] was implemented using the Keras library in the Python programming language. The LSTM architecture of the PM10 and CO prediction models with and without meteorological predictors can be seen in Table 1. In both modeling scenarios, the number of input nodes equals the number of independent variables. In the scenario with meteorological predictors, the input consists of the meteorological variables (air temperature, air humidity, solar radiation, and wind speed) and the pollutant index of the previous day. In the scenario without meteorological predictors, there is one input node, and the CO or PM10 value is predicted from the CO or PM10 index of the previous day. The activation function used in each LSTM layer is tanh. The output of the output layer is the predicted pollutant index. Adaptive Moment Estimation (Adam) is the optimizer used in both modeling scenarios.
CO and PM10 prediction modelling in both scenarios used grid search hyperparameter tuning to determine the LSTM parameter values. Table 2 presents the parameter combinations that achieved optimal results in the grid search. This research built prediction models of CO and PM10 based on the air quality standard index at the 5 air quality monitoring stations in DKI Jakarta under two modeling scenarios, with and without meteorological predictors. The results of CO prediction modeling are shown in Table 3, and the results of PM10 prediction modeling are shown in Table 4.
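Grid search can be sketched as an exhaustive loop over all parameter combinations. In the sketch below, the grid values and the scoring function are hypothetical stand-ins: in practice the scoring function would train an LSTM with the given parameters and return its error, and the study's own grid is the one reported in Table 2.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; not the values used in the study.
param_grid = {
    "units": [20, 50, 100],
    "learning_rate": [0.001, 0.01],
    "decay": [0.0, 0.001],
}


def toy_score(params):
    """Stand-in for 'train an LSTM and return its RMSE'; lowest at units=50, lr=0.01, decay=0."""
    return (
        abs(params["units"] - 50) / 100
        + abs(params["learning_rate"] - 0.01) * 10
        + params["decay"]
    )


best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the grid sizes (here 3 × 2 × 2 = 12 evaluations), which is one reason to keep the number of candidate values per hyperparameter small.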

Model Evaluation
After the model is built for each station, it is evaluated using RMSE and correlation. The comparison of the two CO modeling scenarios can be seen in Figure 12. At DKI 1, DKI 3, DKI 4, and DKI 5, the best CO prediction model, with the lowest RMSE and the highest correlation, is the model without meteorological predictors. Its RMSE is 4.646 at DKI 1, 6.762 at DKI 3, 14.804 at DKI 4, and 13.160 at DKI 5. Meanwhile, at DKI 2, the prediction model using meteorological predictors obtains a smaller RMSE (4.622) than the model without them, but a lower correlation.

As this comparison shows, the performance of the CO prediction model at each station is not influenced by the use of meteorological variables as inputs to predict CO in several areas of DKI Jakarta. At DKI 3, DKI 4, and DKI 5, prediction models without meteorological predictors obtain a smaller RMSE than models with meteorological predictors. However, the difference in RMSE between the two modeling scenarios is insignificant. Figure 13 is a boxplot of CO at the 5 monitoring stations in DKI Jakarta, which shows the number and value of outliers at each station. From Figure 13, it can be seen that DKI 1, DKI 2, and DKI 3 have the fewest outliers, and these three stations produce a smaller RMSE than DKI 4 and DKI 5.

Figure 13. Boxplot of CO at the five stations

Furthermore, the comparison of the PM10 prediction models at each station under the two modeling scenarios can be seen in Figure 14. At every station, the best prediction model, with a smaller RMSE and a higher correlation, is the PM10 prediction model with meteorological predictors. The difference in RMSE for the PM10 prediction model is also influenced by the number of outliers. For example, there are no outliers in the PM10 data at DKI 2, and there are only a few outliers in the PM10 data at DKI 1 and DKI 5. Meanwhile, DKI 3 and DKI 4 have many outliers, as shown in the boxplot of PM10 in Figure 15.

Based on correlation analysis, CO does not strongly correlate with the meteorological variables. The strongest correlation between CO and a meteorological variable is with ff_avg (wind speed) at DKI 2, which is -0.24. The absence of a strong correlation between CO and the meteorological variables explains why the CO prediction models with and without meteorological predictors perform similarly.

Figure 16. Correlation matrix between CO and PM10 with meteorological variables in each SPKU in DKI Jakarta

Prediction models for the CO and PM10 indices at five air quality monitoring stations in DKI Jakarta were successfully created using LSTM [8]. The results show that the use of meteorological predictors does not affect the CO prediction model's performance but does influence the PM10 prediction model. For PM10, the scenario with meteorological predictors yields a smaller RMSE at each monitoring station, and the correlation between the predicted results and the actual values in the training data is stronger than in the modeling without meteorological predictors.
The reasonably strong correlation between the PM10 pollutant index and the meteorological variables used as predictors influences the model's performance. Meanwhile, the CO index does not strongly correlate with the meteorological variables. In other studies, air temperature, humidity, and wind speed negatively correlate with PM10 concentrations [4], and solar radiation, rainfall, humidity, and hotspots are related to PM10 [5]. However, our research finds that 70% of CO pollutant sources are linked to motor vehicle emissions. It is therefore hoped that the government can address the surge in motor vehicles in DKI Jakarta to prevent air pollution.

CONCLUSION
This study has successfully built prediction models for CO and PM10 at 5 air quality monitoring stations in DKI Jakarta using LSTM. The modeling is carried out using two scenarios: LSTM modeling with meteorological predictors and LSTM modeling without them. The results show that meteorological predictors do not affect the CO prediction model but do influence the PM10 prediction model. For PM10, prediction models using meteorological predictors produce a smaller RMSE and stronger correlation coefficients. This occurs because PM10 correlates more strongly with the meteorological variables than CO does. The meteorological variables do not affect the CO index at the 5 air quality monitoring stations in DKI Jakarta. This indicates that the level of CO pollutants tends to be influenced by human activities, for example motorized vehicles, which are the largest source of CO. The model can only predict the CO and PM10 air pollutant standard index for the next day at each air quality monitoring station in DKI Jakarta. Based on these results, further research can add other factors that correlate more strongly with the CO and PM10 of each SPKU, such as human factors and human activities. Furthermore, this study uses daily CO and PM10 data, which contain many outliers, so further research can use hourly data to improve the model's performance and add more lags to predict further ahead.