A Comparative Study of Random Forest and Double Random Forest Models from the Viewpoint of Their Interpretability

Adlina Khairunnisa(1), Khairil Anwar Notodiputro(2), Bagus Sartono(3)


(1) Faculty of Mathematics and Natural Sciences, Institut Pertanian Bogor, Indonesia
(2) Faculty of Mathematics and Natural Sciences, Institut Pertanian Bogor, Indonesia
(3) Faculty of Mathematics and Natural Sciences, Institut Pertanian Bogor, Indonesia

Abstract

Purpose: This study compares tree-ensemble models, namely Random Forest (RF) and Double Random Forest (DRF), from the viewpoint of model interpretability. Both models offer strong predictive performance, but their inner workings are not readily understandable to humans. Model interpretability is required to explain the relationship between the predictors and the response, and we apply association rules to distill the essence of the models.

Methods: This study compares the interpretability of RF and DRF using association rules. Each decision tree in each model is converted into if-then rules by following the path from the root node to the leaf nodes. The data were selected so that they would be prone to underfitting, because other researchers have shown that DRF overcomes the underfitting problem faced by RF. A simulation study was conducted to evaluate the rules extracted from RF and DRF, and the rules from both models were compared in terms of interpretability based on their support and confidence values. Association rules were also applied to identify the characteristics of the working poor in Yogyakarta.
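
A minimal sketch of the rule-extraction step described above, assuming scikit-learn's RandomForestClassifier and an illustrative public dataset; it is not the authors' implementation. Each tree is converted into if-then rules by walking every root-to-leaf path, and each rule is then scored by its support and confidence on the training data.

    # Hedged illustration only: dataset, forest settings, and helper names are assumptions.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=10, max_depth=3,
                                    random_state=0).fit(X, y)

    def tree_to_rules(estimator):
        """Return one (conditions, predicted_class) pair per leaf of a fitted tree."""
        t = estimator.tree_
        rules = []

        def walk(node, conditions):
            if t.children_left[node] == -1:              # reached a leaf node
                rules.append((conditions, int(np.argmax(t.value[node]))))
                return
            feat, thr = t.feature[node], t.threshold[node]
            walk(t.children_left[node],  conditions + [(feat, "<=", thr)])
            walk(t.children_right[node], conditions + [(feat, ">",  thr)])

        walk(0, [])
        return rules

    def support_confidence(rule, X, y):
        """Support: fraction of rows satisfying antecedent and consequent.
        Confidence: fraction of antecedent-satisfying rows with the predicted class."""
        conditions, pred = rule
        mask = np.ones(len(X), dtype=bool)
        for feat, op, thr in conditions:
            mask &= (X[:, feat] <= thr) if op == "<=" else (X[:, feat] > thr)
        if not mask.any():
            return 0.0, 0.0
        return float(np.mean(mask & (y == pred))), float(np.mean(y[mask] == pred))

    all_rules = [r for est in forest.estimators_ for r in tree_to_rules(est)]
    scored = sorted(((support_confidence(r, X, y), r) for r in all_rules),
                    key=lambda item: item[0], reverse=True)
    (best_sup, best_conf), best_rule = scored[0]
    print(f"{len(all_rules)} rules extracted; "
          f"best rule: support={best_sup:.3f}, confidence={best_conf:.3f}")

The same scoring can be applied unchanged to the rules extracted from a DRF, so the two ensembles can be compared on the support and confidence of their best rules.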

Result: The simulation results revealed that DRF outperformed RF in terms of interpretability, especially when modelling underfit data. In addition, using empirical data, we characterized the profile of the working poor in Yogyakarta based on the most frequent rules.

Novelty: Research on interpretable DRF is still rare, especially interpretation using association rules. Previous studies focused only on interpreting random forest models with association rules. In this study, the rules extracted from the random forest and double random forest models are compared based on the quality of the extracted rules.

Keywords

Interpretability; Association rules; Rule extraction; Double random forest
