Classification of Imbalanced Data Using the Random Forest Algorithm with SMOTE and SMOTE-ENN (A Case Study on Stunting Data)
DOI: https://doi.org/10.30989/teknomatika.v17i2.1530
Keywords: Classification, Imbalanced Class, Random Forest, SMOTE, SMOTE-ENN
Abstract
The random forest algorithm is one of the most widely used machine learning classification methods because it reduces the risk of overfitting while generally improving predictive performance. However, on data with imbalanced classes, the algorithm cannot reach its full potential, particularly when predicting instances of the minority class. This article therefore applies two resampling methods to balance the data: the Synthetic Minority Oversampling Technique (SMOTE) and the Synthetic Minority Oversampling Technique with Edited Nearest Neighbors (SMOTE-ENN). The random forest algorithm is then used to classify the original data as well as the data resampled with SMOTE and with SMOTE-ENN. The case study uses stunting data consisting of 421 samples in the majority class and 79 in the minority class. The resulting accuracies are 89% on the original data, 90% on the data resampled with SMOTE-ENN, and 91% on the data resampled with SMOTE. Although the improvement is modest, resampling with SMOTE yields the best accuracy.
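For readers who want to reproduce this kind of workflow, the sketch below outlines the pipeline described above using scikit-learn and imbalanced-learn. It is a minimal illustration rather than the authors' implementation: the synthetic dataset (generated only to mirror the 421/79 class sizes), the 80/20 stratified split, and the random forest hyperparameters are all assumptions.

```python
# Minimal sketch: random forest on original vs. SMOTE vs. SMOTE-ENN resampled data.
# The dataset here is a synthetic stand-in that only mimics the 421/79 class sizes
# reported in the case study; replace it with the real stunting features and labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Illustrative imbalanced data: roughly 421 majority and 79 minority samples.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    weights=[421 / 500, 79 / 500],
    random_state=42,
)

# Hold out a stratified test set; resampling is applied to the training set only,
# so no synthetic samples leak into the evaluation data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

resamplers = {
    "original": None,
    "SMOTE": SMOTE(random_state=42),
    "SMOTE-ENN": SMOTEENN(random_state=42),
}

for name, sampler in resamplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_res, y_res)
    print(f"=== {name} ===")
    print(classification_report(y_test, clf.predict(X_test)))
```

Comparing the per-class recall in the three reports, not just overall accuracy, shows most clearly how resampling helps the minority class.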