Kuntoro Kuntoro*
The most popular model in machine learning is K-Nearest Neighbours (KNN). It is used for solving classification. Moreover, K- Fold Crossvalidation is an important tool for assessing the performance of machine learning in doing KNN algorithm given available data. Compared to traditional statistical methods, both algorithms are effective to be implemented in big data. A supervised machine learning approach using KNN and K- Fold Cross- Validation algorithms is implemented in this study. For learning process, data of covid 19 is obtained from website. Four predictors such as new case, reproduction rate, new case in ICU, and hospitalized new case are selected to predict the target, new cases will be alive or will die. After cleaning process, 13,223 of 132,645 data sets are selected. This is considered as original data sets. When K-Fold Cross-Validation is executed by Python showing User Warning, the original data sets are replicated to be 264,441 data sets. This is considered as replicated data sets. Performance of KNN algorithm in predicting the target using original data sets shows lower accuracy than that using replicated data sets (75% vs. 92%). The number of members (K) using original data sets is lower than that using replicated data sets (7 vs. 12). Performance of K-Fold Cross-Validation using original data sets shows very small mean accuracy than that using replicated data sets (0.054 vs. 0.998). In using replicated data sets, mean accuracy shows consistent value until 5 splits while in using original data sets mean accuracy only shows in 2 splits. In using big data from various sources, it is recommended to implement appropriate Python libraries which can remove not a number (nan) and messy record effectively. It is also recommended to develop combine and comprehensive algorithm of KNN and K-Fold Cross-Validation.
分享此文章