Cross-Validation, K-Fold, Leave-one-out, Machine learning, Computational complexity
The value of k in k-fold cross-validation, a technique for training and validating machine learning predictive models, is an essential element that affects the model’s performance. A well-chosen k can yield better accuracy, while a poorly chosen k may degrade the model’s performance. In the literature, the most commonly used values of k are five (5) and ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor from very high variance. However, there is no formal rule. To the best of our knowledge, few experimental studies have attempted to investigate the effect of diverse k values in training different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms (MLAs): Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN). We observed that the value of k and the model validation performance differ from one machine learning algorithm to another for the same classification task. However, our empirical results suggest that k = 7 offers a slight increase in validation accuracy and area under the curve (AUC), with lower computational complexity than k = 10, across most of the MLAs. We discuss the study outcomes in detail and outline some guidelines to help beginners in the machine learning field select the best k value and machine learning algorithm for a given task.
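The kind of comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' experimental setup: the synthetic dataset, the scikit-learn estimators and their hyperparameters, the stratified shuffled folds, and the helper names `make_models` and `compare_k_values` are all assumptions made for the example. Only the k values and the four algorithm families come from the abstract.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# k values taken from the abstract.
K_VALUES = (3, 5, 7, 10, 15, 20)


def make_models(seed=0):
    """Return the four algorithm families; hyperparameters are assumed."""
    return {
        "GBM": GradientBoostingClassifier(n_estimators=20, random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(random_state=seed),
        "KNN": KNeighborsClassifier(),
    }


def compare_k_values(X, y, k_values=K_VALUES, seed=0):
    """Run k-fold cross-validation for every (model, k) pair.

    Returns a dict keyed by (model name, k) holding the mean validation
    accuracy, the mean AUC, and the wall-clock time of the whole run.
    """
    results = {}
    for name, model in make_models(seed).items():
        for k in k_values:
            # Stratified, shuffled folds are an assumption of this sketch.
            cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
            start = time.perf_counter()
            scores = cross_validate(
                model, X, y, cv=cv, scoring=["accuracy", "roc_auc"]
            )
            elapsed = time.perf_counter() - start
            results[(name, k)] = {
                "accuracy": scores["test_accuracy"].mean(),
                "auc": scores["test_roc_auc"].mean(),
                "seconds": elapsed,
            }
    return results


# Small synthetic binary classification task (assumed, for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
results = compare_k_values(X, y)

for (name, k), r in sorted(results.items()):
    print(f"{name:>3} k={k:<2} acc={r['accuracy']:.3f} "
          f"auc={r['auc']:.3f} time={r['seconds']:.2f}s")
```

Wall-clock time is used here only as a rough proxy for computational cost: a larger k means more model fits, each on a larger training portion. The numbers printed for a synthetic dataset say nothing about the paper's findings; the sketch only shows the shape of such an experiment.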
Isaac Kofi Nti, Owusu Nyarko-Boateng, Justice Aning, "Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation", International Journal of Information Technology and Computer Science(IJITCS), Vol.13, No.6, pp.61-71, 2021. DOI:10.5815/ijitcs.2021.06.05