IJITCS Vol. 18, No. 2, 8 Apr. 2026
Keywords: Hepatitis C, SMOTE, ADASYN, Random Forest Classifier, AdaBoost Classifier, SHAP, LIME
Hepatitis, a severe and widespread disease, poses significant challenges for healthcare systems, including limited diagnostic resources, delayed detection, and inadequate treatment infrastructure. This work addresses these issues by developing a machine-learning predictive system to classify hepatitis severity. By employing Logistic Regression, Random Forest, SVM, KNN, and ensemble techniques such as AdaBoost, CatBoost, and Gradient Boosting, the system supports early detection and severity assessment. Class imbalance was addressed by applying the ADASYN and SMOTE oversampling methods to two separate datasets. For Dataset 1 with ADASYN, the achieved accuracies were 88.11% for Logistic Regression, 98.92% for Random Forest, 97.30% for AdaBoost, and 96.22% for Gradient Boosting. When SMOTE was employed on Dataset 1, Random Forest and Gradient Boosting reached accuracies of 98.38% and 96.76%, respectively. On Dataset 2, AdaBoost achieved an accuracy of 93.75% under both ADASYN and SMOTE. These models analyze clinical data to deliver accurate, timely predictions, reducing the burden on resource-constrained healthcare systems. Ensemble methods improve model robustness and accuracy, supporting better decision-making and efficient resource allocation. Furthermore, SHAP offers global explanations of feature importance and force plots for local interpretations, while LIME increases the interpretability of results from black-box models, facilitating effective hepatitis management. Future work will focus on integrating interoperability standards, such as HL7 FHIR, to enable real-time data exchange, facilitating seamless risk assessment and clinical decision support within healthcare workflows.
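The abstract describes balancing the training data with SMOTE and ADASYN before fitting the classifiers. The paper's own code and datasets are not shown here; as a rough illustration of the SMOTE idea (interpolating between a minority sample and one of its k nearest minority neighbours), the following is a minimal NumPy-only sketch on a hypothetical two-feature toy set, not the paper's clinical data. ADASYN works similarly but biases generation toward minority samples that are harder to classify.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (SMOTE, Chawla et al. 2002).

    Each synthetic point interpolates between a random minority sample and
    one of its k nearest minority-class neighbours, at a random fraction.
    """
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class only; exclude self-matches.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(X_min) - 1)
    neigh = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbours

    base = rng.integers(0, len(X_min), n_new)      # sample to start from
    nb = neigh[base, rng.integers(0, k, n_new)]    # neighbour to move toward
    gap = rng.random((n_new, 1))                   # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Hypothetical imbalanced toy set: 40 majority vs 8 minority points.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, (40, 2))
X_min = rng.normal(3.0, 0.5, (8, 2))

X_syn = smote(X_min, n_new=32, k=5, rng=1)
X_bal = np.vstack([X_maj, X_min, X_syn])
y_bal = np.array([0] * 40 + [1] * (8 + 32))
print(X_syn.shape, np.bincount(y_bal))  # (32, 2) [40 40]
```

In practice one would use the maintained `SMOTE` and `ADASYN` classes from the imbalanced-learn library rather than a hand-rolled version, and apply oversampling only to the training split to avoid leaking synthetic points into evaluation.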
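The abstract also relies on LIME to explain individual predictions of the black-box classifiers. The paper presumably uses the `lime` package; purely to illustrate the underlying idea (perturb the instance, weight perturbations by proximity, and fit a weighted linear surrogate whose coefficients act as local feature attributions), here is a self-contained sketch on synthetic data with hypothetical parameter values, using only NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# An opaque model on synthetic "clinical" data (illustration only):
# the label depends on features 0 and 1, never on features 2 and 3.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def lime_explain(model, x, n_samples=2000, width=0.75, rng=None):
    """Local surrogate in the spirit of LIME (Ribeiro et al. 2016):
    perturb x, query the black box, weight samples by proximity to x,
    and read local attributions off a weighted linear fit."""
    rng = np.random.default_rng(rng)
    Z = x + rng.normal(0, 1, (n_samples, x.size))         # perturbed neighbours
    p = model.predict_proba(Z)[:, 1]                      # black-box outputs
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / width**2)  # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_                                # local feature weights

coefs = lime_explain(model, X[0], rng=1)
print(np.round(coefs, 3))  # features 0 and 1 dominate; 2 and 3 stay near zero
```

The `lime.lime_tabular.LimeTabularExplainer` class adds discretization, categorical handling, and sparse feature selection on top of this core loop, which is why it is preferred over a hand-rolled surrogate in real analyses.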
Karthika Natarajan, Koteswara Rao Makkena, "Hepatitis C Diagnosis using Supervised Machine Learning Algorithms and Ensemble Learning Techniques", International Journal of Information Technology and Computer Science (IJITCS), Vol. 18, No. 2, pp. 161-182, 2026. DOI: 10.5815/ijitcs.2026.02.10