A Multi-Stage Approach Combining Feature Selection with Machine Learning Techniques for Higher Prediction Reliability and Accuracy in Cervical Cancer Diagnosis

pp. 46-63


Avijit Kumar Chaudhuri 1,*, Arkadip Ray 2, Dilip K. Banerjee 3, Anirban Das 4

1. Department of Computer Application, SEACOM SKILLS UNIVERSITY, Kendradangal, Bolpur, Birbhum, 731236, West Bengal, India

2. Department of Information Technology, Government College of Engineering and Ceramic Technology, Kolkata, West Bengal, 700010, India

3. Department of Computer Application, SEACOM SKILLS UNIVERSITY, Kendradangal, Bolpur, Birbhum, 731236, West Bengal, India

4. University of Engineering & Management, Kolkata, West Bengal, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2021.05.05

Received: 7 Jul. 2021 / Revised: 9 Aug. 2021 / Accepted: 23 Aug. 2021 / Published: 8 Oct. 2021

Index Terms

Cervical Cancer, Feature Selection, Genetic Algorithm (GA), Logistic Regression (LR), Gradient Boosting (GDB)


Cervical cancer is the fourth most prevalent cancer in women; in 2020 it claimed 341,831 lives and accounted for 604,127 new cases worldwide. Reducing this mortality requires early detection of the disease, which motivates research into machine learning models that are fast, accurate, and interpretable. Using fewer features reduces the computational effort and improves interpretability. A 3-Stage Hybrid feature selection approach and a Stacked Classification model are evaluated on the cervical cancer dataset obtained from the UCI Machine Learning Repository, which has 35 features and one outcome variable. Stage-1 uses a Genetic Algorithm and Logistic Regression Architecture for Feature Selection to select twelve features that are well correlated with the class but not with one another. Stage-2 uses the same Genetic Algorithm and Logistic Regression Architecture for Feature Selection to select five features. In Stage-3, Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM), Extra Trees (ET), Random Forest (RF), and Gradient Boosting (GDB) use the five features to identify patients with or without cancer. The comparative analysis relies on train-test splits, 10-fold cross validation, several performance metrics, and statistical tests. All six classifiers (LR, NB, SVM, ET, RF, and GDB) improve across performance measures when the number of features is reduced to five. In the 66-34 split, every classifier except NB recorded 97% accuracy with five features. The Stacked model also produced accuracy above 96% with five features in the 66-34 and 80-20 splits and in 10-fold cross validation. Various performance aggregators show improved results with reduced features when compared to previous studies. Finally, with approximately 100% performance in classification results, the suggested ensemble model showed its promise.
The results were compared to those of other studies on the same dataset, and the proposed classifiers were found to be the most effective across all performance dimensions.
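To make the Stage-1 and Stage-2 idea concrete, the following is a minimal sketch of genetic-algorithm feature selection with logistic regression as the fitness evaluator. It is not the authors' implementation: the abstract does not state the GA operators, the fitness function, or any hyperparameters, so the tournament selection, single-point crossover, bit-flip mutation, elitism, per-feature `penalty`, and hold-out fitness used here are all assumptions. The data come from a hypothetical `make_data` helper that generates synthetic rows, not from the UCI cervical cancer dataset, and the function names `fit_logreg`, `accuracy`, and `ga_select` are illustrative only.

```python
import math
import random


def make_data(n=90, n_features=8, informative=(0, 3), seed=0):
    """Synthetic stand-in data: the label depends only on `informative` columns."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        X.append(row)
        y.append(1 if sum(row[j] for j in informative) > 0 else 0)
    return X, y


def fit_logreg(X, y, cols, epochs=40, lr=0.5):
    """Logistic regression on the chosen columns via batch gradient descent."""
    w, b = [0.0] * len(cols), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(cols), 0.0
        for row, target in zip(X, y):
            z = b + sum(wk * row[c] for wk, c in zip(w, cols))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for k, c in enumerate(cols):
                grad_w[k] += err * row[c]
            grad_b += err
        w = [wk - lr * g / len(X) for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def accuracy(w, b, X, y, cols):
    """Fraction of rows whose predicted class (z > 0) matches the label."""
    hits = 0
    for row, target in zip(X, y):
        z = b + sum(wk * row[c] for wk, c in zip(w, cols))
        hits += int((z > 0) == (target == 1))
    return hits / len(X)


def ga_select(X, y, pop_size=8, generations=5, p_mut=0.1, penalty=0.01, seed=1):
    """Search feature bitmasks; returns (best_mask, best_fitness, history)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    split = (2 * len(X)) // 3  # roughly a 66-34 hold-out split
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    cache = {}

    def fitness(mask):
        # Assumed fitness: hold-out accuracy minus a small cost per feature.
        if mask not in cache:
            cols = [j for j, bit in enumerate(mask) if bit]
            if not cols:
                cache[mask] = 0.0  # an empty subset cannot classify
            else:
                w, b = fit_logreg(X_tr, y_tr, cols)
                cache[mask] = accuracy(w, b, X_te, y_te, cols) - penalty * len(cols)
        return cache[mask]

    pop = [tuple(rng.randint(0, 1) for _ in range(n_features)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        nxt = pop[:2]  # elitism: the two fittest masks survive unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            nxt.append(tuple(bit ^ int(rng.random() < p_mut) for bit in child))
        pop = nxt
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, fitness(best), history


if __name__ == "__main__":
    X, y = make_data()
    mask, score, _ = ga_select(X, y)
    selected = [j for j, bit in enumerate(mask) if bit]
    print("selected columns:", selected, "fitness:", round(score, 3))
```

A second pass in the spirit of Stage-2 could rerun `ga_select` on only the columns that survive the first pass, possibly with a larger `penalty` to push toward a smaller subset. The Stage-3 classifiers and the Stacked model are not sketched here.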

Cite This Paper

Avijit Kumar Chaudhuri, Arkadip Ray, Dilip K. Banerjee, Anirban Das, "A Multi-Stage Approach Combining Feature Selection with Machine Learning Techniques for Higher Prediction Reliability and Accuracy in Cervical Cancer Diagnosis", International Journal of Intelligent Systems and Applications(IJISA), Vol.13, No.5, pp.46-63, 2021. DOI: 10.5815/ijisa.2021.05.05

