A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer

Full Text (PDF, 526KB), PP.24-37


Avijit Kumar Chaudhuri 1,*, Dilip K. Banerjee 1, Anirban Das 2

1. Department of Computer Application, SEACOM SKILLS UNIVERSITY, Kendradangal, Bolpur, Dist: Birbhum, PIN - 731 236, West Bengal

2. University of Engineering & Management, Kolkata

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2021.04.03

Received: 14 May 2021 / Revised: 17 Jun. 2021 / Accepted: 24 Jun. 2021 / Published: 8 Aug. 2021

Index Terms

Breast Cancer, Machine Learning, Feature Selection, Dataset Centric Approach, Ensemble Classifier


The World Health Organisation has identified breast cancer (BC) as the most frequent cancer among women, accounting for 15 percent of all cancer deaths. Accurate prediction is of utmost importance because it not only prevents deaths but also avoids mistreatment. Conventional diagnosis includes estimating tumor size as a sign of plausible cancer. Machine learning (ML) techniques have proven effective at predicting disease. However, ML approaches have been method centric rather than dataset centric. In this paper, the authors introduce a dataset centric approach (DCA) that deploys a genetic algorithm (GA) to identify the features and an ensemble classifier to predict using the right features. AdaBoost is one such approach: it trains the model by assigning weights to individual records rather than relying only on different dataset splits and hyper-parameter optimization. The authors simulate the results by varying the base classifier, i.e., logistic regression (LR), decision tree (DT), support vector machine (SVM), naive Bayes (NB), and random forest (RF), and by using 10-fold cross-validation and different training/testing splits of the dataset. The proposed DCA model with RF and 10-fold cross-validation demonstrated its potential with almost 100% performance in the classification results, a level that no research has reported so far. The DCA satisfies the underlying principles of data mining: the principle of parsimony, the principle of inclusion, the principle of discrimination, and the principle of optimality. The DCA is a democratic and unbiased ensemble approach: it allows all features and methods to compete at the start, then selects the most reliable chain (of steps and combinations) that gives the highest accuracy. With fewer features and splits of 50-50, 66-34, and 10-fold cross-validation, the stacked model achieves 97% accuracy. These values and the reduction in features improve upon prior research.
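The record-weighting idea attributed to AdaBoost above can be illustrated with a minimal sketch of one round of discrete AdaBoost. This is not the authors' implementation: the function name `adaboost_reweight`, the five toy labels, and the uniform starting weights are assumptions made purely for illustration, and labels are assumed to be encoded as +1 and -1.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost reweighting (hypothetical helper).

    weights: per-record weights that sum to 1
    y_true, y_pred: labels encoded as +1 / -1
    Returns the weak learner's vote (alpha) and the normalized new weights.
    """
    # Weighted error: total weight of the misclassified records.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Clamp to avoid division by zero or log(0) for a perfect/useless learner.
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified records (t * p == -1) gain weight; correct ones lose weight.
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Toy data (illustrative only, not from the paper): one of five records is wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
alpha, new_weights = adaboost_reweight([0.2] * 5, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified record carries half of the total weight, so the next base classifier is pushed to get it right.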
Further, the proposed classifier is compared with several state-of-the-art machine learning classifiers, namely random forest, naive Bayes, support vector machine with a radial basis function kernel, and decision tree. To check the strength of the proposed DCA classifier, different performance metrics have been employed: accuracy, detection rate, sensitivity, specificity, receiver operating characteristic, and area under the curve, along with statistical measures such as the Wilcoxon signed-rank test and the kappa statistic. Various splits of training and testing data, namely 50–50%, 66–34%, 80–20%, and 10-fold cross-validation, have been used in this research to test the credibility of the classification models in handling unbalanced data. Finally, the proposed DCA model demonstrated its potential with almost 100% performance in the classification results. The results have also been compared with other research on the same dataset, where the proposed classifier was found to be the best across all performance dimensions.
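As a rough illustration of some of the metrics listed above, the sketch below computes accuracy, sensitivity, specificity, and Cohen's kappa from a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, and Cohen's kappa (hypothetical helper)."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    # Agreement expected by chance, from the row and column marginals.
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((fp + tn) / n) * ((fn + tn) / n)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }

# Illustrative counts only (not taken from the paper).
metrics = binary_metrics(tp=45, fn=5, fp=10, tn=40)
print(metrics)
```

Kappa discounts the agreement a classifier would reach by chance, which is why it is often reported alongside accuracy when the classes are unbalanced.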

Cite This Paper

Avijit Kumar Chaudhuri, Dilip K. Banerjee, Anirban Das, "A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer", International Journal of Intelligent Systems and Applications (IJISA), Vol.13, No.4, pp.24-37, 2021. DOI: 10.5815/ijisa.2021.04.03

