A New Hybrid Genetic and Information Gain Algorithm for Imputing Missing Values in Cancer Genes Datasets

Full Text (PDF, 1197KB), PP.20-33



O. M. Elzeki 1,*, M. F. Alrahmawy 1, Samir Elmougy 1

1. Faculty of Computers and Information, Mansoura University, Mansoura, Dakahliya, Egypt

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2019.12.03

Received: 17 Jul. 2019 / Revised: 15 Aug. 2019 / Accepted: 12 Sep. 2019 / Published: 8 Dec. 2019

Index Terms

Data Mining, Genetic Algorithm, Information Gain, Missing Values Imputation, DNA Microarray, Classification


Abstract

A DNA microarray can represent thousands of genes for studying tumors and genetic diseases in humans. DNA microarray datasets commonly contain missing values, so handling those values is a crucial preprocessing step. This paper presents a new algorithm, named EMII, for imputing missing values in medical datasets. EMII evolutionarily combines Information Gain (IG) and a Genetic Algorithm (GA) to generate the imputed values. Unlike other GA implementations, EMII is column-oriented rather than instance-oriented, which increases the correlation between each column and the class in the same dataset. EMII is evaluated by imputing generated missing values in four standard cancer gene expression datasets (Colon, Leukemia, Lung cancer-Michigan, and Prostate) and comparing the imputed datasets against the original complete datasets. The experimental results show that the values imputed by EMII are almost the same as the original values, and that the applied classifiers reach an accuracy on the imputed datasets similar to their accuracy on the original complete datasets. EMII has a running time of Θ(n²), where n is the total number of columns.
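The idea of a column-oriented imputation that uses IG as a GA fitness can be illustrated with a minimal sketch. This is not the EMII algorithm itself, whose details are not given on this page. The function names, the equal-width binning, the uniform crossover, the Gaussian mutation, and all parameter values are assumptions chosen only for illustration.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(column, labels, bins=2):
    """IG of a numeric column w.r.t. the class, after equal-width binning (assumed)."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    binned = [min(int((v - lo) / width), bins - 1) for v in column]
    groups = {}
    for b, y in zip(binned, labels):
        groups.setdefault(b, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def impute_column(column, labels, generations=30, pop_size=20, seed=0):
    """Fill None entries of ONE column with a tiny GA whose fitness is IG (hypothetical)."""
    rng = random.Random(seed)
    missing = [i for i, v in enumerate(column) if v is None]
    if not missing:
        return list(column)
    observed = [v for v in column if v is not None]
    lo, hi = min(observed), max(observed)

    def fill(genes):
        filled = list(column)
        for i, g in zip(missing, genes):
            filled[i] = g
        return filled

    def fitness(genes):
        return information_gain(fill(genes), labels)

    # Each individual holds one candidate value per missing cell of this column.
    population = [[rng.uniform(lo, hi) for _ in missing] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            k = rng.randrange(len(child))  # mutate one gene, clamped to observed range
            child[k] = min(hi, max(lo, child[k] + rng.gauss(0, (hi - lo) * 0.1)))
            children.append(child)
        population = parents + children
    return fill(max(population, key=fitness))
```

Calling `impute_column([1.0, 1.2, None, 5.0, 5.3, None], [0, 0, 0, 1, 1, 1])` returns a copy of the column with the two `None` cells replaced by values inside the observed range. An evaluation in the spirit of the abstract would hide known values, impute them, and compare the result with the original column, for example by root-mean-square error or by classifier accuracy.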

Cite This Paper

O. M. Elzeki, M. F. Alrahmawy, Samir Elmougy, "A New Hybrid Genetic and Information Gain Algorithm for Imputing Missing Values in Cancer Genes Datasets", International Journal of Intelligent Systems and Applications (IJISA), Vol. 11, No. 12, pp. 20-33, 2019. DOI: 10.5815/ijisa.2019.12.03

