Feature Selection: A Practitioner View

Author(s)

Saptarsi Goswami 1,*, Amlan Chakrabarti 2

1. Institute of Engineering & Management, Kolkata, India

2. A.K.Choudhury School of Information and Technology, Calcutta University, Kolkata, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2014.11.10

Received: 21 Feb. 2014 / Revised: 2 May 2014 / Accepted: 18 Jul. 2014 / Published: 8 Oct. 2014

Index Terms

Feature Selection, Supervised, Unsupervised, Commercial, Application Domain

Abstract

Feature selection is one of the most important preprocessing steps in data mining and knowledge engineering. In this short review paper, apart from presenting a brief taxonomy of current feature selection methods, we review the feature selection methods that are used in practice. We then compile a near-comprehensive list of problems that have been solved using feature selection across technical and commercial domains, which can serve as a valuable tool for practitioners in industry and academia. We also present empirical results of filter-based methods on various datasets, covering the tasks of classification, regression, text classification, and clustering. Finally, we compare filter-based ranking methods using rank correlation.

Cite This Paper

Saptarsi Goswami, Amlan Chakrabarti, "Feature Selection: A Practitioner View", International Journal of Information Technology and Computer Science (IJITCS), vol. 6, no. 11, pp. 66-77, 2014. DOI: 10.5815/ijitcs.2014.11.10
