Improved Deep Learning Model for Static PE Files Malware Detection and Classification


Sumit S. Lad 1,*, Amol C. Adamuthe 2

1. Dept. of CSE, Rajarambapu Institute of Technology, Rajaramnagar, Sangli, Maharashtra, India

2. Dept. of CS & IT, Rajarambapu Institute of Technology, Rajaramnagar, Sangli, Maharashtra, India

* Corresponding author.


Received: 5 Jun. 2021 / Revised: 8 Sep. 2021 / Accepted: 13 Oct. 2021 / Published: 8 Apr. 2022

Index Terms

Static malware analysis, Deep learning, Static PE files classification


Static analysis and detection of malware is a crucial phase in handling security threats. Many researchers report that static analysis suffers from imbalanced datasets, which produce invalid result metrics. Extracting features from raw binaries is also time-consuming, and methods such as neural networks take a long time to train. To address these problems, we propose a model that builds a feature set from the dataset and classifies static PE files efficiently. This work emphasizes feature extraction over model building: well-extracted features give better results when fed to a neural network with a minimal number of layers, and using fewer layers improves performance while reducing the resources and time needed for processing and evaluation. The EMBER datasets published by Endgame Inc., which contain PE file information, are used. Feature extraction, data standardization, and data cleaning are applied to handle imbalance and impurities in the dataset. The extracted features are then scaled to a standard form to avoid problems caused by range variations. A total of 2381 features are extracted and pre-processed from each of the 2017 and 2018 datasets.
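The scaling step mentioned above can be illustrated with a minimal sketch. The abstract does not say which scaler is used, so z-score standardization (zero mean, unit variance per feature) is an assumption here, and the function name `standardize_columns` and the toy data are hypothetical.

```python
import statistics

def standardize_columns(rows):
    """Scale each feature column to zero mean and unit variance (z-score).

    Assumption: z-score scaling; the abstract only says features are
    "scaled into a standard form".
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; a constant column would give 0.0,
    # so fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]

# Toy data (hypothetical): three samples, two features with very different ranges.
samples = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = standardize_columns(samples)
```

After scaling, both columns share the same range, so a feature measured in thousands no longer dominates one measured in single digits.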

The pre-processed data is then given to a deep learning model for training. The model is built from dense and dropout layers to minimize resource strain and deliver more accurate results in less time. The results obtained during experimentation on the EMBER v2017 and v2018 datasets are 97.53% and 94.09%, respectively. The model is trained for ten epochs with a learning rate of 0.01 and takes 4 minutes per epoch, one minute less than the Decision Tree model. In terms of precision, our model achieves 98.85%, which is 1.85% higher than existing models.
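To make the "dense and dropout layers" idea concrete, the following is a rough sketch of a forward pass through one dense layer, one dropout layer, and a sigmoid output. It is not the authors' architecture: the layer sizes, weights, dropout rate of 0.5, and all function names are hypothetical, and training (ten epochs, learning rate 0.01) is not shown.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """Fully connected layer for one sample: activation(W.x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def dropout(values, rate, rng, training):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and rescale the survivors; at inference, pass values through."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, params, rng=None, training=False, rate=0.5):
    """Return a score in (0, 1); higher is read here as more likely malicious."""
    hidden = dense(features, params["w1"], params["b1"], relu)
    hidden = dropout(hidden, rate, rng, training)
    return dense(hidden, params["w2"], params["b2"], sigmoid)[0]

# Hypothetical tiny network: 3 input features, 2 hidden units, 1 output.
params = {
    "w1": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
    "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0]],
    "b2": [0.0],
}
score = predict([1.0, 2.0, 3.0], params)
```

Dropout is active only in training mode, which is why `predict` is deterministic at inference time. A real model over 2381 features would use a deep learning framework rather than hand-written loops.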

Cite This Paper

Sumit S. Lad, Amol C. Adamuthe, "Improved Deep Learning Model for Static PE Files Malware Detection and Classification", International Journal of Computer Network and Information Security (IJCNIS), Vol.14, No.2, pp.14-26, 2022. DOI: 10.5815/ijcnis.2022.02.02

