GraphConvDeep: A Deep Learning Approach for Enhancing Binary Code Similarity Detection using Graph Embeddings

pp. 72-88

Author(s)

Nandish M. 1,*, Jalesh Kumar 1, Mohan H. G. 1, Manjunath Sargur Krishnamurthy 2

1. Dept. of Computer Science & Engineering, JNNCE Shivamogga, Visvesvaraya Technological University, Belagavi – 590018, India

2. JP Morgan & Chase Co., Houston, USA

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2025.03.05

Received: 28 Aug. 2024 / Revised: 9 Mar. 2025 / Accepted: 4 Apr. 2025 / Published: 8 Jun. 2025

Index Terms

Binary Code Similarity Detection, Deep Learning, Graph Convolution Networks, Graph Embedding

Abstract

Binary code similarity detection (BCSD) identifies similarities between two or more pieces of binary code (machine code or assembly code) without access to the original source code. BCSD is used in vulnerability detection, plagiarism detection, malware analysis, copyright infringement detection, and software patching. Numerous approaches based on graph matching and deep learning have been developed for these applications, but existing solutions suffer from low detection accuracy and lack cross-architecture analysis. This work introduces GraphConvDeep, a cross-platform graph deep learning approach that uses graph convolution networks to compute embeddings. GraphConvDeep relies on the control flow graph (CFG) of each binary function and detects similarity by evaluating the distance between the embeddings of two functions. Experimental results show that GraphConvDeep detects similarities more accurately than other state-of-the-art methods, achieving an average accuracy of 95% across different platforms. The approach also achieves an area under the curve (AUC) of 96%, particularly in identifying real-world vulnerabilities.
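
The general idea of embedding a CFG with graph convolutions and comparing two embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the GraphConvDeep architecture: the standard normalized-adjacency propagation rule, the mean pooling, the fixed weights, the two per-block features, the use of cosine similarity as the comparison measure, and all function and variable names are hypothetical choices made for the example. CFG edges are treated as undirected for simplicity.

```python
import math

def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # self-loops guarantee deg >= 1
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(norm_adj, feats, weight):
    """One graph convolution: ReLU(norm_adj @ feats @ weight)."""
    out = matmul(matmul(norm_adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def embed_graph(adj, feats, weights):
    """Apply the layers in order, then mean-pool node vectors into one vector."""
    norm_adj = normalize_adjacency(adj)
    hidden = feats
    for weight in weights:
        hidden = gcn_layer(norm_adj, hidden, weight)
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical per-block features: [instruction count, call count].
# Function A: diamond-shaped CFG (edges 0-1, 0-2, 1-3, 2-3).
ADJ_A = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
FEATS_A = [[3.0, 0.0], [2.0, 1.0], [4.0, 0.0], [1.0, 0.0]]

# Function B: the same CFG with its basic blocks listed in reverse order.
ADJ_B = ADJ_A  # the diamond keeps the same matrix under this relabelling
FEATS_B = list(reversed(FEATS_A))

# Function C: a straight-line chain (edges 0-1, 1-2, 2-3), different features.
ADJ_C = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
FEATS_C = [[1.0, 2.0], [1.0, 3.0], [2.0, 2.0], [1.0, 4.0]]

# Fixed illustrative weights; a real model would learn these from data.
WEIGHTS = [[[0.5, 0.1], [0.2, 0.7]]]

emb_a = embed_graph(ADJ_A, FEATS_A, WEIGHTS)
emb_b = embed_graph(ADJ_B, FEATS_B, WEIGHTS)
emb_c = embed_graph(ADJ_C, FEATS_C, WEIGHTS)

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print("similarity(A, B):", sim_ab)
print("similarity(A, C):", sim_ac)
```

Because the pooled embedding does not depend on the order in which basic blocks are numbered, functions A and B score as near-identical, while the structurally different function C scores lower. A learned model would replace the fixed weights with trained parameters and choose a decision threshold on the similarity score.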

Cite This Paper

Nandish M., Jalesh Kumar, Mohan H. G., Manjunath Sargur Krishnamurthy, "GraphConvDeep: A Deep Learning Approach for Enhancing Binary Code Similarity Detection using Graph Embeddings", International Journal of Computer Network and Information Security (IJCNIS), Vol. 17, No. 3, pp. 72-88, 2025. DOI:10.5815/ijcnis.2025.03.05