Detection and Classification of Cross-language Code Clone Types by Filtering the Nodes of ANTLR-generated Parse Tree

Pages: 43-65



Sanjay B. Ankali 1,*, Latha Parthiban 2

1. Dept. of CSE, KLE College of Engineering & Technology, Chikodi, India

2. Department of Computer Science, Pondicherry University, Community College, Lawspet, India-605008

* Corresponding author.


Received: 28 Jan. 2021 / Revised: 15 Feb. 2021 / Accepted: 15 Mar. 2021 / Published: 8 Jun. 2021

Index Terms

Cross-language clones, ANTLR parse tree, TF-IDF, cosine similarity, software forking


A complete and accurate cross-language clone detection tool can support the software forking process, in which the more reliable algorithms of legacy systems are reused from a code base in one language in a code base in another. Cross-language clone detection also helps in building code recommendation systems. This paper proposes a new technique to detect and classify cross-language clones of C and C++ programs by filtering the nodes of the ANTLR-generated parse tree, using a common grammar file, CPP14.g4. Parsing the input files with CPP14.g4 provides all the lexical and semantic information of the input source code. Selectively filtering the nodes serializes the two parse trees. Each resulting tree is represented as a term frequency-inverse document frequency (TF-IDF) vector, and the vectors are compared using cosine similarity to classify the clone types. Filtering the parse trees of C and C++ programs increases precision from 51% to 61%, and matching based on renaming the input/output expressions gives an average precision of 91.97% for small-scale repositories and 95.37% for large-scale repositories. The proposed cross-language clone detection technique achieves its highest precision, 95.37%, in finding all clone types (Types 1, 2, 3 and 4) among 16,032 semantically similar clone pairs of C and C++ code.
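The filter, TF-IDF, and cosine-similarity steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the parse trees have already been serialized into lists of node labels, and the label names, the `DROPPED_LABELS` set, and the helper functions are hypothetical. The smoothed IDF formula is one common variant, chosen here so that labels shared by both programs keep a non-zero weight. The abstract does not give thresholds for mapping similarity scores to clone types, so none are shown.

```python
import math
from collections import Counter

# Hypothetical labels to discard during filtering (assumption, not from the paper).
DROPPED_LABELS = {"declSpecifierSeq", "simpleTypeSpecifier"}


def filter_nodes(labels):
    """Keep only the node labels that are not in the dropped set."""
    return [label for label in labels if label not in DROPPED_LABELS]


def tfidf_vectors(docs):
    """Return one TF-IDF weight dict per document (a list of node labels)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each label occurs.
    df = Counter(label for doc in docs for label in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            label: (count / total)
            * (math.log((1 + n_docs) / (1 + df[label])) + 1)  # smoothed IDF
            for label, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(label, 0.0) for label, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical serialized parse-tree labels for a C file and two C++ files.
c_nodes = ["functionDefinition", "simpleTypeSpecifier", "iterationStatement",
           "additiveExpression", "jumpStatement"]
cpp_nodes = ["functionDefinition", "declSpecifierSeq", "iterationStatement",
             "additiveExpression", "jumpStatement"]
other_nodes = ["classSpecifier", "memberDeclaration", "throwExpression"]

docs = [filter_nodes(d) for d in (c_nodes, cpp_nodes, other_nodes)]
vec_c, vec_cpp, vec_other = tfidf_vectors(docs)

print(cosine_similarity(vec_c, vec_cpp))    # similar programs score high
print(cosine_similarity(vec_c, vec_other))  # unrelated programs score low
```

In this toy input the C and C++ label lists become identical after filtering, so their score is at the top of the range, while the unrelated file shares no labels with them.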

Cite This Paper

Sanjay B. Ankali, Latha Parthiban, "Detection and Classification of Cross-language Code Clone Types by Filtering the Nodes of ANTLR-generated Parse Tree", International Journal of Intelligent Systems and Applications (IJISA), Vol.13, No.3, pp.43-65, 2021. DOI:10.5815/ijisa.2021.03.05

