Estimating Missing Security Vectors in NVD Database Security Reports




Hakan Kekül 1,2,*, Burhan Ergen 3, Halil Arslan 4

1. University of Fırat, Institute of Science, Elazığ, Turkey

2. Sivas Information Technology Technical High School, Diriliş Mahallesi Rüzgarli Sokak No 21, Sivas, Turkey

3. University of Fırat, Faculty of Engineering, Computer Engineering Department, Elazığ, Turkey

4. University of Sivas Cumhuriyet, Faculty of Engineering, Computer Engineering Department, Sivas, Turkey

* Corresponding author.


Received: 11 Mar. 2022 / Revised: 8 Apr. 2022 / Accepted: 28 Apr. 2022 / Published: 8 Jun. 2022

Index Terms

Software Security, Software Vulnerability, Information Security, Text Analysis, Multiclass Classification


Detecting and analyzing software vulnerabilities is a critical task. For this reason, identified vulnerabilities have been catalogued and classified for many years. This process is still performed manually by experts, which is slow and costly. Many methods have been proposed for reporting and classifying software security vulnerabilities; the Common Vulnerability Scoring System (CVSS) is the one officially used today. The scoring system is updated regularly to cover new kinds of vulnerabilities as security perceptions change and new technologies emerge, so vulnerability reports carry different versions of the scoring system. Adding a newly published version to older vulnerability reports requires every report to be re-analyzed manually under the new framework, which demands considerable resources, time and expertise. As a result, many reports in the database lack values for one or more versions of the scoring system. The aim of this study is to estimate the missing security metrics of vulnerability reports using natural language processing and machine learning. To this end, a model that combines term frequency-inverse document frequency (TF-IDF) with the K-Nearest Neighbors (KNN) algorithm is proposed. In addition, the resulting data is made available to researchers as a new database. The results obtained are promising. A publicly available database that researchers widely accept as a reference was chosen as the data set, which makes our model easier to evaluate and compare. To the best of our knowledge, this study uses the largest dataset available from this database and is one of the few studies on the latest version of the official scoring system for classifying software security vulnerabilities. For these reasons, our study is a comprehensive and original contribution to the field.
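
The general idea of pairing TF-IDF features with a K-Nearest Neighbors classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the training descriptions, the Attack Vector labels, the tokenizer, the smoothed IDF formula, the cosine similarity measure, the value k=3 and the function name estimate_metric are all assumptions chosen for illustration, and the abstract does not state the preprocessing or parameters actually used.

```python
import math
import re
from collections import Counter

# Hypothetical toy data: invented descriptions and Attack Vector labels,
# not taken from NVD or from the paper.
TRAIN = [
    ("remote attacker can execute arbitrary code via crafted http request", "NETWORK"),
    ("remote attackers cause denial of service via crafted packet", "NETWORK"),
    ("sql injection allows remote attackers to execute sql commands via http parameter", "NETWORK"),
    ("local user can gain privileges via crafted file", "LOCAL"),
    ("local users gain root privileges via symlink attack on temporary file", "LOCAL"),
    ("buffer overflow allows local users to gain privileges", "LOCAL"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(token_lists):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """L2-normalized sparse TF-IDF vector; terms outside the vocabulary are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())


# "Training" a KNN model only means storing the labeled vectors.
TOKENS = [tokenize(text) for text, _ in TRAIN]
LABELS = [label for _, label in TRAIN]
IDF = fit_idf(TOKENS)
VECTORS = [tfidf_vector(tokens, IDF) for tokens in TOKENS]


def estimate_metric(description, k=3):
    """Majority vote over the k stored reports most similar to the description."""
    query = tfidf_vector(tokenize(description), IDF)
    ranked = sorted(range(len(VECTORS)),
                    key=lambda i: cosine(query, VECTORS[i]),
                    reverse=True)
    votes = Counter(LABELS[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(estimate_metric("crafted http request lets a remote attacker run code"))
    print(estimate_metric("local users can gain privileges through a temporary file"))
```

In a setting like the one the abstract describes, one such classifier would presumably be trained per metric, using reports that already carry the newer score as training data and applying it to reports that lack it.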

Cite This Paper

Hakan KEKÜL, Burhan ERGEN, Halil ARSLAN, "Estimating Missing Security Vectors in NVD Database Security Reports", International Journal of Engineering and Manufacturing (IJEM), Vol.12, No.3, pp. 1-13, 2022. DOI: 10.5815/ijem.2022.03.01

