Bengali News Headline Categorization Using Optimized Machine Learning Pipeline


Prashengit Dhar 1,* Zainal Abedin 2

1. Department of Computer Science and Engineering, Cox’s Bazar City College, Bangladesh

2. Faculty of Science Engineering and Technology, University of Science & Technology Chittagong- 1079, Bangladesh

* Corresponding author.


Received: 15 Oct. 2020 / Revised: 10 Nov. 2020 / Accepted: 25 Dec. 2020 / Published: 8 Feb. 2021

Index Terms

Bengali text, tokenization, stop word deletion, genetic algorithm, categorization


Bengali-language news portals are now common, and their number grows day by day. With easy access to the internet, reading news online has become a routine activity. News portals carry many different types of news. The system presented in this paper categorizes the news headlines of such portals and sites, with predictions made by a machine learning algorithm trained and tested on a large collection of data. Pre-processing consists of tokenization, digit removal, removal of punctuation marks and symbols, and deletion of stop words. The stop-word set is created manually. A strong stop-word list leads to better performance, and stop-word deletion plays a leading role in feature selection. A genetic algorithm is used for optimization, which reduces the feature size. Results are also compared against the same process without optimization. The dataset was built by collecting news headlines from various Bengali news portals and sites. The results show good categorization performance.
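The pre-processing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `preprocess`, the whitespace tokenizer, and the five-word `STOP_WORDS` set are assumptions made for illustration, and the paper's manually built stop-word list is not reproduced here.

```python
import unicodedata

# Hypothetical, tiny stop-word set for illustration only; the paper's
# manually created list is not available in this text.
STOP_WORDS = {"ও", "এবং", "থেকে", "করে", "এই"}


def preprocess(headline, stop_words=STOP_WORDS):
    """Return the tokens of a headline after digit, punctuation,
    symbol, and stop-word removal."""
    cleaned = []
    for ch in headline:
        category = unicodedata.category(ch)
        # N* = digits (ASCII and Bengali), P* = punctuation, S* = symbols.
        # Combining marks (M*) are kept so Bengali vowel signs survive.
        if category[0] in ("N", "P", "S"):
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # Whitespace tokenization (an assumption; the paper does not
    # specify its tokenizer here).
    tokens = "".join(cleaned).split()
    return [tok for tok in tokens if tok not in stop_words]
```

Under these assumptions, `preprocess("ঢাকা ২০২০: বৃষ্টি ও বন্যা!")` returns `["ঢাকা", "বৃষ্টি", "বন্যা"]`: the Bengali digits, the colon, the exclamation mark, and the stop word "ও" are dropped. The resulting tokens would then be turned into features for the classifier, with the genetic algorithm selecting a reduced subset.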

Cite This Paper

Prashengit Dhar, Md. Zainal Abedin, "Bengali News Headline Categorization Using Optimized Machine Learning Pipeline", International Journal of Information Engineering and Electronic Business (IJIEEB), Vol.13, No.1, pp. 15-24, 2021. DOI: 10.5815/ijieeb.2021.01.02

