Comparison of Machine Learning Algorithms in Domain Specific Information Extraction


M. Rajasekar 1,* Angelina Geetha 1

1. Hindustan Institute of Technology and Science, Chennai, India

* Corresponding author.


Received: 28 Oct. 2021 / Revised: 24 Nov. 2021 / Accepted: 20 Dec. 2021 / Published: 8 Feb. 2023

Index Terms

Machine Learning, Information Extraction, Gynecology, Naive Bayes classifier, Support Vector Machines, K-nearest neighbour classifier


Information extraction is an essential task in natural language processing (NLP). It is the process of extracting useful information from unstructured text. Information extraction supports many NLP applications, such as sentiment analysis, named entity recognition, medical data extraction, and feature extraction from research articles and agricultural texts. Most information extraction applications are built on machine learning models. Many research works have been carried out on machine learning based information extraction from English texts in various domains, such as biomedicine, the share market, weather, business, social media, agriculture, engineering, and tourism. However, domain-specific information extraction for a particular regional language is still a challenge. Many classification algorithms exist, but selecting the appropriate one for a given domain is difficult. In this paper, three well-known classification algorithms (the Naive Bayes classifier, support vector machines, and the K-nearest neighbour classifier) are selected to perform information extraction by classifying gynecological domain data in the Tamil language. The main objective of this research work is to determine which machine learning methods are suitable for domain-specific text documents in Tamil. A total of 1635 documents are used in the classification task, in which the three selected algorithms extract the features. Evaluation of each model shows that the Naive Bayes classification model provides the highest accuracy (84%) on the gynecological domain data. The F1-score, error rate, and execution time are also evaluated for the selected machine learning models. The performance evaluation shows that the Naive Bayes classification model gives the best results of the three. It is concluded that the Naive Bayes classification model is the best of the three models for classifying gynecological domain text in the Tamil language.
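The four evaluation measures named in the abstract (accuracy, F1-score, error rate, and execution time) can be illustrated with a minimal sketch. It is not the authors' code: the function names `evaluate` and `timed_evaluate` are hypothetical, the F1-score is macro-averaged over classes as an assumption (the abstract does not say which averaging was used), and the labels in the usage note are made-up toy values, not data from the paper.

```python
import time


def evaluate(y_true, y_pred):
    """Return accuracy, error rate, and macro-averaged F1 for two label lists.

    Macro averaging is an assumption; the abstract does not specify it.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2TP / (2TP + FP + FN); defined as 0 when undefined.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "macro_f1": sum(f1_scores) / len(f1_scores),
    }


def timed_evaluate(predict, documents, y_true):
    """Run a classifier's predict function over documents and time it.

    `predict` is any callable mapping one document to one label.
    """
    start = time.perf_counter()
    y_pred = [predict(doc) for doc in documents]
    elapsed = time.perf_counter() - start

    scores = evaluate(y_true, y_pred)
    scores["seconds"] = elapsed
    return scores
```

For example, `evaluate(["a", "a", "b", "b"], ["a", "b", "b", "b"])` scores three of four toy labels as correct. Passing each trained model's prediction function to `timed_evaluate` with the same test documents would give the kind of side-by-side comparison the abstract describes.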

Cite This Paper

M. Rajasekar, Angelina Geetha, "Comparison of Machine Learning Algorithms in Domain Specific Information Extraction", International Journal of Mathematical Sciences and Computing (IJMSC), Vol.9, No.1, pp. 13-22, 2023. DOI: 10.5815/ijmsc.2023.01.02

