Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification

pp. 118-130


Author(s)

Mansour Essgaer 1,*, Khamis Massud 1, Rabia Al Mamlook 2, Najah Ghmaid 3

1. Department of Information Systems, Faculty of Information Technology, Sebha University, Sebha, Libya

2. Business Administration, Trine University, Angola, Indiana, United States

3. Department of Computer Science, Faculty of Science, Sebha University, Sebha, Libya

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2025.06.09

Received: 27 Jun. 2025 / Revised: 26 Aug. 2025 / Accepted: 11 Oct. 2025 / Published: 8 Dec. 2025

Index Terms

Libyan Dialect, Dialect Classifications, Logistic Regression

Abstract

This study evaluates four classifiers for identifying Libyan dialect utterances gathered from Twitter: logistic regression, linear support vector machine (SVM), multinomial Naive Bayes (MNB), and Bernoulli Naive Bayes. The dataset is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include the inconsistent orthography and non-standard spellings typical of the Libyan dialect. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect class and were therefore excluded from further analysis. In the classification experiments, MNB achieved the highest accuracy, 85.89%, with an F1-score of 0.85741, when using a (1,2) word n-gram and (1,5) character n-gram representation. Logistic regression and linear SVM performed slightly worse, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen’s kappa, and the Matthews correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
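
To make the best-performing configuration concrete, the following is a minimal, from-scratch sketch of a multinomial Naive Bayes classifier over word (1,2) and character (1,5) n-gram counts. It is not the authors' implementation: the abstract does not say which library, weighting scheme, or smoothing was used, so the raw counts, the add-one smoothing (`alpha=1.0`), the simple concatenation of the two feature sets, the helper names, and the two-label setup are all assumptions made for illustration. The training strings are synthetic placeholder tokens, not real dialect data.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text, word_range=(1, 2), char_range=(1, 5)):
    """Count word n-grams and character n-grams of one utterance."""
    feats = Counter()
    words = text.split()
    for n in range(word_range[0], word_range[1] + 1):
        for i in range(len(words) - n + 1):
            feats["w:" + " ".join(words[i:i + n])] += 1
    for n in range(char_range[0], char_range[1] + 1):
        for i in range(len(text) - n + 1):
            feats["c:" + text[i:i + n]] += 1
    return feats


class MultinomialNB:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = ngram_features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        # Total n-gram occurrences per class, used in the likelihood denominator.
        self.totals = {c: sum(fc.values()) for c, fc in self.feature_counts.items()}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class."""
        feats = ngram_features(text)
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / n_docs)  # log prior
            denom = self.totals[c] + self.alpha * vocab_size
            for f, count in feats.items():
                if f in self.vocab:  # n-grams never seen in training are skipped
                    numer = self.feature_counts[c][f] + self.alpha
                    score += count * math.log(numer / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Synthetic placeholder utterances; "LY" and "OTHER" are hypothetical labels.
train_texts = ["taka miso taka", "miso taka loro", "bendu karo bendu", "karo seli bendu"]
train_labels = ["LY", "LY", "OTHER", "OTHER"]

model = MultinomialNB(alpha=1.0).fit(train_texts, train_labels)
pred_ly = model.predict("taka miso")
pred_other = model.predict("bendu karo")
```

In practice, the same setup is usually built from library components rather than by hand, for example scikit-learn's `CountVectorizer` (with `ngram_range` and `analyzer="char"`) feeding `MultinomialNB`; the sketch above only spells out what such a pipeline computes.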

Cite This Paper

Mansour Essgaer, Khamis Massud, Rabia Al Mamlook, Najah Ghmaid, "Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification", International Journal of Intelligent Systems and Applications (IJISA), Vol.17, No.6, pp.118-130, 2025. DOI:10.5815/ijisa.2025.06.09
