IJITCS Vol. 17, No. 6, 8 Dec. 2025
Tamil Slang Classification, FRAE-PSA Module, Multi-Task Learning, Acoustic Feature Fusion, Regional Accent Detection and Emotion Detection, Pyramid Split Attention (PSA), Industry, Innovation and Infrastructure
In artificial intelligence, voice classification is important for a wide range of applications. Tamil, one of the world's oldest languages, has rich regional slang that differs in tone, pronunciation, and emotive expression. These slang variants are difficult to classify because they are informal and annotated audio data is scarce. This study proposes an enhanced deep learning framework for Tamil slang classification using a balanced audio corpus. The framework integrates data-specific pre-processing techniques, including Mel spectrograms, chroma features, and spectral contrast, to capture the nuanced characteristics of Tamil speech. A DenseNet backbone combined with LSTM and GRU layers models both temporal and spectral information. The proposed FRAE-PSA module is a novel adaptation of the Pyramid Split Attention (PSA) mechanism designed to handle regional and affective variation in speech. Unlike existing PSA- or Transformer-based approaches, FRAE-PSA splits the audio frequency spectrum and adjusts attention weights dynamically based on auxiliary tasks. A multi-branch architecture fuses temporal and spectral features effectively, and multi-task learning enhances regional accent and emotion detection. Custom loss functions and lightweight networks optimize model efficiency. Experimental results show an improvement of up to 15% in classification accuracy over baseline models, demonstrating the framework's effectiveness for real-world Tamil slang classification.
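The fused acoustic front end described above can be sketched as follows. This is a minimal illustration using librosa with its default frame settings; the paper's exact extraction parameters (number of Mel bands, hop length, fusion order) are not given here, so treat these values as assumptions.

# Minimal sketch of the feature-fusion front end, assuming librosa and
# default STFT framing; n_mels and the stacking order are placeholders.
import numpy as np
import librosa

def extract_fused_features(path, sr=16000, n_mels=128):
    """Stack Mel spectrogram, chroma, and spectral contrast on the feature axis."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # n_mels x T
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)                # 12 x T
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)        # 7 x T
    # All three use the same hop length, so they share the frame axis
    # and can be concatenated feature-wise into one time-frequency map.
    return np.concatenate([mel, chroma, contrast], axis=0)          # (n_mels+19) x T

Stacking the three representations into one map is what lets a single convolutional backbone see spectral envelope, pitch-class, and peak-valley contrast jointly.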
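The frequency-split attention behaviour attributed to FRAE-PSA can be illustrated roughly as below. This PyTorch sketch is a reconstruction of the described idea, not the authors' implementation: the class name FreqSplitAttention, the band count, and the SE-style gating are all assumptions standing in for "split the frequency spectrum and adapt attention weights across the splits."

# Hedged PyTorch sketch of frequency-split attention: SE-style weights are
# computed per frequency band, then a softmax across bands lets the bands
# compete for attention. All names and shapes here are assumptions.
import torch
import torch.nn as nn

class FreqSplitAttention(nn.Module):
    def __init__(self, channels, n_bands=4, reduction=4):
        super().__init__()
        self.n_bands = n_bands
        self.se = nn.Sequential(                  # squeeze-and-excitation per band
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):                         # x: (B, C, F, T) spectrogram map
        bands = torch.chunk(x, self.n_bands, dim=2)                 # split frequency axis
        weights = torch.stack([self.se(b) for b in bands], dim=1)   # (B, n_bands, C, 1, 1)
        weights = torch.softmax(weights, dim=1)                     # compete across bands
        out = [b * weights[:, i] for i, b in enumerate(bands)]
        return torch.cat(out, dim=2)                                # reassemble frequency axis

The softmax over bands is the PSA-like step: low- and high-frequency regions are re-weighted relative to each other, which is a plausible way to emphasise the bands where accent and emotion cues concentrate.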
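The multi-task objective implied by the abstract, with one term per task (slang class, regional accent, emotion), might look like the following. The weights and the per-task losses are illustrative assumptions; the paper's custom loss functions are not specified here.

# Hypothetical weighted multi-task objective; w_main/w_aux are assumptions.
import torch.nn.functional as F

def multi_task_loss(slang_logits, accent_logits, emotion_logits,
                    slang_t, accent_t, emotion_t,
                    w_main=1.0, w_aux=0.3):
    # Main slang-classification term plus down-weighted auxiliary terms.
    return (w_main * F.cross_entropy(slang_logits, slang_t)
            + w_aux * F.cross_entropy(accent_logits, accent_t)
            + w_aux * F.cross_entropy(emotion_logits, emotion_t))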
Ramkumar R., Sureshkumar Nagarajan, Dinesh Prasanth Ganapathi, "Enhanced Deep Learning Framework for Tamil Slang Classification with Multi-task Learning and Attention Mechanisms", International Journal of Information Technology and Computer Science (IJITCS), Vol. 17, No. 6, pp. 29-51, 2025. DOI: 10.5815/ijitcs.2025.06.02