IJISA Vol. 17, No. 6, 8 Dec. 2025
Keywords: Emotion, Emotion Intensity, Multi-modal, Late Fusion, MFCC, Chroma
This paper presents an ensemble model for determining the intensity of emotions manifested in audio data. An emotion denotes a mental state or thought process of the human mind that presents a recognizable pattern; emotional arousal, for instance, corresponds closely to its manifestation in vocal, facial and/or bodily signals. We propose a stacking, late-fusion approach in which the best experimental outcomes from two base models, built from Random Forest and Extreme Gradient Boosting (XGBoost), are combined using simple majority voting. The RAVDESS audio dataset, a public, gender-balanced dataset built at Ryerson University, Canada, for the purpose of emotion research, was used; 80% of the dataset was used for training and 20% for testing. Two features, MFCC and Chroma, were supplied to the base models in a series of experimental setups, and the outcomes were evaluated using a confusion matrix, precision, recall and F1-score. The results were then compared against two state-of-the-art works on the KBES and RAVDESS datasets. The approach yielded an overall classification accuracy of 93%.
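The paper itself does not include code, but the pipeline it describes maps onto standard Python tooling. The following is a minimal sketch, assuming librosa for MFCC and Chroma extraction and scikit-learn's VotingClassifier for the hard (majority) vote; the feature dimensions, hyper-parameters (n_mfcc=40, n_estimators=200) and the features.npy/labels.npy files are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the late-fusion pipeline described above.
# Assumptions (not from the paper): librosa for feature extraction and
# pre-computed feature/label arrays on disk.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier

def extract_features(path: str) -> np.ndarray:
    """Load one audio clip and return a fixed-length MFCC + Chroma vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)    # shape (40, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # shape (12, frames)
    # Average over time so every clip maps to one 52-dimensional vector.
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1)])

# X: per-clip feature matrix, y: integer intensity labels (hypothetical files,
# built elsewhere by walking the RAVDESS folders and calling extract_features).
X, y = np.load("features.npy"), np.load("labels.npy")

# 80% training / 20% testing split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Late fusion: the two base models vote on each test clip ("hard" voting
# in scikit-learn is simple majority voting).
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200)),
                ("xgb", XGBClassifier(eval_metric="mlogloss"))],
    voting="hard")
ensemble.fit(X_train, y_train)

pred = ensemble.predict(X_test)
print(confusion_matrix(y_test, pred))       # per-class error structure
print(classification_report(y_test, pred))  # precision, recall, F1-score
```

One caveat of this sketch: with only two voters, every disagreement is a tie, which scikit-learn resolves toward the lower class index; this is one reason the paper's step of first selecting the best experimental outcome from each base model before fusing is significant.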
Simon Kipyatich Kiptoo, Kennedy Ogada, Tobias Mwalili, "Determining Emotion Intensities from Audio Data Using Ensemble Models: A Late Fusion Approach", International Journal of Intelligent Systems and Applications (IJISA), Vol.17, No.6, pp.44-57, 2025. DOI:10.5815/ijisa.2025.06.04
[1] Human Emotions and their Manifestation: https://www.all-about-psychology.com/human-emotions-and-their-manifestation.html/. Accessed 4 December 2024.
[2] Kendra Cherry, "Basic Emotions and Their Effect on Human Behaviour". https://www.verywellmind.com/an-overview-of-the-types-of-emotions-4163976, 2019.
[3] Poria S., Majumder N., Mihalcea R., and Hovy E., "Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances". Michigan Institute for Data Science, US, 2019.
[4] Udahemuka G., Djouani K., and Kurien A. M., "Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review". Applied Sciences, 14(17), 2024.
[5] Joshi S. and Joshi F., "Human Emotion Classification Based on EEG Signals Using Recurrent Neural Network and KNN". doi:10.48550/arXiv.2205.08419, 2022.
[6] Md Shad Akhtar, Asif Ekbal, and Erik Cambria, "How Intense Are You? Predicting Intensities of Emotions and Sentiments Using Stacked Ensemble". https://sentic.net/predicting-intensities-of-emotions-and-sentiments.pdf. Accessed 24 December 2024.
[7] Sultana S., Iqbal M. Z., Selim M. R., Rashid M. M., and Rahman M. S., "Bangla Speech Emotion Recognition and Cross-Lingual Study Using Deep CNN and BLSTM Networks". IEEE Access, vol. 10, 2022.
[8] Md. Masum Billah, Md. Likhon Sarker, M. A. H. Akhand, and Md Abdus Samad Kamal, "Emotion Recognition with Intensity Level from Bangla Speech using Feature Transformation and Cascaded Deep Learning Model". International Journal of Advanced Computer Science and Applications (IJACSA), vol. 15, no. 4, 2024.
[9] Islam M. R., Akhand M. A. H., Kamal M. A. S., and Yamada K., "Recognition of Emotion with Intensity from Speech Signal Using 3D Transformed Feature and Deep Learning". Electronics, 11, 2362, 2022.
[10] Mustaqeem and Kwon S., "CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network". Mathematics, 8(12), 2133, 2020. https://doi.org/10.3390/math8122133
[11] Zhao Z., Li Q., Zhang Z., Cummins N., Wang H., Tao J., and Schuller B. W., "Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition". Neural Networks, 141, 52–60, 2021.
[12] Phan T.-T.-H., Nguyen Doan D., Nguyen H.-D., Nguyen V., and Pham-Hong T., "Investigation on New Mel Frequency Cepstral Coefficients Features and Hyper-parameters Tuning Technique for Bee Sound Recognition". Soft Computing, 2022. https://doi.org/10.1007/s00500-022-07596-6
[13] Alang Rashid N. K., Alim S. A., and Hashim S. W., "Receiver operating characteristics measure for the recognition of stuttering dysfluencies using line spectral frequencies". IIUM Engineering Journal, 18(1), 193–200, 2017.
[14] Ayush Kumar Shah and Araju Nepal, "Chroma Feature Extraction". Department of Computer Science and Engineering, School of Engineering, Kathmandu University, Nepal. https://www.academia.edu/42216949/Chroma_Feature_Extraction. Accessed 24 December 2024.
[15] Livingstone S. R. and Russo F. A., "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English". PLoS ONE, 13(5), e0196391, 2018. https://doi.org/10.1371/journal.pone.0196391. Retrieved 2 July 2022.
[16] Patricia J. Bota, Chen Wang, Ana L. N. Fred, and Hugo Placido da Silva, "A Review, Current Challenges, and Future Possibilities on Emotion Recognition Using Machine Learning and Physiological Signals". IEEE Access, 2019.
[17] Liu D., Wang Z., Wang L., and Chen L., "Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning". School of Information Engineering, Shandong Youth University of Political Science, Jinan, China, 2021.