Weighted Late Fusion based Deep Attention Neural Network for Detecting Multi-Modal Emotion

Author(s)

Srinivas P. V. V. S. 1,*, Shaik Nazeera Khamar 2, Nohith Borusu 2, Mohan Guru Raghavendra Kota 2, Harika Vuyyuru 2, Sampath Patchigolla 2

1. Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation (KLEF), Vaddeswaram, Guntur District 522302, India

2. Department of Computer Science and Information Technology, K L University, Vaddeswaram, Guntur District 522302, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2026.01.07

Received: 9 Jul. 2025 / Revised: 4 Oct. 2025 / Accepted: 20 Nov. 2025 / Published: 8 Feb. 2026

Index Terms

Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), 3D-Convolutional Neural Network (3D-CNN), Mel-Frequency Cepstral Coefficients (MFCCs)

Abstract

In affective computing research, multi-modal emotion detection has gained popularity as a way to boost recognition robustness and overcome the constraints of processing a single type of data. Human emotions are characterized through a variety of modalities, including physiological indicators, facial expressions, and neuroimaging techniques. Here, a novel deep attention mechanism is used for detecting multi-modal emotions. Initially, the data are collected as audio and video features. For dimensionality reduction, the audio features are extracted using the Constant-Q chromagram and Mel-Frequency Cepstral Coefficients (MFCCs). After extraction, audio feature generation is carried out by a Convolutional Dense Capsule Network (Conv_DCN). For the video data, key frame extraction is carried out using enhanced spatial-temporal and second-order Gaussian kernels; second-order Gaussian kernels are a powerful tool for extracting features from video data and converting them into a format suitable for image-based analysis. Next, for video feature generation, DenseNet-169 is used. Finally, all the extracted features are fused, and emotions are detected using a Weighted Late Fusion Deep Attention Neural Network (WLF_DAttNN). The method is implemented in Python and achieves an accuracy of 97% on the RAVDESS dataset and 96% on the CREMA-D dataset.
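As a rough illustration of the pipeline summarized above, the sketch below is not the authors' implementation: it assumes librosa for the Constant-Q chromagram and MFCC features and PyTorch for an attention-weighted late-fusion head. The linear modality heads stand in for Conv_DCN and DenseNet-169, and the layer sizes, class count, and file path are illustrative assumptions.

# Minimal sketch of audio feature extraction plus a weighted late-fusion
# attention head. NOT the authors' code; encoders, sizes, and weighting
# scheme are illustrative assumptions.
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_audio_features(path, sr=22050, n_mfcc=40):
    """Constant-Q chromagram + MFCCs, averaged over time into one vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # (n_mfcc, T)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)                  # (12, T)
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1)])  # (n_mfcc+12,)

class WeightedLateFusion(nn.Module):
    """Attention-weighted late fusion of per-modality emotion logits."""
    def __init__(self, audio_dim, video_dim, n_classes=8):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, n_classes)  # stands in for Conv_DCN
        self.video_head = nn.Linear(video_dim, n_classes)  # stands in for DenseNet-169
        self.attn = nn.Linear(audio_dim + video_dim, 2)    # learns modality weights

    def forward(self, audio_feat, video_feat):
        logits_a = self.audio_head(audio_feat)
        logits_v = self.video_head(video_feat)
        w = torch.softmax(self.attn(torch.cat([audio_feat, video_feat], dim=-1)), dim=-1)
        # Weighted late fusion: combine per-modality decisions with learned weights.
        return w[..., 0:1] * logits_a + w[..., 1:2] * logits_v

# Usage example; "clip.wav" is a placeholder path, and the random video vector
# stands in for a pooled DenseNet-169 feature (1664-dimensional).
audio_vec = torch.tensor(extract_audio_features("clip.wav"), dtype=torch.float32).unsqueeze(0)
video_vec = torch.randn(1, 1664)
model = WeightedLateFusion(audio_dim=audio_vec.shape[1], video_dim=1664)
print(model(audio_vec, video_vec).shape)  # torch.Size([1, 8])

In this sketch the attention layer produces per-sample weights for the audio and video decisions, which is the core idea behind weighting a late (decision-level) fusion.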

Cite This Paper

Srinivas P. V. V. S., Shaik Nazeera Khamar, Nohith Borusu, Mohan Guru Raghavendra Kota, Harika Vuyyuru, Sampath Patchigolla, "Weighted Late Fusion based Deep Attention Neural Network for Detecting Multi-Modal Emotion", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol.18, No.1, pp. 106-127, 2026. DOI:10.5815/ijigsp.2026.01.07
