IJITCS Vol. 18, No. 1, 8 Feb. 2026
Index Terms: Speech Emotion Recognition, Machine Learning, Multimodal, Ensemble Model
In human-computer interaction, recognizing emotion from speech and understanding the full context of spoken communication is challenging: emotion is inherently imprecise, and identifying it requires detailed analysis of the speech signal. In speech emotion recognition, a range of well-established speech analysis and classification techniques has been used to extract emotions from audio signals. Despite considerable progress in recent years, many studies still overlook the semantic information carried by speech. Our study proposes a novel approach that captures both the paralinguistic and semantic aspects of the speech signal by combining state-of-the-art machine learning techniques with carefully crafted feature extraction strategies. We address the task through feature engineering, extracting meaningful audio features such as energy, pitch, harmonics, pauses, central moments, chroma, zero-crossing rate, and Mel-frequency cepstral coefficients (MFCCs). These features capture acoustic patterns that help the model learn emotional cues more effectively. The work is conducted primarily on the IEMOCAP dataset, a large and well-annotated emotional speech corpus. Framing the task as a multi-class classification problem, we extract 15 features from the audio signal and use them to train five machine learning classifiers. We also incorporate text-domain features to reduce ambiguity in emotional interpretation, and we evaluate performance using accuracy, precision, recall, and F-score across all experiments.
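To make the feature-extraction pipeline concrete, the sketch below illustrates how a few of the acoustic features named in the abstract could be computed per utterance and fed to a multi-class classifier. It is not the authors' implementation: the use of librosa and scikit-learn, the 16 kHz sample rate, the 13 MFCCs, and the RandomForest model are illustrative assumptions.

import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def extract_features(path):
    # Load the waveform as mono at an assumed 16 kHz sample rate
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # MFCCs
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)    # chroma
    zcr = librosa.feature.zero_crossing_rate(y).mean()               # zero-crossing rate
    energy = librosa.feature.rms(y=y).mean()                         # short-time energy
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)                    # pitch contour
    # Summarize frame-level descriptors into one fixed-length vector
    return np.hstack([mfcc, chroma, zcr, energy, f0.mean()])

# Given X_train/X_test built by stacking extract_features() over the corpus
# and y_train/y_test holding the emotion labels, one possible classifier:
# clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
# pred = clf.predict(X_test)
# print(accuracy_score(y_test, pred))
# print(precision_recall_fscore_support(y_test, pred, average='weighted'))

In practice, such per-utterance vectors would be stacked over the whole corpus and split into training and test sets before fitting, with the commented lines showing how the reported metrics (accuracy, precision, recall, F-score) could then be computed.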
Rania Ahmed, Mahmoud Hussein, Arabi Keshk, "Ensemble Fusion Model for Enhanced Speech Emotion Recognition and Confusion Resolution", International Journal of Information Technology and Computer Science (IJITCS), Vol.18, No.1, pp.84-99, 2026. DOI:10.5815/ijitcs.2026.01.05