A Robust Hybrid Deep Learning Model for Multiclass Depression Classification from Speech Audio


Author(s)

Neny Sulistianingsih 1,*, Galih Hendro Martono 1

1. Computer Science, Bumigora University, Mataram, Indonesia

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2026.02.08

Received: 17 Jun. 2025 / Revised: 15 Nov. 2025 / Accepted: 2 Feb. 2026 / Published: 8 Apr. 2026

Index Terms

Depression Detection, Speech Emotion Recognition, Hybrid Deep Learning, CNN, Transformer, GRU, BiLSTM, Mental Health Assessment

Abstract

Depression remains one of the most prevalent and underdiagnosed mental health disorders globally, necessitating scalable, objective, and non-invasive diagnostic tools. Speech, as a rich biomarker of emotional and psychological states, offers a promising avenue for automated depression detection. This study proposes a robust hybrid deep learning framework that integrates Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), Bidirectional Long Short-Term Memory (BiLSTM), and Transformer architectures to classify depression severity into three levels: normal, mild, and severe. Using a curated multimodal dataset comprising 400 labeled audio recordings, we extract comprehensive acoustic features, including MFCC, Chroma, Spectrogram, Contrast, and Tonnetz representations. Models are evaluated using precision, recall, F1-score, and accuracy. Experimental results show that the proposed hybrid models outperform traditional architectures, achieving up to 99% accuracy and strong generalization across all classes. This study demonstrates the potential of attention-enhanced hybrid architectures in mental health assessment and provides a foundation for future deployment in clinical and real-world settings. Future work includes multimodal fusion with EEG data and the implementation of explainable AI for clinical interpretability.

Cite This Paper

Neny Sulistianingsih, Galih Hendro Martono, "A Robust Hybrid Deep Learning Model for Multiclass Depression Classification from Speech Audio", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol.18, No.2, pp. 124-136, 2026. DOI: 10.5815/ijigsp.2026.02.08
