Enhanced Deep Hierarchal GRU & BILSTM using Data Augmentation and Spatial Features for Tamil Emotional Speech Recognition

PP. 45-63



J. Bennilo Fernandes 1,*, Kasiprasad Mannepalli 1

1. Koneru Lakshmaiah Education Foundation, Guntur, Andhra Pradesh

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2022.03.03

Received: 7 May 2021 / Revised: 11 Jul. 2021 / Accepted: 28 Aug. 2021 / Published: 8 Jun. 2022

Index Terms

Data Augmentation, Spatial Features, LSTM, BILSTM, GRU, Emotion Recognition.


The Recurrent Neural Network (RNN) is well suited to emotional speech recognition because it models the continuously time-shifting nature of the speech signal. Although RNNs give good results, and GRU, LSTM and BILSTM address the gradient problem, overfitting still reduces their efficiency. In this paper, five deep learning architectures are therefore designed to overcome these issues using data augmentation and spatial features. The five architectures, Enhanced Deep Hierarchal LSTM & GRU (EDHLG), EDHBG, EDHGL, EDHGB and EDHGG, are built with dropout layers. The representation learned by the LSTM layer is passed as input to the GRU layer for deeper learning. This reduces the gradient problem and increases the accuracy for each emotion. To raise accuracy further, spatial features are concatenated with MFCC. In the experimental evaluation on the Tamil emotional dataset, all five models performed well: EDHLG reaches 93.12% accuracy, EDHGL 92.56%, EDHBG 95.42%, EDHGB 96% and EDHGG 94%. By comparison, the average accuracy of a single LSTM layer is 74%, and that of a single BILSTM layer is 77%. EDHGB outperforms almost all other systems, with an optimal-system accuracy of 94.27% and a maximum overall accuracy of 95.99%. On the Tamil emotion data, the emotional states happy, fearful, angry, sad and neutral are predicted with 100% accuracy, while disgust reaches 94% and boredom 82%. The training time and evaluation time of EDHGB are 4.43 min and 0.42 min respectively, lower than those of the other models. Thus, by varying the arrangement of LSTM, BILSTM and GRU layers, an extensive experimental analysis on the Tamil dataset is carried out; EDHGB is superior to the other models and gains around 26% in efficiency over the basic LSTM and BILSTM models.
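The hierarchical idea described above (frame-level MFCC and spatial features concatenated, one recurrent layer feeding its output sequence to a second recurrent layer, with dropout between layers) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it stacks two GRU layers with untrained random weights, forward pass only, and every concrete value (13 MFCC coefficients, 4 spatial features per frame, hidden sizes 32 and 16, dropout rate 0.3) is an assumption chosen for illustration. Only the seven emotion classes are taken from the abstract. The names `GRULayer`, `dropout` and `hierarchical_forward` are hypothetical helpers.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRULayer:
    """Minimal GRU layer (forward pass only) that returns the full hidden sequence."""

    def __init__(self, input_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(hidden_dim)
        # Index 0: update gate z, index 1: reset gate r, index 2: candidate state.
        self.W = rng.uniform(-scale, scale, (3, input_dim, hidden_dim))
        self.U = rng.uniform(-scale, scale, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))
        self.hidden_dim = hidden_dim

    def forward(self, x):
        # x has shape (time, input_dim); the result has shape (time, hidden_dim).
        h = np.zeros(self.hidden_dim)
        outputs = []
        for x_t in x:
            z = sigmoid(x_t @ self.W[0] + h @ self.U[0] + self.b[0])
            r = sigmoid(x_t @ self.W[1] + h @ self.U[1] + self.b[1])
            h_cand = np.tanh(x_t @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
            h = (1.0 - z) * h + z * h_cand
            outputs.append(h)
        return np.stack(outputs)


def dropout(x, rate, rng, training):
    """Inverted dropout: active only in training mode."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


def hierarchical_forward(mfcc, spatial, layer1, layer2, w_out, rng,
                         training=False, rate=0.3):
    # Concatenate MFCC and spatial features frame by frame.
    features = np.concatenate([mfcc, spatial], axis=1)
    # The first recurrent layer's output sequence is the second layer's input.
    h1 = dropout(layer1.forward(features), rate, rng, training)
    h2 = dropout(layer2.forward(h1), rate, rng, training)
    # Classify from the last time step with a softmax over emotion classes.
    logits = h2[-1] @ w_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


EMOTIONS = ["happy", "fearful", "angry", "sad", "neutral", "disgust", "boredom"]

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(50, 13))     # assumed: 50 frames, 13 MFCC coefficients
spatial = rng.normal(size=(50, 4))   # assumed: 4 spatial features per frame
layer1 = GRULayer(13 + 4, 32, rng)   # assumed hidden size
layer2 = GRULayer(32, 16, rng)       # assumed hidden size
w_out = rng.normal(scale=0.1, size=(16, len(EMOTIONS)))

probs = hierarchical_forward(mfcc, spatial, layer1, layer2, w_out, rng)
print(dict(zip(EMOTIONS, np.round(probs, 3))))
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how data flows through the two stacked layers. Swapping either `GRULayer` for an LSTM or bidirectional LSTM layer would give the other layer orderings named in the abstract.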

Cite This Paper

J. Bennilo Fernandes, Kasiprasad Mannepalli, "Enhanced Deep Hierarchal GRU & BILSTM using Data Augmentation and Spatial Features for Tamil Emotional Speech Recognition", International Journal of Modern Education and Computer Science (IJMECS), Vol.14, No.3, pp. 45-63, 2022. DOI: 10.5815/ijmecs.2022.03.03

