Acoustic Modeling of Bangla Words using Deep Belief Network

Full Text (PDF, 807 KB), pp. 19–27



Mahtab Ahmed 1, Pintu Chandra Shill 1, Kaidul Islam 1, M. A. H. Akhand 1

1. Dept. of Computer Science and Engineering, Khulna University of Engineering & Technology (KUET), Khulna-9203, Bangladesh

* Corresponding author.


Received: 4 May 2015 / Revised: 25 Jun. 2015 / Accepted: 7 Aug. 2015 / Published: 8 Sep. 2015

Index Terms

Speech Recognition, Hidden Markov Model, Gaussian Mixture Model, Deep Belief Network, Restricted Boltzmann Machine


Abstract

Recently, speech recognition (SR) has drawn great attention from the research community owing to its importance in human-computer interaction and its scope in many important tasks. In an SR system, acoustic modeling (AM) is a crucial component that contains a statistical representation of every distinct sound making up a word. A number of prominent SR methods based on Deep Belief Networks (DBNs) and other techniques are available for English and Russian, but far fewer exist for other major languages such as Bangla. This paper investigates acoustic modeling of Bangla words using a DBN combined with a Hidden Markov Model (HMM) for Bangla SR. In this study, Mel Frequency Cepstral Coefficients (MFCCs) are used to accurately represent the shape of the vocal tract, which manifests itself in the envelope of the short-time power spectrum. A DBN is then trained on these feature vectors to estimate each of the phoneme states. Afterwards, an enhanced gradient is used to fine-tune the model parameters for greater accuracy. In addition, the performance of Restricted Boltzmann Machine (RBM) training is improved by using an adaptive learning rate, weight decay and a momentum factor. A total of 840 utterances (20 utterances from each of 42 speakers) are used in this study. The proposed method shows satisfactory recognition accuracy and outperforms other prominent existing methods.
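The RBM training refinements mentioned in the abstract (weight decay and a momentum factor on the gradient update) can be sketched as a single contrastive-divergence (CD-1) step in NumPy. This is a minimal illustrative sketch, not the paper's actual implementation; the layer sizes, learning rate and hyperparameter values below are assumptions chosen for demonstration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, vb, hb, v0, rng, lr=0.01, momentum=0.5,
             weight_decay=1e-4, dW_prev=None):
    """One CD-1 update for a Bernoulli RBM with momentum and L2 weight
    decay. W: (n_visible, n_hidden) weights; vb, hb: visible/hidden
    biases; v0: (batch, n_visible) data. Hyperparameters are illustrative."""
    if dW_prev is None:
        dW_prev = np.zeros_like(W)
    # Positive phase: hidden probabilities given the data, then sample
    h0_prob = sigmoid(v0 @ W + hb)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one step of Gibbs sampling back to the visible layer
    v1_prob = sigmoid(h0 @ W.T + vb)
    h1_prob = sigmoid(v1_prob @ W + hb)
    # Gradient estimate: <v h>_data - <v h>_model, averaged over the batch
    grad = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
    # Momentum smooths the update; weight decay penalizes large weights
    dW = momentum * dW_prev + lr * (grad - weight_decay * W)
    W = W + dW
    vb = vb + lr * np.mean(v0 - v1_prob, axis=0)
    hb = hb + lr * np.mean(h0_prob - h1_prob, axis=0)
    return W, vb, hb, dW
```

An adaptive learning rate, as the abstract describes, would replace the fixed `lr` with a per-epoch or per-weight schedule; the returned `dW` is fed back in as `dW_prev` on the next mini-batch to carry the momentum term forward.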

Cite This Paper

Mahtab Ahmed, Pintu Chandra Shill, Kaidul Islam, M. A. H. Akhand, "Acoustic Modeling of Bangla Words using Deep Belief Network," IJIGSP, vol. 7, no. 10, pp. 19–27, 2015. DOI: 10.5815/ijigsp.2015.10.03

