Automatic Environmental Sound Recognition (AESR) Using Convolutional Neural Network




Md. Rayhan Ahmed 1,*, Towhidul Islam Robin 1, Ashfaq Ali Shafin 1

1. Department of Computer Science and Engineering, Stamford University Bangladesh, Dhaka, Bangladesh

* Corresponding author.


Received: 25 Mar. 2020 / Revised: 12 Apr. 2020 / Accepted: 8 May 2020 / Published: 8 Oct. 2020

Index Terms

AESR, CNN, Log-Mel Spectrogram, MFCC, Adam, RAdam, ReLU, Image Classification


Automatic Environmental Sound Recognition (AESR) is an active research topic in pattern recognition. A short audio clip of a sound event can be converted into a spectrogram image and fed to a Convolutional Neural Network (CNN). Features extracted from that image are then used to classify environmental sound events such as sea waves, crackling fire, dog barking, lightning, and rain. We use the log-mel spectrogram as the auditory feature for training our six-layer stacked CNN model. We evaluated the model on three publicly available datasets and achieved a classification accuracy of 92.9% on UrbanSound8K, 91.7% on ESC-10, and 65.8% on ESC-50. These results improve on several previous works while using only a stacked CNN, and they indicate that the log-mel spectrogram is a more effective feature for sound recognition than Mel Frequency Cepstral Coefficients (MFCC), wavelet transformation, or the raw waveform. We also experimented with the recently published Rectified Adam (RAdam) optimizer, and we present a comparative analysis of the Adaptive Moment Estimation (Adam) and RAdam optimizers for training the model to classify environmental sounds with an image recognition architecture.
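
The audio-to-image step described in the abstract can be illustrated with a minimal NumPy sketch of a log-mel spectrogram. This is not the authors' implementation: the helper names (hz_to_mel, mel_to_hz, mel_filterbank, log_mel_spectrogram), the HTK-style mel formula, and the parameter values (n_fft=1024, hop=512, n_mels=60, 22,050 Hz sample rate) are assumptions chosen for illustration, and a synthetic 440 Hz tone stands in for a real recording.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert Hz to mel (HTK-style formula; an assumed convention)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    )
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lower, center, upper = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(signal, sample_rate, n_fft=1024, hop=512, n_mels=60):
    """Return a (n_mels, n_frames) log-mel spectrogram in decibels."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (signal.size - n_fft) // hop
    window = np.hanning(n_fft)
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Power spectrum of each frame, shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # Log compression; the small floor avoids log(0) on silent frames.
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10)).T


# Synthetic one-second 440 Hz tone used in place of a real audio clip.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 440.0 * t)

spec = log_mel_spectrogram(tone, sample_rate)
print(spec.shape)  # (mel bands, time frames)
```

The resulting two-dimensional array can be treated as a single-channel image, for example by adding a channel axis with spec[np.newaxis, ...], before it is passed to a convolutional network.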

Cite This Paper

Md. Rayhan Ahmed, Towhidul Islam Robin, Ashfaq Ali Shafin, "Automatic Environmental Sound Recognition (AESR) Using Convolutional Neural Network", International Journal of Modern Education and Computer Science (IJMECS), Vol.12, No.5, pp. 41-54, 2020. DOI: 10.5815/ijmecs.2020.05.04

