Accelerating Training of Deep Neural Networks on GPU using CUDA

Full Text (PDF, 272KB), pp. 18-26



D.T.V. Dharmajee Rao 1,*, K.V. Ramana 2

1. Aditya Institute of Technology and Management, Tekkali-532201, Srikakulam, Andhra Pradesh, India

2. JNTUK College of Engineering, JNTUK University, Kakinada - 533003, Andhra Pradesh, India

* Corresponding author.


Received: 15 Oct. 2018 / Revised: 16 Nov. 2018 / Accepted: 13 Dec. 2018 / Published: 8 May 2019

Index Terms

Deep Neural Networks, Matrix multiplication, CUDA, Many-core GPU systems


Abstract

The development of fast and efficient training algorithms for Deep Neural Networks has been a subject of interest over the past few years because the biggest drawback of Deep Neural Networks is their enormous computational cost: training their parameters consumes a large amount of time. This has motivated several researchers to focus on recent advances in hardware architectures and parallel programming models for accelerating the training of Deep Neural Networks. We revisited the concepts and mechanisms of typical Deep Neural Network training algorithms, such as the Backpropagation Algorithm and the Boltzmann Machine Algorithm, and observed that matrix multiplication constitutes the major portion of the workload of the training process, because it is carried out a huge number of times during training. With the advent of many-core GPU technologies, matrix multiplication can be performed very efficiently in parallel, so training a Deep Neural Network no longer consumes as much time as it did a few years ago. CUDA is one of the high-performance parallel programming models for exploiting the capabilities of modern many-core GPU systems. In this paper, we propose to modify the Backpropagation Algorithm and the Boltzmann Machine Algorithm to use CUDA parallel matrix multiplication and test them on a many-core GPU system. We find that the proposed methods train Deep Neural Networks considerably faster than the classical ones.
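The reason matrix multiplication maps so well onto many-core GPUs is that each element of the product C = A × B can be computed independently, so one GPU thread can be assigned per output element. A minimal, naive CUDA kernel illustrating this idea (a sketch of the general technique, not the authors' implementation from the paper) might look like:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element C[row][col] of C = A * B
// for square N x N matrices stored in row-major order.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 256;
    size_t bytes = (size_t)N * N * sizeof(float);
    float *A, *B, *C;

    // Unified memory keeps the example short; real training code would
    // typically manage explicit host/device transfers.
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    // Launch one thread per output element in 16x16 blocks.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matmul<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();

    // Each element sums N products of 1.0 * 2.0, i.e. 512 for N = 256.
    printf("C[0] = %f\n", C[0]);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

In a training step, the same launch pattern would be applied to the weight-by-activation products of the forward pass and the transposed products of the backward pass; production systems usually replace such a naive kernel with a tiled shared-memory kernel or a cuBLAS call.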

Cite This Paper

D.T.V. Dharmajee Rao, K.V. Ramana, "Accelerating Training of Deep Neural Networks on GPU using CUDA", International Journal of Intelligent Systems and Applications (IJISA), Vol.11, No.5, pp.18-26, 2019. DOI: 10.5815/ijisa.2019.05.03

