IJITCS Vol. 17, No. 6, 8 Dec. 2025
Keywords: Human Activity Recognition, Lightweight 3DCNN, Bidirectional LSTM, Decision Level Fusion, Depth Map
Over the past two decades, the automatic recognition of human activities has been a prominent research field. The task becomes more challenging when dealing with multiple modalities, diverse activities, and varied scenarios. This paper therefore addresses activity recognition by fusing two modalities, RGB and depth maps. To achieve this, two distinct lightweight 3D Convolutional Neural Networks (3DCNNs) are employed to extract spatio-temporal features from the RGB and depth sequences separately. A Bidirectional LSTM (Bi-LSTM) network is then trained on the extracted spatio-temporal features, generating an activity score for each sequence in both the RGB and depth streams. Decision-level fusion is subsequently applied to combine the scores obtained in the previous step. The novelty of the proposed work lies in the lightweight 3DCNN feature extractor, designed to capture both spatial and temporal features from the RGB-D video sequences; this improves overall efficiency while reducing computational complexity. Finally, activities are recognized from the fused scores. To assess its overall efficiency, the proposed lightweight 3DCNN-BiLSTM method is validated on the 3D benchmark dataset UTKinect-Action3D, achieving an accuracy of 96.72%. The experimental findings confirm the effectiveness of the proposed representation over existing methods.
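As a concrete illustration of the pipeline described in the abstract, the following PyTorch sketch wires together the three stages: a lightweight 3DCNN per modality that turns a video clip into a per-timestep feature sequence, a Bi-LSTM head that produces class scores for each stream, and decision-level fusion of the two score vectors. All names here (`Lightweight3DCNN`, `StreamClassifier`, `FusionHAR`), the layer widths, the 16-frame clip shape, and the score-averaging fusion rule are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of a two-stream lightweight-3DCNN + Bi-LSTM model with
# decision-level fusion. Layer sizes and the averaging rule are assumptions.
import torch
import torch.nn as nn

class Lightweight3DCNN(nn.Module):
    """Small 3D-conv stack mapping a clip (B, C, T, H, W) to a
    per-timestep feature sequence (B, T', D) for the Bi-LSTM."""
    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # pool space only
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),        # pool space and time
            nn.Conv3d(32, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),         # keep temporal axis
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        f = self.features(clip)                          # (B, D, T', 1, 1)
        return f.squeeze(-1).squeeze(-1).transpose(1, 2) # (B, T', D)

class StreamClassifier(nn.Module):
    """One modality stream: 3DCNN features followed by a Bi-LSTM
    that emits a class-score vector per clip."""
    def __init__(self, in_channels: int, num_classes: int,
                 feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.cnn = Lightweight3DCNN(in_channels, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        seq, _ = self.lstm(self.cnn(clip))   # (B, T', 2 * hidden)
        return self.fc(seq[:, -1])           # scores from last timestep

class FusionHAR(nn.Module):
    """Decision-level fusion: average the softmax scores of the streams."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.rgb_stream = StreamClassifier(3, num_classes)    # RGB clips
        self.depth_stream = StreamClassifier(1, num_classes)  # depth maps

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        p_rgb = torch.softmax(self.rgb_stream(rgb), dim=1)
        p_depth = torch.softmax(self.depth_stream(depth), dim=1)
        return (p_rgb + p_depth) / 2         # fused class probabilities

# Example: batch of two 16-frame 112x112 clips for each modality.
model = FusionHAR(num_classes=10)
rgb = torch.randn(2, 3, 16, 112, 112)
depth = torch.randn(2, 1, 16, 112, 112)
print(model(rgb, depth).shape)               # torch.Size([2, 10])
```

Weighted averaging or a product rule are common drop-in alternatives to the simple mean used for the fusion step here.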
Vijay Singh Rana, Ankush Joshi, Kamal Kant Verma, "Lightweight 3DCNN-BiLSTM Model for Human Activity Recognition using Fusion of RGBD Video Sequences", International Journal of Information Technology and Computer Science (IJITCS), Vol. 17, No. 6, pp. 176-193, 2025. DOI: 10.5815/ijitcs.2025.06.10