Aggressive Action Estimation: A Comprehensive Review on Neural Network Based Human Segmentation and Action Recognition


A. F. M. Saifuddin Saif 1,*, Md. Akib Shahriar Khan 1, Abir Mohammad Hadi 1, Rahul Prashad Karmoker 1, Julian Gomes 1

1. American International University – Bangladesh (AIUB), Dhaka, Bangladesh

* Corresponding author.


Received: 11 Oct. 2018 / Revised: 5 Nov. 2018 / Accepted: 17 Dec. 2018 / Published: 8 Jan. 2019

Index Terms

Capsule Network, Neural Network, Image Segmentation, Flow Estimation, Action Recognition


Human action recognition has been a widely discussed topic since the term machine vision was coined. With the advent of neural networks and deep learning methods, various architectures have been proposed to address the problem in specific contexts. Convolutional neural networks (CNNs) have recently been the go-to architecture for image segmentation, flow estimation and action recognition. Because the problem is itself composed of several sub-problems, such as frame segmentation, spatial and temporal feature extraction, motion modeling and action classification, some of the methods reviewed in this paper address individual sub-problems, while others propose a single architecture for the action recognition problem as a whole. Despite their success, CNNs have drawbacks in their pooling methods. CapsNet, on the other hand, uses a squashing function to determine the activation. It also captures spatiotemporal information with normalized vector maps, whereas CNN-based methods extract separate feature maps for spatial and temporal information and later combine them in a fusion layer. The critical review of papers provided in this work can contribute significantly to addressing the human action recognition problem as a whole.
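
As a brief illustration of the squashing function mentioned above, the following is a minimal sketch of the non-linearity described in the dynamic routing formulation of capsule networks, written in plain Python. It is not code from the reviewed papers; the function names `squash` and `length` and the `eps` stabilizer are assumptions made for this example. The function rescales a capsule's raw output vector so that its length falls between 0 and 1 while its direction is preserved, which lets the length be read as an activation.

```python
import math


def squash(s, eps=1e-9):
    """Rescale vector s to length ||s||^2 / (1 + ||s||^2), keeping its direction.

    eps is a small stabilizer (an assumption of this sketch) that avoids
    division by zero when s is the zero vector.
    """
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector, read here as the capsule's activation."""
    return math.sqrt(sum(x * x for x in v))


# A short input vector is shrunk towards zero length,
# while a long one is pushed to a length just below 1.
short = squash([0.1, 0.0])
long_vec = squash([30.0, 40.0])
print(length(short), length(long_vec))
```

In this sketch, the activation is a property of the whole output vector rather than of a single pooled scalar, which is the contrast with CNN pooling that the abstract draws.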

Cite This Paper

A. F. M. Saifuddin Saif, Md. Akib Shahriar Khan, Abir Mohammad Hadi, Rahul Prashad Karmoker, Joy Julian Gomes, "Aggressive Action Estimation: A Comprehensive Review on Neural Network Based Human Segmentation and Action Recognition", International Journal of Education and Management Engineering (IJEME), Vol.9, No.1, pp.9-19, 2019. DOI: 10.5815/ijeme.2019.01.02

