Attention-Based Deep Learning Model for Image Captioning: A Comparative Study


Phyu Phyu Khaing 1,*, May The` Yu 1

1. University of Computer Studies, Mandalay, Myanmar

* Corresponding author.


Received: 10 Mar. 2019 / Revised: 25 Apr. 2019 / Accepted: 22 May 2019 / Published: 8 Jun. 2019

Index Terms

Attention Mechanism, Deep Learning Model, Image Captioning


Image captioning is the task of generating a textual description of an image. It is a problem in computer vision and image processing, both areas of artificial intelligence (AI), and it bridges vision and natural language processing. Image captioning has two parts: sentence-based generation and single-word generation. Deep learning has become the main driver of many new applications and has also become much more accessible in terms of the learning curve. Applying deep learning models to image captioning can improve the accuracy of the generated descriptions, and attention mechanisms are an increasingly common component of these models. This paper presents a comparative study of attention-based deep learning models for image captioning. It gives a basic analysis of the techniques in terms of performance, advantages, and weaknesses. It also discusses the datasets used for image captioning and the evaluation metrics used to test accuracy.

Cite This Paper

Phyu Phyu Khaing, May The` Yu, "Attention-Based Deep Learning Model for Image Captioning: A Comparative Study", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 11, No. 6, pp. 1-8, 2019. DOI: 10.5815/ijigsp.2019.06.01

