Optimized Image Captioning: Hybrid Transformers Vision Transformers and Convolutional Neural Networks: Enhanced with Beam Search

PP. 53-61



Sushma Jaiswal 1,*, Harikumar Pallthadka 2, Rajesh P. Chinchewadi 2, Tarun Jaiswal 3

1. Guru Ghasidas Central University, Bilaspur (C.G.), India; Post-Doctoral Research Fellow, Manipur International University, Imphal, Manipur, India

2. Manipur International University, Imphal, Manipur, India

3. National Institute of Technology (NIT), Raipur (C.G.), India

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2024.02.05

Received: 11 Dec. 2023 / Revised: 11 Jan. 2024 / Accepted: 20 Feb. 2024 / Published: 8 Apr. 2024

Index Terms

ResNet101, Self-attention, Image Caption, ViT and CNN, Beam Search


Abstract

Deep learning has substantially improved image captioning. The Transformer, a neural network architecture originally built for natural language processing, now excels at image captioning and other computer vision tasks. This paper reviews Transformer-based image captioning methods in detail. Traditional image captioning pipelines used convolutional neural networks (CNNs) to extract image features and recurrent neural networks (RNNs) or long short-term memory (LSTM) networks to generate captions. That approach often suffers from information bottlenecks and struggles to capture long-range dependencies. The Transformer architecture revolutionized natural language processing with its attention mechanism and parallel processing, and researchers have carried its success in language over to image captioning. By integrating visual and textual information in a single model, Transformer-based image captioning systems outperform previous methods in both accuracy and efficiency. This paper discusses how the Transformer's self-attention mechanisms and positional encodings are adapted for image captioning. Vision Transformers (ViTs) and hybrid CNN-Transformer models are covered, along with pre-training, fine-tuning, and reinforcement learning strategies for improving caption quality. We also examine the difficulties, current trends, and future directions of Transformer-based image captioning; open challenges include multimodal fusion, visual-text alignment, and caption interpretability. We expect future research to address these issues and to extend Transformer-based image captioning to domains such as medical imaging and remote sensing. Overall, this paper covers how Transformer-based approaches have changed image captioning and their potential to revolutionize multimodal interpretation and generation, advancing artificial intelligence and human-computer interaction.
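The beam search named in the title keeps the k highest-scoring partial captions at each decoding step instead of committing greedily to a single token. A minimal sketch of the idea follows; the `step_fn` interface, the toy next-token table, and the token names are illustrative assumptions for this sketch, not the paper's implementation:

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Decode a caption with beam search.

    step_fn(seq) -> {next_token: probability} for the partial caption `seq`.
    Partial captions are scored by summed log-probability; the beam keeps
    only the `beam_width` best at each step. (Production decoders usually
    also length-normalize the score; omitted here for brevity.)
    """
    beams = [([start_token], 0.0)]   # (token sequence, log-prob score)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:            # hypothesis finished
                completed.append((seq, score))
            else:                               # expand with every next token
                for token, prob in step_fn(seq).items():
                    candidates.append((seq + [token], score + math.log(prob)))
        if not candidates:                      # every beam has finished
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    else:
        completed.extend(beams)                 # length limit hit

    return max(completed, key=lambda c: c[1])[0]

# Toy next-token model keyed on the previous token (illustrative only).
NEXT = {
    "<s>":  {"a": 0.6, "the": 0.4},
    "a":    {"dog": 0.7, "cat": 0.3},
    "the":  {"dog": 0.9, "cat": 0.1},
    "dog":  {"runs": 0.8, "</s>": 0.2},
    "cat":  {"</s>": 1.0},
    "runs": {"</s>": 1.0},
}

caption = beam_search(lambda seq: NEXT[seq[-1]], "<s>", "</s>", beam_width=3)
print(" ".join(caption[1:-1]))  # -> a dog runs
```

With a real captioning model, `step_fn` would run the Transformer decoder over the image features and the tokens generated so far, returning the softmax distribution for the next token.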

Cite This Paper

Sushma Jaiswal, Harikumar Pallthadka, Rajesh P. Chinchewadi, Tarun Jaiswal, "Optimized Image Captioning: Hybrid Transformers Vision Transformers and Convolutional Neural Networks: Enhanced with Beam Search", International Journal of Intelligent Systems and Applications (IJISA), Vol.16, No.2, pp.53-61, 2024. DOI: 10.5815/ijisa.2024.02.05

