Ahsan Habib; Deloara Khushi; Masud Rana

Text-to-Image Synthesis Using MoCoGAN with Attention Mechanisms: A Unified Approach to Semantic and Dynamic Visual Representation

PDF (2400KB), PP.145-180


Author(s)

Ahsan Habib ¹,*, Deloara Khushi ², Masud Rana ²

1. Department of Software Engineering, University of Frontier Technology, Bangladesh, Gazipur, Bangladesh

2. Department of Cyber Security Engineering, University of Frontier Technology, Bangladesh, Gazipur, Bangladesh

* Corresponding author.

DOI: https://doi.org/10.5815/ijem.2026.03.10

Received: 25 Feb. 2026 / Revised: 23 Mar. 2026 / Accepted: 12 Apr. 2026 / Published: 8 Jun. 2026

Index Terms

Text-to-image synthesis, MoCoGAN, Computer Vision, Natural Language Processing, Attention mechanism, Consistency

Abstract

Generating realistic images from textual descriptions remains a core challenge in artificial intelligence, with broad applications in assistive technology, virtual environments, and creative media. Existing text-to-image synthesis models often struggle with fine-grained semantic alignment and motion-aware scene generation, particularly for dynamic or complex prompts. This paper presents MoCoGAN+ATT, an enhanced framework that extends the MoCoGAN architecture by integrating attention mechanisms and Bidirectional Encoder Representations from Transformers (BERT) to extract and align rich semantic features from text. The attention module establishes a correspondence between textual concepts and visual components, leading to semantically faithful and visually coherent image generation. We evaluate MoCoGAN+ATT on five benchmark datasets (COCO, CUB-200-2011, Oxford-102 Flowers, MSR-VTT, and Visual Genome) and report improvements over existing baselines. In terms of Inception Score (IS), Fréchet Inception Distance (FID), and R-Precision, respectively, the proposed model achieved 28.71, 11.91, and 94.92 on COCO; 27.36, 12.72, and 93.53 on CUB-200-2011; 28.63, 14.53, and 73.78 on Oxford-102 Flowers; 28.01, 12.62, and 96.43 on MSR-VTT; and 28.15, 17.93, and 94.52 on Visual Genome. The key novelty of this work lies in fusing motion-aware generative modeling with fine-grained attention-guided textual conditioning for dynamic image synthesis. These results support the effectiveness of this combination and point toward promising future directions for multimodal image generation.
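
As an illustration of the attention-guided conditioning described in the abstract, the following is a minimal sketch of scaled dot-product attention between word features and image-region features. It is not the authors' implementation: the function names, the toy two-dimensional features, and the choice of scaled dot-product scoring are all assumptions made for this example. In a real system the word features would come from a text encoder such as BERT and the region features from the generator's intermediate feature maps.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_region_attention(word_feats, region_feats):
    """Hypothetical helper: for each image region, attend over all words.

    word_feats:   list of T word vectors, each of length D
    region_feats: list of N region vectors, each of length D
    Returns (contexts, weights): N context vectors of length D and
    N attention distributions over the T words.
    """
    dim = len(word_feats[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for region in region_feats:
        # Similarity of this region to every word (scaled dot product).
        scores = [sum(r * w for r, w in zip(region, word)) / scale
                  for word in word_feats]
        attn = softmax(scores)
        # Word-context vector: attention-weighted sum of word features.
        context = [sum(a * word[k] for a, word in zip(attn, word_feats))
                   for k in range(dim)]
        weights.append(attn)
        contexts.append(context)
    return contexts, weights

# Toy example (assumed data): two "words" and three "regions".
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
contexts, weights = word_region_attention(words, regions)
```

Each region receives a probability distribution over the words, so a region whose features resemble a given word draws most of its conditioning signal from that word, while a region equally similar to both words weights them equally.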

Cite This Paper

Ahsan Habib, Deloara Khushi, Masud Rana, "Text-to-Image Synthesis Using MoCoGAN with Attention Mechanisms: A Unified Approach to Semantic and Dynamic Visual Representation", International Journal of Engineering and Manufacturing (IJEM), Vol.16, No.3, pp.145-180, 2026. DOI:10.5815/ijem.2026.03.10


International Journal of Engineering and Manufacturing (IJEM)