Work place: Department of Cyber Security Engineering, University of Frontier Technology, Bangladesh, Gazipur, Bangladesh
E-mail: khushi0001@uftb.ac.bd
Website: https://orcid.org/0009-0007-4630-0067
Research Interests:
Biography
Mst Deloara Khushi is currently a Lecturer in the Department of Cyber Security Engineering, University of Frontier Technology, Bangladesh. She holds an M.Sc. in Computer Science and Engineering from Jagannath University (JnU), Bangladesh, and a B.Sc. in the same field from Hajee Mohammad Danesh Science and Technology University. Her current focus lies at the intersection of machine learning, particularly deep learning techniques, and the critical field of cyber security.
By Ahsan Habib, Deloara Khushi, Masud Rana
DOI: https://doi.org/10.5815/ijem.2026.03.10, Pub. Date: 8 Jun. 2026
Generating realistic images from textual descriptions remains a core challenge in artificial intelligence, with broad applications in assistive technology, virtual environments, and creative media. Existing text-to-image synthesis models often struggle with fine-grained semantic alignment and motion-aware scene generation, particularly in dynamic or complex prompts. This paper presents MoCoGAN+ATT, an enhanced framework that extends the MoCoGAN architecture by integrating attention mechanisms and Bidirectional Encoder Representations from Transformers (BERT) to extract and align rich semantic features from text. The attention module enables precise correspondence between textual concepts and visual components, leading to semantically faithful and visually coherent image generation. We evaluate MoCoGAN+ATT on five benchmark datasets—COCO, CUB-200-2011, Oxford-102 Flowers, MSR-VTT, and Visual Genome—demonstrating notable improvements over existing baselines. Specifically, on the COCO dataset, the proposed model achieved an Inception Score of 28.71, FID of 11.91, and R-Precision of 94.92; on CUB-200-2011, it obtained 27.36, 12.72, and 93.53 respectively; on Oxford-102 Flowers, the model achieved 28.63 (IS), 14.53 (FID), and 73.78 (R-Precision); on MSR-VTT, results were 28.01, 12.62, and 96.43; and on Visual Genome, we recorded 28.15, 17.93, and 94.52. The key novelty of this work lies in fusing motion-aware generative modeling with fine-grained attention-guided textual conditioning for dynamic image synthesis. These results highlight the effectiveness of combining attention-based textual conditioning with motion-aware generative modeling and point toward promising future directions for advancing multimodal image generation.