Ara-RATGAN for Arabic Text to Image Synthesis
International Journal of Intelligent Computing and Information Sciences
Volume 25, Issue 2, June 2025, Pages 63-73
Document Type: Original Article
DOI: 10.21608/ijicis.2025.392458.1400
Authors
Mostafa Samy Samy
1: 146 Madint Naser, Madint Al_Tawfeq, Cairo, Egypt
2: Ain Shams University
3: Prof., Scientific Computing Department, Faculty of Computers and Information Sciences, Ain Shams University, Cairo, Egypt
Abstract
Current text-to-image systems have shown outstanding performance in the automated synthesis of realistic images from text descriptions. Previous approaches typically employ multiple separate fusion blocks to adaptively fuse text information into the generation process, which increases training difficulty and causes the blocks to conflict with one another. To address these concerns, we present Arabic Recurrent Affine Transformation (Ara-RATGAN), a novel framework that integrates AraBERT, a BERT model pretrained on billions of Arabic words that produces robust Arabic sentence embeddings, with Recurrent Affine Transformation (RAT) to generate high-quality images from Arabic-language text descriptions. Furthermore, a spatial attention model in the discriminator promotes semantic coherence between the text and the synthesized images: it identifies the image regions corresponding to the text and directs the generator to produce visual content that better matches the Arabic descriptions. We conducted extensive experiments on an Arabic CUB dataset translated from English, which show that our proposed model outperforms previous Arabic text-to-image models. Our approach addresses two key challenges: (1) Text-Image Fusion: unlike traditional methods that use isolated fusion blocks, we employ RAT to model long-term dependencies across layers, ensuring global consistency in text conditioning. (2) Semantic Alignment: a spatial attention mechanism in the discriminator enhances the semantic coherence between the synthesized visuals and the Arabic text.
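To give a rough intuition for the Recurrent Affine Transformation mentioned in the abstract, the sketch below applies a FiLM-style channel-wise affine modulation whose scale and shift come from a recurrent hidden state carried across generator stages, so text conditioning stays globally consistent instead of being computed by isolated fusion blocks. All dimensions, the tanh recurrent cell, and the parameter names here are illustrative assumptions for exposition, not the paper's actual architecture (which uses learned RNN-driven affine layers inside a GAN generator conditioned on AraBERT embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's hyperparameters)
TEXT_DIM = 8      # sentence-embedding size (AraBERT output in the paper)
HIDDEN_DIM = 8    # recurrent hidden state shared across generator stages
CHANNELS = 4      # feature-map channels at one generator stage

# Hypothetical weights of the recurrent cell and the affine heads
W_h = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))
W_x = rng.normal(scale=0.1, size=(HIDDEN_DIM, TEXT_DIM))
W_gamma = rng.normal(scale=0.1, size=(CHANNELS, HIDDEN_DIM))
W_beta = rng.normal(scale=0.1, size=(CHANNELS, HIDDEN_DIM))

def rat_step(h, text_emb, feat):
    """One recurrent-affine step.

    The hidden state h carries text-conditioning information across
    generator stages; gamma/beta scale and shift the feature map
    channel-wise (a FiLM-style affine transformation).
    """
    h = np.tanh(W_h @ h + W_x @ text_emb)      # recurrent state update
    gamma = W_gamma @ h                         # per-channel scale
    beta = W_beta @ h                           # per-channel shift
    # Broadcast the affine parameters over the spatial dims of (C, H, W)
    out = (1.0 + gamma)[:, None, None] * feat + beta[:, None, None]
    return h, out

# Run three generator stages sharing one hidden state
text_emb = rng.normal(size=TEXT_DIM)   # stands in for an AraBERT embedding
h = np.zeros(HIDDEN_DIM)
feat = rng.normal(size=(CHANNELS, 5, 5))
for _ in range(3):
    h, feat = rat_step(h, text_emb, feat)

print(feat.shape)  # (4, 5, 5)
```

Because the same hidden state feeds every stage's affine parameters, conditioning at later stages depends on earlier stages, which is the long-term dependency the abstract contrasts with independent per-stage fusion blocks.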
Keywords
AraBERT; Generative Adversarial Networks (GANs); Text-to-Image; Feature Fusion