Transformer-Based Backbones for Scene Graph Generation: A Comparative Analysis

Essam, Mohammad; Khattab, Dina; Shedeed, Howida; Tolba, Mohamed

doi:10.21608/ijicis.2024.301597.1342

International Journal of Intelligent Computing and Information Sciences
Volume 24, Issue 3, September 2024, Pages 1-10
Document Type: Original Article
DOI: 10.21608/ijicis.2024.301597.1342
Authors
Mohammad Essam*¹; Dina Khattab²; Howida Shedeed³; Mohamed Tolba⁴
¹Faculty of Computer and Information Sciences, Ain Shams University
²Scientific Computing Department, Faculty of Computer & Information Sciences, Ain Shams University, Cairo, Egypt
³FCIS - Ain Shams Univ.
⁴Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, 11566, Egypt
Abstract
A scene graph is a structured representation of an image that explicitly describes the scene as a set of objects, their attributes, and the relationships linking the objects. With the advances in computer vision, researchers have turned to more complex reasoning and a higher level of understanding of visual scenes. Visual Question Answering, image generation, and cross-modal retrieval are examples of complex vision tasks that require this level of understanding, and the scene graph is an effective data structure for highlighting the complex visual relationships present in a scene. In this work, we provide a comparative analysis of backbone models for Scene Graph Generation (SGG). We compare Convolutional Neural Network (CNN) backbones with vision transformer-based backbones using the RelTR model. The analysis showed that the SwiftFormer L3 and MiT-B2 transformer backbones improved Recall@50 over the ResNet50 CNN backbone by 2.1% and 2.5%, respectively, when evaluated on the same Visual Genome 50 test split. Visual Genome 50 is a tailored version of the Visual Genome dataset that contains only the 50 most common relationships and the 150 most frequent object classes.
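For readers new to the terminology, the sketch below illustrates the two ideas the abstract relies on: a scene graph as a set of objects, attributes, and relationship triplets, and Recall@K as the fraction of ground-truth triplets found among the top K predictions. It is a simplified illustration and not the RelTR implementation: the names SceneGraph, Relationship, and recall_at_k are hypothetical, and triplets are matched by label only, whereas scene graph benchmarks typically also require the predicted bounding boxes to overlap the ground-truth boxes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One (subject, predicate, object) triplet; hypothetical helper type."""
    subject: str
    predicate: str
    object: str


@dataclass
class SceneGraph:
    """Objects, per-object attributes, and relationships between objects."""
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_relationship(self, subject, predicate, obj):
        # Register both endpoints as objects, then store the triplet.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(subject, predicate, obj))


def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets present in the top-k predictions.

    Assumption: label-only matching; box overlap is ignored in this sketch.
    """
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)


# Toy ground-truth graph for a single image.
graph = SceneGraph()
graph.add_relationship("man", "riding", "horse")
graph.add_relationship("horse", "on", "grass")
graph.attributes["horse"] = ["brown"]
ground_truth = graph.relationships

# Toy model output, ordered from most to least confident.
predictions = [
    Relationship("man", "riding", "horse"),
    Relationship("man", "on", "grass"),
    Relationship("horse", "on", "grass"),
]

print(recall_at_k(predictions, ground_truth, k=2))  # 0.5
print(recall_at_k(predictions, ground_truth, k=3))  # 1.0
```

In this toy case only one of the two ground-truth triplets appears in the top two predictions, so Recall@2 is 0.5; Recall@50 in the abstract applies the same idea with K = 50, averaged over the test images.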
Keywords
Scene Graph; Scene Graph Generation; Transformer-Based Backbone; Visual Relationship Detection; Low Resolution
