Transformer-Based Backbones for Scene Graph Generation: A Comparative Analysis
International Journal of Intelligent Computing and Information Sciences
Volume 24, Issue 3, September 2024, Pages 1-10
Document Type: Original Article
DOI: 10.21608/ijicis.2024.301597.1342
Authors
Mohammad Essam*1; Dina Khattab2; Howida Shedeed3; Mohamed Tolba4
1Faculty of Computer and Information Sciences, Ain Shams University
2Scientific Computing Department, Faculty of Computer & Information Sciences, Ain Shams University, Cairo, Egypt
3Faculty of Computer and Information Sciences (FCIS), Ain Shams University
4Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, 11566, Egypt
Abstract
A scene graph is a structured representation of an image that explicitly describes the scene as a set of objects, their attributes, and the relationships that link the objects. With recent advances in computer vision, researchers have turned towards more complex reasoning and a higher-level understanding of visual scenes. Visual Question Answering, image generation, and cross-modal retrieval are examples of complex vision tasks that require such understanding, and the scene graph is an effective data structure for highlighting the complex visual relationships present in a scene. In this work, we provide a comparative analysis of backbone models for Scene Graph Generation (SGG). We compare Convolutional Neural Network (CNN) backbones with vision transformer-based backbones using the RelTR model. The analysis showed that the SwiftFormer L3 and MiT-B2 transformer backbones improved model performance over the ResNet50 CNN backbone by 2.1% and 2.5% Recall@50, respectively, when evaluated on the same Visual Genome 50 test split. Visual Genome 50 is a tailored version of the Visual Genome dataset that contains only the 50 most common relationships and the 150 most frequent object classes.
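To make the terms in the abstract concrete, the following is a minimal sketch of a scene graph as a data structure and of a simplified Recall@K computation. The names `Relationship`, `SceneGraph`, and `recall_at_k` are hypothetical and do not come from the paper or from RelTR. The metric here matches triplets by label only, whereas SGG benchmarks typically also require the predicted bounding boxes to overlap the ground-truth boxes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A (subject, predicate, object) triplet, e.g. (man, riding, horse)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Objects, per-object attributes, and relationships between objects."""
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)   # object name -> list of attributes
    relationships: list = field(default_factory=list)

    def add_relationship(self, subject, predicate, obj):
        # Both endpoints become nodes of the graph; the triplet becomes an edge.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(subject, predicate, obj))


def recall_at_k(ranked_predictions, ground_truth, k=50):
    """Fraction of ground-truth triplets that appear in the top-k predictions.

    Simplifying assumption: triplets are matched by labels only, with no
    bounding-box overlap check.
    """
    truth = set(ground_truth)
    if not truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    return len(truth & top_k) / len(truth)


# Ground-truth graph for a toy image.
graph = SceneGraph()
graph.add_relationship("man", "riding", "horse")
graph.add_relationship("horse", "on", "grass")
graph.attributes["horse"] = ["brown"]

# Model predictions, sorted from most to least confident.
predictions = [
    Relationship("man", "riding", "horse"),
    Relationship("man", "near", "horse"),
    Relationship("horse", "on", "grass"),
]

recall_top2 = recall_at_k(predictions, graph.relationships, k=2)
recall_top50 = recall_at_k(predictions, graph.relationships, k=50)
```

In this toy case only one of the two ground-truth triplets is among the top two predictions, so `recall_top2` is 0.5, while `recall_top50` is 1.0 because all three predictions fall within the cutoff.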
Keywords
Scene Graph; Scene Graph Generation; Transformer-Based Backbone; Visual Relationship Detection; Low Resolution