Single-Stage Visual Relationship Learning using Conditional Queries
Research in scene graph generation (SGG) usually considers two-stage models: first detecting a set of entities, then combining them and labeling all possible relationships. While showing promising results, this pipeline structure induces a large parameter and computation overhead and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely TraCQ, a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space, which leads to 20% fewer parameters than state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable
of end-to-end training and faster inference.

Comment: Accepted to NeurIPS 2022
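The conditional-query idea admits a short sketch. Below is a minimal, hypothetical PyTorch rendering of a DETR-style two-decoder design in which entity queries are derived from predicate features; the class names, layer sizes, heads, and the exact conditioning step are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of conditional queries for SGG (not the TraCQ code).
# A predicate decoder produces one feature per triplet query; entity queries
# are then *conditioned* on those features, so the entity decoder only refines
# the subject/object of each triplet rather than scoring all entity pairs.
import torch
import torch.nn as nn

class TraCQSketch(nn.Module):
    def __init__(self, d_model=256, n_queries=100, n_predicates=50, n_entities=150):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.predicate_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.entity_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.queries = nn.Embedding(n_queries, d_model)      # learned triplet queries
        self.to_entity_query = nn.Linear(d_model, d_model)   # conditions entity queries
        self.predicate_head = nn.Linear(d_model, n_predicates)
        self.entity_head = nn.Linear(d_model, 2 * n_entities)  # subject + object labels
        self.box_head = nn.Linear(d_model, 2 * 4)               # subject + object boxes

    def forward(self, image_features):
        # image_features: (batch, num_tokens, d_model) from a DETR-style encoder
        b = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        # Stage 1: decode one predicate feature per triplet query.
        pred_feats = self.predicate_decoder(q, image_features)
        # Stage 2: entity queries are derived from the predicate features,
        # avoiding a search over the combinatorial entity-pair space.
        ent_feats = self.entity_decoder(self.to_entity_query(pred_feats), image_features)
        return {
            "predicate_logits": self.predicate_head(pred_feats),
            "entity_logits": self.entity_head(ent_feats),
            "boxes": self.box_head(ent_feats).sigmoid(),  # normalized coordinates
        }

model = TraCQSketch()
out = model(torch.randn(2, 196, 256))  # e.g. encoder output for a 14x14 feature map
```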
Devil is in the Tails: Visual Relationship Recognition
Significant effort has recently been devoted to modeling visual relations. This has mostly addressed the design of architectures, typically by adding parameters and increasing model complexity. However, visual relation learning is a long-tailed problem, due to the combinatorial nature of joint reasoning about groups of objects. Increasing model complexity is, in general, ill-suited for long-tailed problems, because more complex models tend to overfit. In this thesis, we explore an alternative hypothesis, denoted the Devil is in the Tails. Under this hypothesis, better performance is achieved by keeping the model simple but improving its ability to cope with long-tailed distributions. To test this hypothesis, we devise a new approach for training visual relationship models, based on an iterative decoupled training scheme denoted Decoupled Training for Devil in the Tails (DT2). DT2 employs a novel sampling approach, Alternating Class-Balanced Sampling (ACBS), to capture the interplay between the long-tailed entity and predicate distributions of visual relations. Results show that, with an extremely simple architecture, DT2-ACBS significantly outperforms much more complex state-of-the-art methods on scene graph generation tasks. This suggests that the development of sophisticated models must be considered in tandem with the long-tailed nature of the problem.
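The alternation at the heart of ACBS can be illustrated in a few lines. The following Python sketch alternates epochs between sampling balanced over entity classes and sampling balanced over predicate classes; the function names, the per-epoch schedule, and the toy data are assumptions for illustration, and the actual DT2/ACBS procedure may differ in detail.

```python
# Hedged sketch of alternating class-balanced sampling (ACBS-style).
# Idea: neither the long-tailed entity distribution nor the long-tailed
# predicate distribution should dominate training, so we alternate which
# label space the sampler balances over.
import random
from collections import defaultdict

def class_balanced_sample(samples, label_key, n_draws):
    """Draw n_draws samples with equal probability per class (oversamples tail classes)."""
    by_class = defaultdict(list)
    for s in samples:
        by_class[s[label_key]].append(s)
    classes = list(by_class)
    # First pick a class uniformly, then an instance of that class.
    return [random.choice(by_class[random.choice(classes)]) for _ in range(n_draws)]

def acbs_epochs(samples, n_epochs, n_draws):
    """Alternate between entity-balanced and predicate-balanced epochs."""
    for epoch in range(n_epochs):
        key = "entity" if epoch % 2 == 0 else "predicate"
        yield key, class_balanced_sample(samples, key, n_draws)

# Toy usage: each relation annotation carries an entity and a predicate label.
data = [{"entity": e, "predicate": p} for e, p in
        [("person", "on"), ("person", "riding"), ("dog", "on"), ("horse", "riding")]]
for phase, batch in acbs_epochs(data, n_epochs=4, n_draws=8):
    print(phase, [(b["entity"], b["predicate"]) for b in batch])
```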