Research in scene graph generation (SGG) usually considers two-stage models:
first detecting a set of entities, then pairing them and labeling all possible
relationships. While such pipelines show promising results, their structure
incurs large parameter and computation overhead and typically hinders
end-to-end optimization. To address this, recent research attempts to
train single-stage models that are computationally efficient. With the advent
of DETR, a set-based detection model, one-stage models attempt to predict a set
of subject-predicate-object triplets directly in a single shot. However, SGG is
inherently a multi-task learning problem that requires modeling entity and
predicate distributions simultaneously. In this paper, we propose Transformers
with conditional queries for SGG, namely TraCQ, with a new formulation for SGG
that avoids the multi-task learning problem and the combinatorial entity-pair
distribution. We employ a DETR-based encoder-decoder design and leverage
conditional queries to also significantly reduce the entity label space,
leading to 20% fewer parameters than state-of-the-art single-stage
models. Experimental results show that TraCQ not only outperforms existing
single-stage scene graph generation methods but also beats many
state-of-the-art two-stage methods on the Visual Genome dataset, while
remaining capable
of end-to-end training and faster inference.
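
The abstract describes a DETR-style encoder-decoder in which conditional queries tie the entity predictions to predicate-centric features. The sketch below is only an illustration of that general idea under our own assumptions; the module names, dimensions, layer counts, and output packing are hypothetical and are not taken from the paper's implementation.

```python
# Minimal sketch (not the authors' implementation): a DETR-style model where a
# predicate decoder produces conditional queries that a lightweight entity
# decoder refines. All hyperparameters and head designs here are illustrative.
import torch
import torch.nn as nn

class ConditionalQuerySGG(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_predicates=50, num_entities=150):
        super().__init__()
        # Learned triplet queries, one per candidate subject-predicate-object triplet
        self.queries = nn.Embedding(num_queries, d_model)
        pred_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.predicate_decoder = nn.TransformerDecoder(pred_layer, num_layers=6)
        ent_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.entity_decoder = nn.TransformerDecoder(ent_layer, num_layers=2)
        self.predicate_head = nn.Linear(d_model, num_predicates)           # predicate class logits
        self.entity_head = nn.Linear(d_model, 2 * (num_entities + 4))      # subj/obj labels + boxes (illustrative packing)

    def forward(self, image_features):
        # image_features: (B, HW, d_model) from a backbone + Transformer encoder
        b = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        # Predicate-centric features decoded from the learned queries
        pred_feats = self.predicate_decoder(q, image_features)
        # Conditional queries: the entity decoder is conditioned on predicate features
        ent_feats = self.entity_decoder(pred_feats, image_features)
        return self.predicate_head(pred_feats), self.entity_head(ent_feats)

# Example usage with random features standing in for a hypothetical encoder output:
model = ConditionalQuerySGG()
feats = torch.randn(2, 49, 256)        # (batch, flattened spatial positions, d_model)
pred_logits, ent_out = model(feats)
```

In this sketch the entity decoder never sees a combinatorial pair distribution: it only refines the per-triplet features produced by the predicate decoder, which is one plausible reading of how conditional queries could shrink the entity label space.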