Human-Object Interaction (HOI) detection plays a vital role in scene
understanding; it aims to predict HOI triplets in the form of <human,
object, action>. Existing methods mainly extract multi-modal features (e.g.,
appearance, object semantics, human pose) and then fuse them to predict HOI
triplets directly. However, most of these methods focus on self-triplet
aggregation while ignoring potential cross-triplet dependencies, which results
in ambiguous action predictions. In this work, we
propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI
detection. Specifically, we regard each triplet proposal as a graph in which
Human and Object are nodes and Action is the edge, and we aggregate
self-triplet correlations over this graph. We further explore cross-triplet
dependencies by jointly considering instance-level, semantic-level, and
layout-level relations. In addition, we leverage the CLIP model to help SCTC
obtain interaction-aware features via knowledge distillation, which provides
useful action cues for HOI detection. Extensive experiments on the HICO-DET
and V-COCO datasets verify the effectiveness of the proposed SCTC.
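
To make the two ideas named above concrete, here is a minimal PyTorch sketch of (a) self-triplet aggregation over a <human, object, action> graph, where the two nodes and the edge exchange messages, and (b) a CLIP-based distillation objective that aligns interaction features with frozen CLIP embeddings. The module and function names (TripletGraphLayer, clip_distillation_loss), the update rules, and the feature dimensions are hypothetical illustrations under simplifying assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletGraphLayer(nn.Module):
    """Hypothetical self-triplet aggregation: for each triplet graph, the
    Human and Object nodes exchange messages through the Action edge."""
    def __init__(self, dim):
        super().__init__()
        self.node_to_edge = nn.Linear(2 * dim, dim)  # (human, object) -> edge update
        self.edge_to_node = nn.Linear(2 * dim, dim)  # (node, edge) -> node update

    def forward(self, h, o, a):
        # h, o, a: [num_triplets, dim] human, object, and action features
        a = a + F.relu(self.node_to_edge(torch.cat([h, o], dim=-1)))  # update edge
        h = h + F.relu(self.edge_to_node(torch.cat([h, a], dim=-1)))  # update human node
        o = o + F.relu(self.edge_to_node(torch.cat([o, a], dim=-1)))  # update object node
        return h, o, a

def clip_distillation_loss(action_feat, clip_feat):
    """Hypothetical distillation objective: pull the interaction-aware action
    feature toward the corresponding (frozen) CLIP embedding."""
    return 1.0 - F.cosine_similarity(action_feat, clip_feat, dim=-1).mean()

# Toy usage with random features for 8 triplet proposals.
layer = TripletGraphLayer(dim=256)
h, o, a = (torch.randn(8, 256) for _ in range(3))
h, o, a = layer(h, o, a)
loss = clip_distillation_loss(a, torch.randn(8, 256))  # stand-in CLIP targets
```

In this sketch, residual updates keep the original proposal features intact while the learned messages inject triplet-internal context; the cross-triplet (instance-, semantic-, and layout-level) relations described above would operate across the num_triplets dimension and are omitted here for brevity.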