Video Visual Relation Detection (VidVRD) aims to detect visual relationship
triplets in videos using spatial bounding boxes and temporal boundaries.
Existing VidVRD methods can be broadly categorized into bottom-up and top-down
paradigms, according to how they classify relations. Bottom-up methods follow a
clip-based approach: they classify the relations of short clip tubelet pairs
and then merge these clip-level results into long-term video relations.
Top-down methods, in contrast, directly classify pairs of long video tubelets.
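To make the bottom-up pipeline concrete, the following is a minimal sketch of
the clip-merging step, assuming clip-level predictions arrive as (subject,
predicate, object) triplets with tubelet association ids; the greedy
contiguity rule is our illustrative assumption, not the exact procedure of any
particular method.

```python
from dataclasses import dataclass

@dataclass
class ClipTriplet:
    # Relation predicted on one short clip: (subject, predicate, object)
    subj_id: int   # association id of the subject tubelet
    pred: str      # predicate label, e.g. "chase"
    obj_id: int    # association id of the object tubelet
    start: int     # first frame of the clip
    end: int       # last frame of the clip (exclusive)
    score: float   # relation confidence

def merge_clip_triplets(clip_triplets):
    """Greedily merge temporally adjacent clip-level triplets that share the
    same (subject, predicate, object) identity into video-level relations.
    An illustrative association rule, not HCM's exact procedure."""
    clip_triplets = sorted(clip_triplets, key=lambda t: t.start)
    merged = []
    for t in clip_triplets:
        for m in merged:
            # Same triplet identity and temporally contiguous -> extend.
            if (m.subj_id, m.pred, m.obj_id) == (t.subj_id, t.pred, t.obj_id) \
                    and m.end >= t.start:
                m.end = max(m.end, t.end)
                m.score = max(m.score, t.score)  # keep the best clip score
                break
        else:
            merged.append(ClipTriplet(**vars(t)))
    return merged

# Two adjacent "dog-chase-cat" clips merge into one video-level relation.
clips = [ClipTriplet(0, "chase", 1, 0, 30, 0.8),
         ClipTriplet(0, "chase", 1, 30, 60, 0.9)]
print(merge_clip_triplets(clips))  # one relation spanning frames [0, 60)
```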
While recent video-based methods built on video tubelets have shown promising
results, we argue that effective modeling of spatial and temporal context
plays a more significant role than the choice between clip tubelets and video
tubelets. This motivates us to revisit the clip-based paradigm and explore the
key success factors in VidVRD. In this paper, we propose a Hierarchical
Context Model (HCM) that enriches object-based spatial context and
relation-based temporal context at the clip level.
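As a rough illustration of such hierarchical context modeling, the sketch
below stacks object-level attention within each clip on top of relation-level
attention across clips; the module names, feature dimensions, and the
132-predicate output (the label count of ImageNet-VidVRD) are our assumptions
for illustration, not HCM's actual architecture.

```python
import torch
import torch.nn as nn

class HierarchicalContext(nn.Module):
    """Illustrative two-level context encoder: self-attention over the objects
    of one clip (spatial context), then self-attention over the per-clip
    relation embeddings of a tubelet pair across time (temporal context)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 132)  # e.g. 132 predicate classes

    def forward(self, obj_feats, pair_index):
        # obj_feats: (clips, objects, dim) tubelet features per clip
        # pair_index: (subject_idx, object_idx) of the pair to classify
        ctx = self.spatial(obj_feats)             # objects attend within a clip
        s, o = pair_index
        rel = ctx[:, s] + ctx[:, o]               # (clips, dim) pair embedding
        rel = self.temporal(rel.unsqueeze(0))[0]  # clips attend across time
        return self.classifier(rel)               # (clips, predicates) logits

logits = HierarchicalContext()(torch.randn(8, 5, 256), (0, 1))
print(logits.shape)  # torch.Size([8, 132])
```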
We demonstrate that clip tubelets can achieve performance superior to that of
most video-based methods. Moreover, clip tubelets allow more flexible model
designs and alleviate the limitations of video tubelets, such as the
challenging long-term object tracking problem and the loss of temporal
information when long-term tubelet features are compressed. Extensive
experiments on two challenging
VidVRD benchmarks validate that our HCM achieves new state-of-the-art
performance, highlighting the effectiveness of advanced spatial and temporal
context modeling within the clip-based paradigm.