In Defense of Clip-based Video Relation Detection
Video Visual Relation Detection (VidVRD) aims to detect visual relationship
triplets in videos using spatial bounding boxes and temporal boundaries.
Existing VidVRD methods can be broadly categorized into bottom-up and top-down
paradigms, depending on their approach to classifying relations. Bottom-up
methods follow a clip-based approach where they classify relations of short
clip tubelet pairs and then merge them into long video relations. On the other
hand, top-down methods directly classify long video tubelet pairs. While recent
video-based methods utilizing video tubelets have shown promising results, we
argue that the effective modeling of spatial and temporal context plays a more
significant role than the choice between clip tubelets and video tubelets. This
motivates us to revisit the clip-based paradigm and explore the key success
factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM)
that enriches the object-based spatial context and relation-based temporal
context based on clips. We demonstrate that using clip tubelets can achieve
superior performance compared to most video-based methods. Additionally, using
clip tubelets offers more flexibility in model designs and helps alleviate the
limitations associated with video tubelets, such as the challenging long-term
object tracking problem and the loss of temporal information in long-term
tubelet feature compression. Extensive experiments conducted on two challenging
VidVRD benchmarks validate that our HCM achieves a new state-of-the-art
performance, highlighting the effectiveness of incorporating advanced spatial
and temporal context modeling within the clip-based paradigm.
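To make the clip-based, bottom-up paradigm concrete, the minimal Python sketch below merges clip-level relation triplets into video-level relations by greedily linking temporally adjacent clips that share the same subject track, object track, and predicate. The data structure and merging rule are illustrative assumptions, not HCM's exact procedure.

from dataclasses import dataclass

@dataclass
class ClipRelation:
    # A relation triplet predicted on one short clip (hypothetical structure).
    subj_track: int      # id of the subject tubelet
    obj_track: int       # id of the object tubelet
    predicate: str       # e.g. "ride", "next_to"
    start: int           # clip start frame
    end: int             # clip end frame
    score: float

def merge_clip_relations(clip_rels):
    """Greedily merge temporally adjacent clip-level triplets that share the same
    subject/object tracks and predicate into longer video-level relations."""
    ordered = sorted(clip_rels,
                     key=lambda r: (r.subj_track, r.obj_track, r.predicate, r.start))
    merged = []
    for r in ordered:
        last = merged[-1] if merged else None
        if (last is not None
                and (last.subj_track, last.obj_track, last.predicate)
                    == (r.subj_track, r.obj_track, r.predicate)
                and r.start <= last.end + 1):
            # Extend the current video-level relation and average the scores.
            last.end = max(last.end, r.end)
            last.score = (last.score + r.score) / 2
        else:
            merged.append(ClipRelation(**vars(r)))
    return merged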
Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks
We propose a novel framework called Semantics-Preserving Adversarial
Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test
images and their classes are both unseen during training. SP-AEN aims to tackle
the inherent problem of semantic loss in the prevailing family of
embedding-based ZSL, where some semantics are discarded during training if
they are non-discriminative for training classes, but could become critical for
recognizing test classes. Specifically, SP-AEN prevents the semantic loss by
introducing an independent visual-to-semantic space embedder which disentangles
the semantic space into two subspaces for the two arguably conflicting
objectives: classification and reconstruction. Through adversarial learning of
the two subspaces, SP-AEN can transfer the semantics from the reconstructive
subspace to the discriminative one, accomplishing the improved zero-shot
recognition of unseen classes. Compared with prior work, SP-AEN not only
improves classification but also generates photo-realistic images, demonstrating
the effectiveness of semantic preservation. On four popular benchmarks: CUB,
AWA, SUN and aPY, SP-AEN considerably outperforms other state-of-the-art
methods by an absolute performance difference of 12.2%, 9.3%, 4.0%, and
3.6% in terms of harmonic mean value.
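As an illustration of the two-subspace design, the PyTorch sketch below uses one embedder for a discriminative semantic subspace scored against class embeddings, an independent embedder for a reconstructive subspace, and a small discriminator supplying the adversarial signal between them; the layer sizes, names, and omitted image decoder are assumptions, not SP-AEN's actual implementation.

import torch
import torch.nn as nn

class TwoSubspaceSketch(nn.Module):
    """Minimal sketch: separate semantic embeddings for classification and
    reconstruction (dimensions assumed, image decoder omitted)."""
    def __init__(self, feat_dim=2048, sem_dim=300):
        super().__init__()
        self.cls_embedder = nn.Linear(feat_dim, sem_dim)   # discriminative subspace
        self.rec_embedder = nn.Linear(feat_dim, sem_dim)   # reconstructive subspace
        self.discriminator = nn.Sequential(                # adversary between the two subspaces
            nn.Linear(sem_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, visual_feat, class_embeddings):
        z_cls = self.cls_embedder(visual_feat)             # used for zero-shot classification
        z_rec = self.rec_embedder(visual_feat)             # fed to an image decoder (not shown)
        logits = z_cls @ class_embeddings.t()              # compatibility with every class vector
        adv_rec = self.discriminator(z_rec)                # adversarial scores used to transfer
        adv_cls = self.discriminator(z_cls)                # semantics between the subspaces
        return logits, z_rec, adv_rec, adv_cls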
LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos
Analyzing the interactions between humans and objects in a video involves
identifying the relationships between the humans and the objects present in
the video. The task can be thought of as a specialized version of Visual
Relationship Detection, wherein one of the objects must be a human. While traditional
methods formulate the problem as inference on a sequence of video segments, we
present a hierarchical approach, LIGHTEN, to learn visual features to
effectively capture spatio-temporal cues at multiple granularities in a video.
Unlike current approaches, LIGHTEN avoids using ground truth data like depth
maps or 3D human pose, thus increasing generalization across non-RGBD datasets
as well. Furthermore, we achieve the same using only the visual features,
instead of the commonly used hand-crafted spatial features. We achieve
state-of-the-art results on the human-object interaction detection (88.9% and
92.6%) and anticipation tasks of CAD-120, and competitive results on
image-based HOI detection on the V-COCO dataset, setting a new benchmark for
approaches based purely on visual features. Code for LIGHTEN is available at
https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI
Comment: 9 pages, 6 figures, ACM Multimedia Conference 202
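The graph-then-temporal pattern described above can be sketched as per-frame message passing over human/object nodes followed by a GRU over frame-level features; the layers and dimensions below are assumptions for illustration, not LIGHTEN's published architecture.

import torch
import torch.nn as nn

class GraphTemporalSketch(nn.Module):
    """Per-frame message passing over detected nodes, then temporal aggregation."""
    def __init__(self, node_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim, node_dim), nn.ReLU())
        self.temporal = nn.GRU(node_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, nodes):                                  # nodes: [T, N, node_dim]
        T, N, D = nodes.shape
        # Spatial step: each node aggregates messages from every node in the same frame.
        src = nodes.unsqueeze(2).expand(T, N, N, D)
        dst = nodes.unsqueeze(1).expand(T, N, N, D)
        messages = self.edge_mlp(torch.cat([src, dst], dim=-1)).mean(dim=2)   # [T, N, D]
        frame_feat = messages.mean(dim=1)                                     # [T, D]
        # Temporal step: summarize the frame sequence and classify the interaction.
        _, h = self.temporal(frame_feat.unsqueeze(0))                         # h: [1, 1, hidden_dim]
        return self.classifier(h[-1]).squeeze(0)                              # [num_classes]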
TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation
Dynamic scene graph generation (SGG) focuses on detecting objects in a video
and determining their pairwise relationships. Existing dynamic SGG methods
usually suffer from several issues, including 1) contextual noise, as some
frames may contain occluded or blurred objects, and 2) label bias, primarily
due to the severe imbalance between a few positive relationship samples and
numerous negative ones; in addition, the distribution of relationships
exhibits a long-tailed pattern. To address these problems, in this paper we
introduce a network named TD^2-Net that aims at denoising and debiasing for dynamic
SGG. Specifically, we first propose a denoising spatio-temporal transformer
module that enhances object representation with robust contextual information.
This is achieved by designing a differentiable Top-K object selector that
utilizes the Gumbel-Softmax sampling strategy to select the relevant
neighborhood for each object. Second, we introduce an asymmetrical reweighting
loss to relieve the issue of label bias. This loss function integrates
asymmetry focusing factors and the volume of samples to adjust the weights
assigned to individual samples. Systematic experimental results demonstrate
the superiority of our proposed TD^2-Net over existing state-of-the-art
approaches on the Action Genome dataset. In particular, TD^2-Net outperforms
the second-best competitor by 12.7% on mean-Recall@10 for predicate
classification.
Comment: Accepted by AAAI 202
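The differentiable Top-K object selector mentioned above can be sketched as Gumbel-perturbed affinities with a straight-through top-k mask; the exact relaxation used by TD^2-Net may differ, and the function below is only an illustration.

import torch

def differentiable_topk_neighbors(affinity_logits, k, tau=1.0):
    """Relaxed top-k neighborhood selection from an [N, N] object-affinity matrix.
    Returns a mask that is hard (0/1) in the forward pass but lets gradients flow
    through the soft Gumbel-Softmax distribution (straight-through estimator)."""
    # Perturb logits with Gumbel noise so the selection is stochastic yet reparameterizable.
    gumbel = -torch.log(-torch.log(torch.rand_like(affinity_logits) + 1e-10) + 1e-10)
    soft = torch.softmax((affinity_logits + gumbel) / tau, dim=-1)
    topk_idx = soft.topk(k, dim=-1).indices
    hard = torch.zeros_like(soft).scatter_(-1, topk_idx, 1.0)
    return hard + soft - soft.detach()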
Action Class Relation Detection and Classification Across Multiple Video Datasets
The Meta Video Dataset (MetaVD) provides annotated relations between action
classes in major datasets for human action recognition in videos. Although
these annotated relations enable dataset augmentation, this augmentation is
only applicable to the datasets covered by MetaVD. For an external dataset to
enjoy the same benefit, the
relations between its action classes and those in MetaVD need to be determined.
To address this issue, we consider two new machine learning tasks: action class
relation detection and classification. We propose a unified model to predict
relations between action classes, using language and visual information
associated with classes. Experimental results show that (i) recent
pre-trained neural network models for text and video contribute to high
predictive performance, (ii) relation prediction based on action label texts
is more accurate than that based on videos, and (iii) a blending approach that
combines predictions from both modalities can further improve predictive
performance in some cases.
Comment: Accepted to Pattern Recognition Letters. 12 pages, 4 figures
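Finding (iii) suggests a simple late fusion; the sketch below blends per-pair relation-class probabilities from the text-based and video-based predictors with a convex weight, where the weight value is a hypothetical choice that would be tuned on validation data.

import numpy as np

def blend_relation_predictions(text_probs, video_probs, alpha=0.7):
    """Blend [num_pairs, num_relation_types] probability arrays from the two
    modalities; alpha is the (assumed) weight on the text-based predictions."""
    blended = alpha * text_probs + (1.0 - alpha) * video_probs
    return blended / blended.sum(axis=1, keepdims=True)   # renormalize per class pair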
Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation
This paper investigates the problem of scene graph generation in videos with
the aim of capturing semantic relations between subjects and objects in the
form of ⟨subject, predicate, object⟩ triplets. Recognizing the predicate
between subject and object pairs is imbalanced and multi-label in nature,
ranging from ubiquitous interactions such as spatial relationships (e.g., in
front of) to rare interactions such as twisting. In
widely-used benchmarks such as Action Genome and VidOR, the imbalance ratio
between the most and least frequent predicates reaches 3,218 and 3,408,
respectively, surpassing even benchmarks specifically designed for long-tailed
recognition. Due to the long-tailed distributions and label co-occurrences,
recent state-of-the-art methods predominantly focus on the most frequently
occurring predicate classes, ignoring those in the long tail. In this paper, we
analyze the limitations of current approaches for scene graph generation in
videos and identify a one-to-one correspondence between predicate frequency and
recall performance. To take a step towards unbiased scene graph generation in
videos, we introduce a multi-label meta-learning framework to deal with the
biased predicate distribution. Our meta-learning framework learns a meta-weight
network for each training sample over all possible label losses. We evaluate
our approach on the Action Genome and VidOR benchmarks by building upon two
current state-of-the-art methods for each benchmark. The experiments
demonstrate that the multi-label meta-weight network improves the performance
for predicates in the long tail without compromising performance for head
classes, resulting in better overall performance and favorable
generalizability. Code: https://github.com/shanshuo/ML-MWN
Comment: ICMR 202
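A minimal sketch of the per-sample meta-weight idea follows: a small network maps each sample's vector of per-predicate losses to per-predicate weights that rescale a multi-label loss. The architecture and sizes are assumptions, and the bi-level meta-optimization loop that actually trains the weight network is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaWeightNet(nn.Module):
    """Maps a sample's per-label losses to per-label weights in (0, 1)."""
    def __init__(self, num_predicates, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_predicates, hidden), nn.ReLU(),
            nn.Linear(hidden, num_predicates), nn.Sigmoid())

    def forward(self, per_label_losses):
        return self.net(per_label_losses)

def weighted_multilabel_loss(logits, targets, meta_net):
    # Per-sample, per-predicate BCE losses for the multi-label predicate task.
    losses = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    weights = meta_net(losses.detach())        # reweight each label loss per sample
    return (weights * losses).sum(dim=1).mean()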
MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding
Given an untrimmed video and natural language query, video sentence grounding
aims to localize the target temporal moment in the video. Existing methods
mainly tackle this task by matching and aligning the semantics of the
descriptive sentence and video segments at a single temporal resolution,
neglecting the temporal consistency of video content across resolutions. In this
work, we propose a novel multi-resolution temporal video sentence grounding
network: MRTNet, which consists of a multi-modal feature encoder, a
Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT
module is an encoder-decoder network whose decoder features are combined with
Transformers to predict the final start and end timestamps. Notably, the MRT
module is hot-pluggable, meaning it can be seamlessly incorporated into any
anchor-free model. In addition, we use a hybrid loss to supervise the
cross-modal features in the MRT module at three scales (frame-level,
clip-level, and sequence-level) for more accurate grounding. Extensive
experiments on three prevalent datasets show the effectiveness of MRTNet.
Comment: work in progress
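One plausible form of a three-scale hybrid loss is sketched below, pairing frame-level and clip-level relevance terms with a sequence-level boundary regression term; the concrete terms and weights are assumptions rather than MRTNet's exact formulation.

import torch.nn.functional as F

def hybrid_grounding_loss(frame_scores, clip_scores, pred_bounds,
                          frame_labels, clip_labels, gt_bounds,
                          w_frame=1.0, w_clip=1.0, w_seq=1.0):
    """Combine supervision at three temporal scales (weights assumed)."""
    loss_frame = F.binary_cross_entropy_with_logits(frame_scores, frame_labels)
    loss_clip = F.binary_cross_entropy_with_logits(clip_scores, clip_labels)
    loss_seq = F.l1_loss(pred_bounds, gt_bounds)   # start/end timestamp regression
    return w_frame * loss_frame + w_clip * loss_clip + w_seq * loss_seq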
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
We propose LASER, a neuro-symbolic approach to learn semantic video
representations that capture rich spatial and temporal properties in video data
by leveraging high-level logic specifications. In particular, we formulate the
problem in terms of alignment between raw videos and spatio-temporal logic
specifications. The alignment algorithm leverages a differentiable symbolic
reasoner and a combination of contrastive, temporal, and semantic losses. It
effectively and efficiently trains low-level perception models to extract
fine-grained video representation in the form of a spatio-temporal scene graph
that conforms to the desired high-level specification. In doing so, we explore
a novel methodology that weakly supervises the learning of video semantic
representations through logic specifications. We evaluate our method on two
datasets with rich spatial and temporal specifications:
20BN-Something-Something and MUGEN. We demonstrate that our method learns
better fine-grained video semantics than existing baselines.
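To illustrate the flavor of weak supervision through differentiable logic, the toy sketch below evaluates a single temporal operator, a soft "eventually", over per-frame predicate probabilities and turns it into a loss when the specification is known to hold; it is not LASER's specification language or reasoner.

import torch

def soft_eventually(pred_probs):
    """Soft semantics of "eventually p": the probability that the predicate
    holds in at least one frame, given per-frame probabilities of shape [T]."""
    return 1.0 - torch.prod(1.0 - pred_probs)

def weak_supervision_loss(pred_probs):
    # Push the perception model to satisfy a specification known to hold for this video.
    return -torch.log(soft_eventually(pred_probs) + 1e-8)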