LAC: Latent Action Composition for Skeleton-based Action Segmentation
Skeleton-based action segmentation requires recognizing composable actions in
untrimmed videos. Current approaches decouple this problem by first extracting
local visual features from skeleton sequences and then processing them by a
temporal model to classify frame-wise actions. However, their performances
remain limited as the visual features cannot sufficiently express composable
actions. In this context, we propose Latent Action Composition (LAC), a novel
self-supervised framework aiming at learning from synthesized composable
motions for skeleton-based action segmentation. LAC includes a novel
generation module for synthesizing new sequences. Specifically, we design a
linear latent space in the generator to represent primitive motion. New
composed motions can be synthesized by simply performing arithmetic operations
on latent representations of multiple input skeleton sequences. LAC leverages
such synthesized sequences, which have large diversity and complexity, for
learning visual representations of skeletons in both sequence and frame spaces
via contrastive learning. The resulting visual encoder has a high expressive
power and can be effectively transferred onto action segmentation tasks by
end-to-end fine-tuning without the need for additional temporal models. We
conduct a study focusing on transfer-learning and we show that representations
learned from pre-trained LAC outperform the state-of-the-art by a large margin
on the TSU, Charades, and PKU-MMD datasets. Comment: ICCV 202
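The latent arithmetic described in the LAC abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `compose_latents` and `decode_linear`, the two-dimensional latent codes, and the hand-written basis are hypothetical, and the decoder is a plain weighted sum rather than the paper's generator. The sketch only shows why a linear latent space lets two primitive motions be combined by adding their codes.

```python
from typing import List

Vector = List[float]


def compose_latents(z_a: Vector, z_b: Vector) -> Vector:
    """Combine two latent codes by element-wise addition."""
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same length")
    return [a + b for a, b in zip(z_a, z_b)]


def decode_linear(z: Vector, basis: List[Vector]) -> Vector:
    """Toy linear decoder: weighted sum of basis motion directions."""
    out = [0.0] * len(basis[0])
    for coeff, direction in zip(z, basis):
        for j, value in enumerate(direction):
            out[j] += coeff * value
    return out


# Hypothetical basis: each row is one primitive motion direction.
basis = [
    [1.0, 0.0, 0.0],  # e.g. an arm-related primitive
    [0.0, 1.0, 0.0],  # e.g. a leg-related primitive
]
z_wave = [0.8, 0.0]  # latent code of a first input sequence (made up)
z_walk = [0.0, 0.5]  # latent code of a second input sequence (made up)

z_combined = compose_latents(z_wave, z_walk)
composed_motion = decode_linear(z_combined, basis)
print(z_combined, composed_motion)
```

Because the toy decoder is linear, decoding the summed code gives the same result as summing the two decoded motions, which is the property that makes simple arithmetic on latent codes meaningful.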
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment in an
untrimmed video according to the given natural language query. Existing works
often only focus on one aspect of this emerging task, such as the query
representation learning, video context modeling or multi-modal fusion, and thus
fail to develop a comprehensive system for further performance improvement. In
this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) to
consider multiple crucial factors for this challenging task, including (1) the
syntactic structure of natural language queries; (2) long-range semantic
dependencies in video context and (3) sufficient cross-modal interaction.
Specifically, we devise a syntactic GCN to leverage the syntactic structure of
queries for fine-grained representation learning, propose a multi-head
self-attention to capture long-range semantic dependencies from video context,
and then employ a multi-stage cross-modal interaction to explore the potential
relations of video and query contents. The extensive experiments demonstrate
the effectiveness of our proposed method. Comment: Accepted by SIGIR 2019 as a full paper
ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization
This paper addresses the challenge of point-supervised temporal action
detection, in which only one frame per action instance is annotated in the
training set. Self-training aims to provide supplementary supervision for the
training process by generating pseudo-labels (action proposals) from a base
model. However, most current methods generate action proposals by applying
manually designed thresholds to action classification probabilities and
treating adjacent snippets as independent entities. As a result, these methods
struggle to generate complete action proposals, exhibit sensitivity to
fluctuations in action classification scores, and generate redundant and
overlapping action proposals. This paper proposes a novel framework termed
ADM-Loc, which stands for Actionness Distribution Modeling for point-supervised
action Localization. ADM-Loc generates action proposals by fitting a composite
distribution, comprising both Gaussian and uniform distributions, to the action
classification signals. This fitting process is tailored to each action class
present in the video and is applied separately for each action instance,
ensuring the distinctiveness of their distributions. ADM-Loc significantly
enhances the alignment between the generated action proposals and ground-truth
action instances and offers high-quality pseudo-labels for self-training.
Moreover, to model action boundary snippets, it enforces consistency in action
classification scores during training by employing Gaussian kernels, supervised
with the proposed loss functions. ADM-Loc outperforms the state-of-the-art
point-supervised methods on the THUMOS14 and ActivityNet-v1.2 datasets.
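To make the idea of fitting a Gaussian plus uniform composite to classification scores concrete, here is a minimal sketch. It is not ADM-Loc's procedure: the abstract does not give the fitting method, so this swaps in a standard EM fit of a two-component mixture over score-weighted snippet indices. The function names, the initial values, the synthetic scores, and the choice of a two-sigma interval as the proposal are all assumptions.

```python
import math
from typing import List, Tuple


def fit_gaussian_plus_uniform(scores: List[float],
                              iterations: int = 50) -> Tuple[float, float, float]:
    """EM fit of pi * Normal(mu, sigma) + (1 - pi) * Uniform over snippet indices.

    Each snippet index t is weighted by its score. Returns (mu, sigma, pi).
    """
    num_snippets = len(scores)
    positions = range(num_snippets)
    total_weight = sum(scores)
    # Initial guesses (assumptions): peak location, broad width, equal mixing.
    mu = float(max(positions, key=lambda t: scores[t]))
    sigma = num_snippets / 6.0
    pi = 0.5
    uniform_density = 1.0 / num_snippets

    for _ in range(iterations):
        # E-step: responsibility of the Gaussian component for each snippet.
        resp = []
        for t in positions:
            gauss = math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (
                sigma * math.sqrt(2.0 * math.pi))
            numerator = pi * gauss
            resp.append(numerator / (numerator + (1.0 - pi) * uniform_density))
        # M-step: score-weighted updates of the Gaussian parameters and mixing weight.
        weighted = [w * r for w, r in zip(scores, resp)]
        mass = sum(weighted)
        mu = sum(m * t for m, t in zip(weighted, positions)) / mass
        variance = sum(m * (t - mu) ** 2 for m, t in zip(weighted, positions)) / mass
        sigma = max(math.sqrt(variance), 1e-3)  # floor avoids a degenerate spike
        pi = min(max(mass / total_weight, 1e-6), 1.0 - 1e-6)
    return mu, sigma, pi


def proposal_from_fit(mu: float, sigma: float, num_snippets: int,
                      width: float = 2.0) -> Tuple[int, int]:
    """Turn the fitted Gaussian into a (start, end) snippet interval."""
    start = max(0, math.floor(mu - width * sigma))
    end = min(num_snippets - 1, math.ceil(mu + width * sigma))
    return start, end


# Synthetic scores: flat background plus one bump centred on snippet 10.
scores = [0.05 + 0.9 * math.exp(-0.5 * ((t - 10) / 2.0) ** 2) for t in range(21)]
mu, sigma, pi = fit_gaussian_plus_uniform(scores)
start, end = proposal_from_fit(mu, sigma, len(scores))
print(mu, sigma, pi, (start, end))
```

The uniform component absorbs the flat background scores, so the Gaussian width reflects the extent of the action rather than being stretched by low scores far from the peak. A fixed threshold on the raw scores has no such separation.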
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in video. Several attempts have
been made to capture frame-level salient aspects through attention, but they
lack the capacity to effectively map the temporal relationships between
frames, as they capture only a limited span of temporal dependencies. To this
end we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales, to form
embeddings at frame level and segment level, and perform fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including MERL Shopping, 50 Salads, and Georgia Tech
Egocentric datasets, and achieve state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings which are inclusive of
static overhead camera views and dynamic, ego-centric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings. Comment: Published in Pattern Recognition Letters
Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos
Taking advantage of human pose data for understanding human activities has
attracted much attention these days. However, state-of-the-art pose estimators
struggle to obtain high-quality 2D or 3D pose data due to occlusion,
truncation and low resolution in real-world unannotated videos. Hence, in this
work, we propose 1) a Selective Spatio-Temporal Aggregation mechanism, named
SST-A, that refines and smooths the keypoint locations extracted by multiple
expert pose estimators, and 2) an effective weakly-supervised self-training
framework which leverages the aggregated poses as pseudo ground-truth instead
of handcrafted annotations for real-world pose estimation. Extensive
experiments are conducted for evaluating not only the upstream pose refinement
but also the downstream action recognition performance on four datasets, Toyota
Smarthome, NTU-RGB+D, Charades, and Kinetics-50. We demonstrate that the
skeleton data refined by our Pose-Refinement system (SSTA-PRS) is effective at
boosting various existing action recognition models, which achieve competitive
or state-of-the-art performance. Comment: WACV202
Self-Feedback DETR for Temporal Action Detection
Temporal Action Detection (TAD) is challenging but fundamental for real-world
video applications. Recently, DETR-based models have been devised for TAD but
have not yet performed well. In this paper, we point out a problem in the
self-attention of DETR for TAD: the attention modules focus on only a few key
elements, which we call the temporal collapse problem. It degrades the capability of the
encoder and decoder since their self-attention modules play no role. To solve
the problem, we propose a novel framework, Self-DETR, which utilizes
cross-attention maps of the decoder to reactivate self-attention modules. We
recover the relationship between encoder features by simple matrix
multiplication of the cross-attention map and its transpose. Likewise, we also
obtain the information within decoder queries. By guiding the collapsed
self-attention maps with the computed guidance map, we alleviate the temporal collapse of
self-attention modules in the encoder and decoder. Our extensive experiments
demonstrate that Self-DETR resolves the temporal collapse problem by keeping
high diversity of attention over all layers. Comment: Accepted to ICCV 202
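The product of the cross-attention map and its transpose can be sketched in a few lines. This sketch assumes a cross-attention map `A` with one row per decoder query and one column per encoder feature. Under that layout, `A^T A` relates encoder features to each other and `A A^T` relates decoder queries to each other. The helper names and the example values are hypothetical. Any normalization or loss that Self-DETR applies to these maps is not stated in the abstract and is left out.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Hypothetical cross-attention map: 2 decoder queries attending over 3 encoder features.
cross_attention = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

# Feature-to-feature relations (3 x 3): features attended by the same queries score high.
encoder_guidance = matmul(transpose(cross_attention), cross_attention)

# Query-to-query relations (2 x 2): queries attending to the same features score high.
decoder_guidance = matmul(cross_attention, transpose(cross_attention))

for row in encoder_guidance:
    print(row)
for row in decoder_guidance:
    print(row)
```

Both products are symmetric by construction, so each can be read as a pairwise similarity map over encoder features or over decoder queries.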
Spatiotemporal Event Graphs for Dynamic Scene Understanding
Dynamic scene understanding is the ability of a computer system to interpret
and make sense of the visual information present in a video of a real-world
scene. In this thesis, we present a series of frameworks for dynamic scene
understanding starting from road event detection from an autonomous driving
perspective to complex video activity detection, followed by continual learning
approaches for the life-long learning of the models. Firstly, we introduce the
ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge
the first of its kind. Due to the lack of datasets equipped with formally
specified logical requirements, we also introduce the ROad event Awareness
Dataset with logical Requirements (ROAD-R), the first publicly available
dataset for autonomous driving with requirements expressed as logical
constraints, as a tool for driving neurosymbolic research in the area. Next, we
extend event detection to holistic scene understanding by proposing two complex
activity detection methods. In the first method, we present a deformable,
spatiotemporal scene graph approach, consisting of three main building blocks:
action tube detection, a 3D deformable RoI pooling layer designed for learning
the flexible, deformable geometry of the constituent action tubes, and a scene
graph constructed by considering all parts as nodes and connecting them based
on different semantics. In a second approach evolving from the first, we
propose a hybrid graph neural network that combines attention applied to a
graph encoding of the local (short-term) dynamic scene with a temporal graph
modelling the overall long-duration activity. Finally, the last part of the
thesis presents a new continual semi-supervised learning (CSSL)
paradigm. Comment: PhD thesis, Oxford Brookes University, Examiners: Prof. Dima Damen
and Dr. Matthias Rolf, 183 pages
DPHANet: Discriminative Parallel and Hierarchical Attention Network for Natural Language Video Localization
Natural Language Video Localization (NLVL) has recently attracted much
attention because of its practical significance. However, existing methods
still face the following challenges: 1) when the models learn intra-modal
semantic association, the temporal causal interaction information and
contextual semantic discriminative information are ignored, resulting in a
lack of intra-modal semantic context connection; 2) when learning fusion
representations, existing cross-modal interaction modules lack a hierarchical
attention function to extract inter-modal similarity information and
intra-modal self-correlation information, resulting in insufficient
cross-modal information interaction; 3) when the loss function is optimized,
existing models ignore the correlation of causal inference between the start
and end boundaries, resulting in inaccurate start and end boundary
calibrations. To address these challenges, we propose a novel NLVL model
called the Discriminative Parallel and Hierarchical Attention Network
(DPHANet). Specifically, we emphasize the importance of temporal causal
interaction information and contextual semantic discriminative information
and correspondingly propose a Discriminative Parallel Attention Encoder
(DPAE) module to infer and encode this critical information. In addition, to
overcome the shortcomings of existing cross-modal interaction modules, we
design a Video-Query Hierarchical Attention (VQHA) module, which performs
cross-modal interaction and intra-modal self-correlation modeling in a
hierarchical manner. Furthermore, we propose a novel deviation loss function
to capture the correlation of causal inference between the start and end
boundaries and force the model to focus on the continuity and temporal
causality in the video. Finally, extensive experiments on three benchmark
datasets demonstrate the superiority of our proposed DPHANet model, which
achieves about 1.5% and 3.5% average performance improvement and about 2.5%
and 7.5% maximum performance improvement on the Charades-STA and TACoS
datasets, respectively.