LAC: Latent Action Composition for Skeleton-based Action Segmentation
Skeleton-based action segmentation requires recognizing composable actions in
untrimmed videos. Current approaches decouple this problem by first extracting
local visual features from skeleton sequences and then processing them with a
temporal model to classify frame-wise actions. However, their performance
remains limited, as the visual features cannot sufficiently express composable
actions. In this context, we propose Latent Action Composition (LAC), a novel
self-supervised framework aiming at learning from synthesized composable
motions for skeleton-based action segmentation. LAC comprises a novel
generation module for synthesizing new sequences. Specifically, we design a
linear latent space in the generator to represent primitive motion. New
composed motions can be synthesized by simply performing arithmetic operations
on latent representations of multiple input skeleton sequences. LAC leverages
such synthesized sequences, which have large diversity and complexity, for
learning visual representations of skeletons in both sequence and frame spaces
via contrastive learning. The resulting visual encoder has high expressive
power and can be effectively transferred to action segmentation tasks by
end-to-end fine-tuning without the need for additional temporal models. We
conduct a transfer-learning study and show that representations
learned from pre-trained LAC outperform the state-of-the-art by a large margin
on the TSU, Charades, and PKU-MMD datasets.
Comment: ICCV 2023
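The latent-space arithmetic at the heart of LAC is easy to illustrate. Below is a minimal sketch, assuming a toy recurrent autoencoder and plain latent addition (all module names and sizes are illustrative, not the authors' generator):

```python
# Minimal sketch of latent motion composition (PyTorch); all module and
# dimension names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    """Toy encoder/decoder over skeleton sequences of shape (B, T, J*3)."""
    def __init__(self, in_dim=75, latent_dim=128):
        super().__init__()
        self.encoder = nn.GRU(in_dim, latent_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, in_dim, batch_first=True)

    def encode(self, x):          # x: (B, T, in_dim) -> (B, T, latent_dim)
        h, _ = self.encoder(x)
        return h

    def decode(self, z):          # z: (B, T, latent_dim) -> (B, T, in_dim)
        x_hat, _ = self.decoder(z)
        return x_hat

model = MotionAutoencoder()
seq_a = torch.randn(1, 64, 75)    # e.g. a "walking" skeleton sequence
seq_b = torch.randn(1, 64, 75)    # e.g. a "waving" skeleton sequence

# Compose two primitive motions by simple arithmetic on their latents
z = model.encode(seq_a) + model.encode(seq_b)
composed = model.decode(z)        # synthesized composed motion
print(composed.shape)             # torch.Size([1, 64, 75])
```

The sketch only shows the composition step; in the paper, contrastive learning on such synthesized sequences is what gives the encoder its expressive power.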
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment in an
untrimmed video according to the given natural language query. Existing works
often focus on only one aspect of this emerging task, such as query
representation learning, video context modeling, or multi-modal fusion, and
thus fail to develop a comprehensive system for further performance improvement. In
this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) to
consider multiple crucial factors for this challenging task, including (1) the
syntactic structure of natural language queries; (2) long-range semantic
dependencies in the video context; and (3) sufficient cross-modal interaction.
Specifically, we devise a syntactic GCN to leverage the syntactic structure of
queries for fine-grained representation learning, propose a multi-head
self-attention mechanism to capture long-range semantic dependencies in the
video context, and then employ a multi-stage cross-modal interaction to explore the potential
relations between video and query contents. Extensive experiments demonstrate
the effectiveness of our proposed method.
Comment: Accepted by SIGIR 2019 as a full paper
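Two of the three ingredients can be sketched with PyTorch's built-in attention module; the syntactic GCN and the multi-stage fusion are omitted, and all dimensions are assumptions:

```python
# Sketch of (2) long-range video context and (3) cross-modal interaction
# with PyTorch's built-in attention; dimensions are assumptions.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
video_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

video = torch.randn(2, 200, d_model)   # (batch, frames, feature)
query = torch.randn(2, 12, d_model)    # (batch, words, feature)

# Multi-head self-attention over the video captures long-range dependencies
video_ctx, _ = video_self_attn(video, video, video)

# One round of cross-modal interaction: each frame attends to query words
fused, _ = cross_attn(video_ctx, query, query)
print(fused.shape)                     # torch.Size([2, 200, 256])
```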
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in video. Several attempts have
been made to capture frame-level salient aspects through attention, but these
lack the capacity to effectively model the temporal relationships between
frames, as they capture only a limited span of temporal dependencies. To this
end, we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales to form
embeddings at the frame and segment levels, and performs fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech
Egocentric datasets, and achieve state-of-the-art performance. The evaluated
datasets encompass a variety of video capture settings, including
static overhead camera views and dynamic, egocentric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings.
Comment: Published in Pattern Recognition Letters
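The frame-to-segment hierarchy can be sketched as follows, assuming fixed-length segments and simple attention pooling (layer sizes are illustrative, not the published architecture):

```python
# Sketch of a two-level (frame -> segment) recurrent attention hierarchy.
# Layer sizes and the fixed segment length are assumptions for illustration.
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Attention-weighted pooling over the time axis."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                    # h: (B, T, dim)
        w = torch.softmax(self.score(h), dim=1)
        return (w * h).sum(dim=1)            # (B, dim)

frame_rnn = nn.GRU(64, 128, batch_first=True)
seg_rnn = nn.GRU(128, 128, batch_first=True)
pool = AttnPool(128)

x = torch.randn(1, 160, 64)                  # 160 frames = 10 segments of 16
frames, _ = frame_rnn(x)                     # frame-level embeddings
segs = frames.view(10, 16, 128)              # split into fixed segments
seg_emb = pool(segs).view(1, 10, 128)        # one embedding per segment
seg_ctx, _ = seg_rnn(seg_emb)                # segment-level context
print(seg_ctx.shape)                         # torch.Size([1, 10, 128])
```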
Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos
Taking advantage of human pose data for understanding human activities has
attracted much attention in recent years. However, state-of-the-art pose estimators
struggle to obtain high-quality 2D or 3D pose data due to occlusion,
truncation, and low resolution in real-world unannotated videos. Hence, in this
work, we propose 1) a Selective Spatio-Temporal Aggregation mechanism, named
SST-A, that refines and smooths the keypoint locations extracted by multiple
expert pose estimators, and 2) an effective weakly-supervised self-training
framework which leverages the aggregated poses as pseudo ground-truth instead
of handcrafted annotations for real-world pose estimation. Extensive
experiments are conducted for evaluating not only the upstream pose refinement
but also the downstream action recognition performance on four datasets: Toyota
Smarthome, NTU-RGB+D, Charades, and Kinetics-50. We demonstrate that the
skeleton data refined by our Pose-Refinement system (SSTA-PRS) is effective at
boosting various existing action recognition models, achieving competitive
or state-of-the-art performance.
Comment: WACV 2021
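The aggregate-then-smooth idea can be sketched as follows (the confidence weighting and moving-average smoothing are assumptions for illustration, not the paper's selection mechanism):

```python
# Sketch: aggregate keypoints from several expert pose estimators by
# confidence weighting, then smooth along time. Shapes are assumptions.
import torch
import torch.nn.functional as F

# poses: (experts, T, joints, 2) 2D keypoints; conf: (experts, T, joints)
poses = torch.randn(3, 100, 17, 2)
conf = torch.rand(3, 100, 17)

# Spatial aggregation: confidence-weighted mean across expert estimators
w = conf / conf.sum(dim=0, keepdim=True)            # normalize per joint/frame
agg = (w.unsqueeze(-1) * poses).sum(dim=0)          # (T, joints, 2)

# Temporal smoothing: moving average along time with replicate padding
k = 5
x = agg.permute(1, 2, 0).reshape(-1, 1, agg.shape[0])    # (joints*2, 1, T)
x = F.avg_pool1d(F.pad(x, (k // 2, k // 2), mode="replicate"), k, stride=1)
smoothed = x.reshape(17, 2, -1).permute(2, 0, 1)         # (T, joints, 2)
print(smoothed.shape)                                    # torch.Size([100, 17, 2])
```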
Self-Feedback DETR for Temporal Action Detection
Temporal Action Detection (TAD) is challenging but fundamental for real-world
video applications. Recently, DETR-based models have been devised for TAD but
have not yet performed well. In this paper, we point out a problem in the
self-attention of DETR for TAD: the attention modules focus on only a few key
elements, which we call the temporal collapse problem. It degrades the
capability of the encoder and decoder, since their self-attention modules play
no role. To solve
the problem, we propose a novel framework, Self-DETR, which utilizes
cross-attention maps of the decoder to reactivate self-attention modules. We
recover the relationships between encoder features by a simple matrix
multiplication of the cross-attention map and its transpose; likewise, we
obtain the relationships among decoder queries. By guiding the collapsed
self-attention maps with the resulting guidance map, we resolve the temporal
collapse of the self-attention modules in the encoder and decoder. Our extensive experiments
demonstrate that Self-DETR resolves the temporal collapse problem by maintaining
high diversity of attention across all layers.
Comment: Accepted to ICCV 2023
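The guidance-map construction is stated directly in the abstract and is simple to reproduce; the KL loss at the end is a hypothetical training detail, and the shapes are assumptions:

```python
# Sketch of the guidance map from a decoder cross-attention map A
# (queries x encoder positions); shapes are assumptions.
import torch
import torch.nn.functional as F

n_queries, n_enc = 40, 100
A = torch.softmax(torch.randn(n_queries, n_enc), dim=-1)   # cross-attention

enc_guidance = A.T @ A   # relates encoder positions (n_enc x n_enc)
dec_guidance = A @ A.T   # relates decoder queries (n_queries x n_queries)

# Re-normalize rows into valid attention distributions
enc_guidance = enc_guidance / enc_guidance.sum(dim=-1, keepdim=True)
dec_guidance = dec_guidance / dec_guidance.sum(dim=-1, keepdim=True)

# A collapsed self-attention map could then be pulled toward the guidance,
# e.g. with a KL loss (hypothetical detail, not the paper's exact objective)
self_attn = torch.softmax(torch.randn(n_enc, n_enc), dim=-1)
loss = F.kl_div(self_attn.log(), enc_guidance, reduction="batchmean")
print(loss.item())
```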
Frame-wise Cross-modal Matching for Video Moment Retrieval
Video moment retrieval aims to retrieve a moment in a video for a given
language query. The challenges of this task include 1) the requirement of
localizing the relevant moment in an untrimmed video, and 2) bridging the
semantic gap between the textual query and video contents. To tackle these
problems, early approaches adopt sliding windows or uniform sampling to
collect video clips first and then match each clip against the query. However,
these strategies are time-consuming and often yield unsatisfactory
localization accuracy due to the unpredictable length of the ground-truth
moment. To avoid these limitations, researchers have recently attempted to
directly predict the relevant moment boundaries without first generating
video clips. One
mainstream approach is to generate a multimodal feature vector for the target
query and video frames (e.g., by concatenation) and then apply a regression
model to the multimodal feature vector for boundary detection. Although this
approach has achieved some progress, we argue that these methods do not
adequately capture the cross-modal interactions between the query and video
frames.
In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM)
model, which predicts the temporal boundaries based on interaction modeling.
In addition, an attention module is introduced to assign higher weights to
query words with richer semantic cues, which are considered to be more
important for finding relevant video content. Another contribution is an
additional predictor that utilizes the internal frames during model training
to improve localization accuracy. Extensive experiments on two
datasets, TACoS and Charades-STA, demonstrate the superiority of our method
over several state-of-the-art methods. Ablation studies have also been
conducted to examine the effectiveness of the different modules in our ACRM model.
Comment: 12 pages; accepted by IEEE TMM
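The word-attention and frame-wise matching steps can be sketched as follows (module names and sizes are assumptions, not the ACRM code):

```python
# Sketch: weight query words by attention, then score every frame against
# the pooled query for frame-wise relevance. Names/sizes are assumptions.
import torch
import torch.nn as nn

d = 256
word_score = nn.Linear(d, 1)                 # attention over query words

frames = torch.randn(1, 120, d)              # (batch, frames, feature)
words = torch.randn(1, 10, d)                # (batch, words, feature)

# Attention assigns higher weight to semantically richer words
w = torch.softmax(word_score(words), dim=1)  # (1, 10, 1)
query_vec = (w * words).sum(dim=1)           # (1, d)

# Frame-wise relevance: cosine similarity of each frame with the query
rel = torch.cosine_similarity(frames, query_vec.unsqueeze(1), dim=-1)
print(rel.shape)                             # torch.Size([1, 120])
# A regression head on the fused features would then predict the
# start/end boundaries (omitted here).
```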