347 research outputs found
Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos
Most recent approaches for action recognition from video leverage deep
architectures to encode the video clip into a fixed length representation
vector that is then used for classification. For this to be successful, the
network must be capable of suppressing irrelevant scene background and extracting
the representation from the most discriminative part of the video. Our
contribution builds on the observation that spatio-temporal patterns
characterizing actions in videos are highly correlated with objects and their
location in the video. We propose Top-down Attention Action VLAD (TA-VLAD), a
deep recurrent architecture with built-in spatial attention that performs
temporally aggregated VLAD encoding for action recognition from videos. We
adopt a top-down approach to attention, using class-specific activation maps
obtained from a deep CNN pre-trained for image classification to weight
appearance features before encoding them into a fixed-length video descriptor
using Gated Recurrent Units. Our method achieves state-of-the-art recognition
accuracy on the HMDB51 and UCF101 benchmarks.
Comment: Accepted to the 17th International Conference of the Italian Association for Artificial Intelligence
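Below is a minimal PyTorch-style sketch of the idea described above: class activation maps from a pre-trained classifier weight per-frame appearance features, which are then soft-assigned to learned cluster centres (NetVLAD-style) and aggregated over time with a GRU. The module and parameter names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (not the authors' code) of top-down-attention-weighted
# feature encoding with a GRU aggregator, in the spirit of TA-VLAD.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVLADEncoder(nn.Module):
    def __init__(self, feat_dim=512, num_clusters=16, hidden_dim=1024):
        super().__init__()
        # Soft-assignment to VLAD cluster centres (NetVLAD-style).
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.assign = nn.Conv2d(feat_dim, num_clusters, kernel_size=1)
        # GRU aggregates per-frame VLAD vectors into a clip descriptor.
        self.gru = nn.GRU(num_clusters * feat_dim, hidden_dim, batch_first=True)

    def forward(self, feats, cam):
        """feats: (B, T, C, H, W) CNN features; cam: (B, T, 1, H, W) class activation maps."""
        B, T, C, H, W = feats.shape
        # Top-down attention: weight appearance features by the normalized CAM.
        attn = torch.softmax(cam.view(B, T, 1, -1), dim=-1).view(B, T, 1, H, W)
        weighted = feats * attn
        vlads = []
        for t in range(T):
            x = weighted[:, t]                               # (B, C, H, W)
            soft = torch.softmax(self.assign(x), dim=1)      # (B, K, H, W)
            x_flat = x.view(B, C, -1)                        # (B, C, N)
            soft_flat = soft.view(B, -1, H * W)              # (B, K, N)
            # Residuals to each centre, weighted by the soft assignment.
            resid = x_flat.unsqueeze(1) - self.centers.view(1, -1, C, 1)
            vlad = (resid * soft_flat.unsqueeze(2)).sum(-1)  # (B, K, C)
            vlads.append(F.normalize(vlad.flatten(1), dim=1))
        _, h = self.gru(torch.stack(vlads, dim=1))           # aggregate over time
        return h[-1]                                         # (B, hidden_dim) clip descriptor
```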
LSTA: Long Short-Term Attention for Egocentric Action Recognition
Egocentric activity recognition is one of the most challenging tasks in video
analysis. It requires a fine-grained discrimination of small objects and their
manipulation. While some methods rely on strong supervision and attention
mechanisms, they are either costly to annotate or do not take spatio-temporal
patterns into account. In this paper we propose LSTA as a mechanism to focus on
features from spatially relevant parts while attention is tracked smoothly
across the video sequence. We demonstrate the effectiveness of LSTA on
egocentric activity recognition with an end-to-end trainable two-stream
architecture, achieving state-of-the-art performance on four standard
benchmarks.
Comment: Accepted to CVPR 2019
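As a rough illustration of the mechanism (an assumption based on the abstract, not the published LSTA cell), the sketch below keeps a spatial attention map that is blended with the previous frame's map, so the focus is tracked smoothly across the sequence before pooling the features.

```python
# Minimal sketch (assumed, not the published LSTA cell) of recurrent spatial
# attention that is carried smoothly across frames before feature pooling.
import torch
import torch.nn as nn

class SmoothSpatialAttention(nn.Module):
    def __init__(self, feat_dim=512, momentum=0.7):
        super().__init__()
        self.score = nn.Conv2d(feat_dim, 1, kernel_size=1)  # per-location attention logit
        self.momentum = momentum                             # how strongly past attention persists

    def forward(self, feats):
        """feats: (B, T, C, H, W) frame features -> (B, T, C) attended frame vectors."""
        B, T, C, H, W = feats.shape
        prev_attn = torch.full((B, 1, H, W), 1.0 / (H * W), device=feats.device)
        pooled = []
        for t in range(T):
            x = feats[:, t]
            attn = torch.softmax(self.score(x).flatten(2), dim=-1).view(B, 1, H, W)
            # Track attention smoothly: blend the new map with the previous one.
            attn = self.momentum * prev_attn + (1 - self.momentum) * attn
            prev_attn = attn
            pooled.append((x * attn).sum(dim=(2, 3)))        # attention-weighted spatial pooling
        return torch.stack(pooled, dim=1)
```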
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
Many human activities take minutes to unfold. To represent them, related
works opt for statistical pooling, which neglects the temporal structure.
Others opt for convolutional methods, such as CNN and Non-Local. While successful
in learning temporal concepts, they fall short of modeling minutes-long temporal
dependencies. We propose VideoGraph, a method that achieves the best of both
worlds: it represents minutes-long human activities and learns their underlying
temporal structure. VideoGraph learns a graph-based representation for human
activities. The graph, together with its nodes and edges, is learned entirely from video
datasets, making VideoGraph applicable to problems without node-level
annotation. The result is an improvement over related works on two benchmarks:
Epic-Kitchen and Breakfast. In addition, we demonstrate that VideoGraph is able to
learn the temporal structure of human activities in minutes-long videos.
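A minimal sketch of this kind of representation, assuming learned latent nodes and a temporal convolution over node activations (the names and layer choices are hypothetical, not the authors' implementation):

```python
# Minimal sketch (an assumption, not the authors' implementation) of a
# graph-like representation with learned latent nodes for long videos.
import torch
import torch.nn as nn

class LatentNodeGraph(nn.Module):
    def __init__(self, feat_dim=1024, num_nodes=32, num_classes=10):
        super().__init__()
        # Node embeddings are learned from data; no node-level annotation needed.
        self.nodes = nn.Parameter(torch.randn(num_nodes, feat_dim))
        # Temporal convolution models how node activations evolve over minutes.
        self.temporal = nn.Conv1d(num_nodes, num_nodes, kernel_size=7, padding=3)
        self.classifier = nn.Linear(num_nodes, num_classes)

    def forward(self, segment_feats):
        """segment_feats: (B, T, C) features of T video segments -> class logits."""
        # Soft similarity between each segment and each latent node.
        sim = torch.softmax(segment_feats @ self.nodes.t(), dim=-1)  # (B, T, N)
        evolved = torch.relu(self.temporal(sim.transpose(1, 2)))     # (B, N, T)
        return self.classifier(evolved.mean(dim=-1))                 # pool over time, classify
```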
VIBE: Video Inference for Human Body Pose and Shape Estimation
Human motion is fundamental to understanding behavior. Despite progress on
single-image 3D pose and shape estimation, existing video-based
state-of-the-art methods fail to produce accurate and natural motion sequences
due to a lack of ground-truth 3D motion data for training. To address this
problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE),
which makes use of an existing large-scale motion capture dataset (AMASS)
together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty
is an adversarial learning framework that leverages AMASS to discriminate
between real human motions and those produced by our temporal pose and shape
regression networks. We define a temporal network architecture and show that
adversarial training, at the sequence level, produces kinematically plausible
motion sequences without in-the-wild ground-truth 3D labels. We perform
extensive experimentation to analyze the importance of motion and demonstrate
the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving
state-of-the-art performance. Code and pretrained models are available at
https://github.com/mkocabas/VIBE.
Comment: CVPR-2020 camera ready. Code is available at https://github.com/mkocabas/VIBE
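The adversarial component can be sketched roughly as follows, assuming a GRU-based motion discriminator that separates AMASS mocap sequences from regressed pose sequences (an LSGAN-style loss is used here purely for illustration; this is not the released VIBE code):

```python
# Minimal sketch (assumed, not the released VIBE code) of a sequence-level
# motion discriminator trained adversarially against a pose regressor.
import torch
import torch.nn as nn

class MotionDiscriminator(nn.Module):
    """Scores whether a sequence of pose parameters looks like real motion."""
    def __init__(self, pose_dim=72, hidden_dim=1024):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, poses):                 # poses: (B, T, pose_dim)
        _, h = self.gru(poses)
        return self.score(h[-1])              # (B, 1) real/fake logit

def adversarial_losses(disc, real_mocap, fake_pred):
    """LSGAN-style losses: the discriminator separates mocap motion from
    regressed motion; the regressor is pushed toward plausible sequences."""
    d_real = disc(real_mocap)
    d_fake = disc(fake_pred.detach())
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    g_loss = ((disc(fake_pred) - 1) ** 2).mean()
    return d_loss, g_loss
```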