Action Capsules: Human Skeleton Action Recognition
Due to the compact and rich high-level representations offered,
skeleton-based human action recognition has recently become a highly active
research topic. Previous studies have demonstrated that investigating joint
relationships in spatial and temporal dimensions provides effective information
critical to action recognition. However, effectively encoding global
dependencies of joints during spatio-temporal feature extraction is still
challenging. In this paper, we introduce the Action Capsule, which identifies
action-related key joints by considering the latent correlation of joints in a
skeleton sequence. We show that, during inference, our end-to-end network pays
attention to a set of joints specific to each action, whose encoded
spatio-temporal features are aggregated to recognize the action. Additionally,
the use of multiple stages of action capsules enhances the ability of the
network to classify similar actions. Consequently, our network outperforms the
state-of-the-art approaches on the N-UCLA dataset and obtains competitive
results on the NTURGBD dataset. Our approach also has significantly lower
computational requirements, as measured in GFLOPs.
Comment: 11 pages, 11 figures
Spatio-Temporal Modeling for Action Recognition in Videos
Technological innovation in the field of video action recognition drives the development of video-based real-world applications. This PhD thesis provides a new set of machine learning algorithms for processing videos efficiently, leading to outstanding results in human action recognition in videos. First, two video representation extraction methods, Temporal Squeezed Pooling (TSP) and Pixel-Wise Temporal Projection (PWTP), are proposed to enhance the discriminative video feature learning abilities of Deep Neural Networks (DNNs). TSP enables spatio-temporal modeling by temporally aggregating the information from long video frame sequences. PWTP is an improved version of TSP that filters out static appearance while performing information aggregation. Secondly, we discuss how to address the long-term dependency modeling problem of video DNNs. To this end, we develop two spatio-temporal attention mechanisms, Region-based Non-local (RNL) and Convolution Pyramid Attention (CPA). We devise an attention chain by connecting the RNL or CPA module to the Squeeze-Excitation (SE) operation. We demonstrate how the attention mechanisms can be embedded into deep networks to alleviate the optimization difficulty.
Finally, we tackle the problem of heavy computational cost in video models. To this end, we introduce the concept of busy-quiet video disentangling for exceedingly fast video modeling. We propose the Motion Band-Pass Module (MBPM), embedded into the Busy-Quiet Net (BQN) architecture, to reduce videos’ information redundancy in the spatial and temporal dimensions. The BQN architecture is extremely lightweight while still performing better than other heavier models. Extensive experiments for all the proposed methods are provided on multiple video benchmarks, including UCF101, HMDB51, Kinetics400, Something-Something V1
UntrimmedNets for Weakly Supervised Action Recognition and Detection
Current action recognition methods heavily rely on trimmed videos for model
training. However, it is expensive and time-consuming to acquire a large-scale
trimmed video dataset. This paper presents a new weakly supervised
architecture, called UntrimmedNet, which is able to directly learn action
recognition models from untrimmed videos without the requirement of temporal
annotations of action instances. Our UntrimmedNet couples two important
components, the classification module and the selection module, to learn the
action models and reason about the temporal duration of action instances,
respectively. These two components are implemented with feed-forward networks,
and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit
the learned models for action recognition (WSR) and detection (WSD) on the
untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet
only employs weak supervision, our method achieves performance superior or
comparable to that of strongly supervised approaches on these two datasets.
Comment: camera-ready version to appear in CVPR201
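The coupling of the classification and selection modules described above can be illustrated with a minimal sketch. It assumes that the selection module produces one importance score per clip and that these scores are turned into softmax weights over clips; the abstract does not specify this mechanism, so the function name `video_scores` and both inputs are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def video_scores(clip_class_scores, clip_selection_scores):
    """Aggregate per-clip class scores into video-level scores.

    clip_class_scores: one list of class scores per clip
        (stand-in for the classification module's output).
    clip_selection_scores: one importance score per clip
        (stand-in for the selection module's output; an assumption).
    """
    # Assumption: clips are soft-selected with softmax weights.
    weights = softmax(clip_selection_scores)
    n_classes = len(clip_class_scores[0])
    return [
        sum(w * clip[c] for w, clip in zip(weights, clip_class_scores))
        for c in range(n_classes)
    ]


# Two clips, two classes: the first clip favors class 0, the second class 1.
clips = [[2.0, 0.0], [0.0, 2.0]]
balanced = video_scores(clips, [0.0, 0.0])   # both clips weighted equally
focused = video_scores(clips, [10.0, -10.0])  # first clip dominates
```

Because only a video-level label is compared with the aggregated scores, a scheme of this kind needs no temporal annotation, and the per-clip weights give a rough indication of where the action occurs.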
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in video. Several attempts have
been made to capture frame-level salient aspects through attention, but they
lack the capacity to effectively map the temporal relationships between
frames, as they only capture a limited span of temporal dependencies. To this
end we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales, to form
embeddings at frame level and segment level, and perform fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech
Egocentric datasets, and achieve state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings, including static overhead
camera views and dynamic, ego-centric head-mounted camera views,
demonstrating the direct applicability of the proposed framework in a
variety of settings.
Comment: Published in Pattern Recognition Letters
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
Recently, substantial research effort has focused on how to apply CNNs or
RNNs to better extract temporal patterns from videos, so as to improve the
accuracy of video classification. In this paper, however, we show that temporal
information, especially longer-term patterns, may not be necessary to achieve
competitive results on common video classification datasets. We investigate the
potential of a purely attention based local feature integration. Accounting for
the characteristics of such features in video classification, we propose a
local feature integration framework based on attention clusters, and introduce
a shifting operation to capture more diverse signals. We carefully analyze and
compare the effect of different attention mechanisms, cluster sizes, and the
use of the shifting operation, and also investigate the combination of
attention clusters for multimodal integration. We demonstrate the effectiveness
of our framework on three real-world video classification datasets. Our model
achieves competitive results across all of these. In particular, on the
large-scale Kinetics dataset, our framework obtains a single-model top-1
accuracy of 79.4% and top-5 accuracy of 94.0% on the validation set. The
attention clusters are the backbone of our winning solution at the
ActivityNet Kinetics Challenge 2017. Code and models will be released soon.
Comment: The backbone of the winner solution at ActivityNet Kinetics Challenge 201
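The idea of integrating local features purely by attention, without modeling their temporal order, can be sketched as follows. This is a minimal illustration under stated assumptions: each attention unit scores a local feature with a dot product against a scoring vector, and the cluster concatenates the outputs of its units. The names `attention_unit`, `attention_cluster`, and `scoring_vectors` are hypothetical, and the shifting operation mentioned in the abstract is left out because its form is not given there.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_unit(features, scoring_vector):
    """Weighted sum of local features with softmax attention weights.

    features: list of local feature vectors, all of the same length.
    scoring_vector: hypothetical learned parameters of one attention unit.
    """
    scores = [
        sum(w * x for w, x in zip(scoring_vector, feature))
        for feature in features
    ]
    weights = softmax(scores)
    dim = len(features[0])
    return [
        sum(a * feature[d] for a, feature in zip(weights, features))
        for d in range(dim)
    ]


def attention_cluster(features, scoring_vectors):
    """Concatenate the outputs of several independent attention units."""
    output = []
    for scoring_vector in scoring_vectors:
        output.extend(attention_unit(features, scoring_vector))
    return output


# Three local features of dimension 2, and a cluster of two attention units.
local_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
units = [[0.0, 0.0], [1.0, -1.0]]
pooled = attention_cluster(local_features, units)
pooled_reversed = attention_cluster(list(reversed(local_features)), units)
```

Since each unit is a weighted sum, reordering the local features leaves the result unchanged up to floating-point rounding, which matches the abstract's point that longer-term temporal patterns may not be needed for competitive results.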
DeepSignals: Predicting Intent of Drivers Through Visual Signals
Detecting the intention of drivers is an essential task in self-driving,
necessary to anticipate sudden events like lane changes and stops. Turn signals
and emergency flashers communicate such intentions, providing seconds of
potentially critical reaction time. In this paper, we propose to detect these
signals in video sequences by using a deep neural network that reasons about
both spatial and temporal information. Our experiments on more than a million
frames show high per-frame accuracy in very challenging scenarios.
Comment: To be presented at the IEEE International Conference on Robotics and Automation (ICRA), 201