10 research outputs found
Action recognition with spatial-temporal discriminative filter banks
Action recognition has seen a dramatic performance improvement in the last
few years. Most of the current state-of-the-art literature either aims at
improving performance through changes to the backbone CNN, or explores
different trade-offs between computational efficiency and performance, again
through altering the backbone network. However, almost all of these works
maintain the same last layers of the network, which simply consist of a global
average pooling followed by a fully connected layer. In this work we focus on
how to improve the representation capacity of the network, but rather than
altering the backbone, we focus on improving the last layers of the network,
where changes have low impact in terms of computational cost. In particular, we
show that current architectures have poor sensitivity to finer details and we
exploit recent advances in the fine-grained recognition literature to improve
our model in this aspect. With the proposed approach, we obtain
state-of-the-art performance on Kinetics-400 and Something-Something-V1, the
two major large-scale action recognition benchmarks.
Comment: ICCV 2019 Accepted Paper
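For readers who want a concrete picture of what such a head replacement can look like, the following is a minimal PyTorch sketch of the general idea: a bank of 1x1x1 "discriminative" filters applied to the final feature map, whose max responses are fused with the usual global-average-pooling plus fully-connected branch. All layer names, sizes, and the fusion by summation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FilterBankHead(nn.Module):
    """Illustrative classification head: the standard GAP+FC branch plus a
    bank of 1x1x1 'discriminative filters' whose spatial-temporal max
    responses are added to the logits. Sizes are hypothetical."""
    def __init__(self, in_channels=2048, num_classes=400, filters_per_class=4):
        super().__init__()
        self.fc = nn.Linear(in_channels, num_classes)            # usual GAP + FC branch
        self.bank = nn.Conv3d(in_channels, num_classes * filters_per_class,
                              kernel_size=1)                     # filter bank over the feature map
        self.num_classes = num_classes
        self.k = filters_per_class

    def forward(self, feat):                        # feat: (B, C, T, H, W) backbone output
        gap_logits = self.fc(feat.mean(dim=(2, 3, 4)))
        resp = self.bank(feat)                      # (B, num_classes*k, T, H, W)
        resp = resp.flatten(2).amax(dim=2)          # max response of each filter over space-time
        bank_logits = resp.view(-1, self.num_classes, self.k).mean(dim=2)
        return gap_logits + bank_logits             # fuse the two branches

# usage on a dummy clip-level feature map
head = FilterBankHead()
logits = head(torch.randn(2, 2048, 4, 7, 7))        # -> (2, 400)
```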
Improving Skeleton-based Action Recognition with Robust Spatial and Temporal Features
Recently, skeleton-based action recognition has made significant progress in
the computer vision community. Most state-of-the-art algorithms are based on
Graph Convolutional Networks (GCN), and aim at improving the network structure
of the backbone GCN layers. In this paper, we propose a novel mechanism to
learn more robust discriminative features in space and time. More
specifically, we add a Discriminative Feature Learning (DFL) branch to the
last layers of the network to extract discriminative spatial and temporal
features to help regularize the learning. We also formally advocate the use of
Direction-Invariant Features (DIF) as input to the neural networks. We show
that action recognition accuracy can be improved when these robust features
are learned and used. We compare our results with those of ST-GCN and related
methods on four datasets: NTU-RGBD60, NTU-RGBD120, SYSU 3DHOI and
Skeleton-Kinetics.
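As a rough illustration of what an auxiliary discriminative-feature branch on the last layers of a skeleton GCN could look like, here is a hypothetical PyTorch sketch: the last GCN feature map is pooled along time and along joints separately, and each pooled feature gets its own classifier so that spatial and temporal cues are explicitly supervised. The pooling scheme, shapes, and loss weighting are assumptions, not the paper's DFL design.

```python
import torch
import torch.nn as nn

class DFLBranch(nn.Module):
    """Hypothetical auxiliary branch in the spirit of the DFL idea: pool the
    last GCN feature map over time (keeping joints) and over joints (keeping
    time), take the strongest response along the remaining axis, and attach a
    small classifier to each. Shapes and losses are illustrative."""
    def __init__(self, channels=256, num_classes=60):
        super().__init__()
        self.spatial_cls = nn.Linear(channels, num_classes)   # from time-pooled (per-joint) features
        self.temporal_cls = nn.Linear(channels, num_classes)  # from joint-pooled (per-frame) features

    def forward(self, feat):                        # feat: (B, C, T, V) last-layer GCN features
        spatial = feat.mean(dim=2).amax(dim=2)      # average over time, strongest joint -> (B, C)
        temporal = feat.mean(dim=3).amax(dim=2)     # average over joints, strongest frame -> (B, C)
        return self.spatial_cls(spatial), self.temporal_cls(temporal)

# auxiliary losses added to the main classification loss
feat = torch.randn(8, 256, 50, 25)                  # batch of skeleton feature maps (placeholder)
labels = torch.randint(0, 60, (8,))
s_logits, t_logits = DFLBranch()(feat)
aux_loss = nn.functional.cross_entropy(s_logits, labels) + \
           nn.functional.cross_entropy(t_logits, labels)
```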
Directional Temporal Modeling for Action Recognition
Many current activity recognition models use 3D convolutional neural networks
(e.g. I3D, I3D-NL) to generate local spatial-temporal features. However, such
features do not encode clip-level ordered temporal information. In this paper,
we introduce a channel independent directional convolution (CIDC) operation,
which learns to model the temporal evolution among local features. By applying
multiple CIDC units we construct a light-weight network that models the
clip-level temporal evolution across multiple spatial scales. Our CIDC network
can be attached to any activity recognition backbone network. We evaluate our
method on four popular activity recognition datasets and consistently improve
upon state-of-the-art techniques. We further visualize the activation map of
our CIDC network and show that it is able to focus on more meaningful, action
related parts of the frame.
Comment: ECCV 2020
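To make the notion of a channel-independent directional temporal convolution concrete, the following is a minimal sketch assuming a depthwise 1D convolution over time with causal (past-only) padding; the kernel size and the single forward direction are illustrative choices rather than the published CIDC unit.

```python
import torch
import torch.nn as nn

class DirectionalTemporalConv(nn.Module):
    """Rough sketch: a depthwise (per-channel) 1D convolution over time, made
    directional by left-only padding so each output position sees only past
    frames. Illustrative, not the paper's CIDC operation."""
    def __init__(self, channels=1024, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                          # left-pad only -> causal/directional
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              groups=channels)              # channel-independent (depthwise)

    def forward(self, x):                                   # x: (B, C, T) pooled clip features
        x = nn.functional.pad(x, (self.pad, 0))             # pad the past, not the future
        return self.conv(x)

# usage on spatially pooled backbone features
x = torch.randn(2, 1024, 16)                                # 16 frames
y = DirectionalTemporalConv()(x)                            # (2, 1024, 16), ordered temporal modeling
```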
t-EVA: Time-Efficient t-SNE Video Annotation
Video understanding has received more attention in the past few years due to
the availability of several large-scale video datasets. However, annotating
large-scale video datasets is cost-intensive. In this work, we propose a
time-efficient video annotation method using spatio-temporal feature similarity
and t-SNE dimensionality reduction to speed up the annotation process
massively. Placing the same actions from different videos near each other in
the two-dimensional space based on feature similarity helps the annotator to
group-label video clips. We evaluate our method on two subsets of the
ActivityNet (v1.3) and a subset of the Sports-1M dataset. We show that t-EVA
can outperform other video annotation tools while maintaining test accuracy on
video classification.
Comment: ICPR 2020 (HCAU)
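A minimal sketch of the annotation workflow described above, assuming clip-level spatio-temporal features have already been extracted by some backbone (random placeholders are used here): scikit-learn's t-SNE embeds the clips into two dimensions, and the resulting scatter plot is what an annotator would use to group-label nearby clips.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# One feature vector per video clip; in practice these would come from a
# pretrained spatio-temporal backbone, here they are random stand-ins.
features = np.random.rand(500, 2048)

# Embed clips into 2D so that clips with similar features land near each other.
embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.title("Clips embedded by feature similarity (select clusters to group-label)")
plt.show()
```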
Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition
Video processing has become a popular research direction in computer vision
due to its various applications such as video summarization, action
recognition, etc. Recently, deep learning-based methods have achieved
impressive results in action recognition. However, these methods need to
process a full video sequence to recognize the action, even though most of
these frames are similar and non-essential to recognizing a particular action.
Additionally, these non-essential frames increase the computational cost and
can confuse a method during action recognition. In contrast, the important
frames, called keyframes, are not only helpful for recognizing an action but
can also reduce the processing time of each video sequence for classification
or for other applications, e.g. summarization. Moreover, current methods in
video processing have not yet been demonstrated in an online fashion.
Motivated by the above, we propose an online learnable module for keyframe
extraction. This module can be used to select key-shots in video and thus can
be applied to video summarization. The extracted keyframes can be used as input
to any deep learning-based classification model to recognize action. We also
propose a plugin module to use the semantic word vector as input along with
keyframes and a novel train/test strategy for the classification models. To
the best of our knowledge, this is the first time such an online module and
train/test strategy have been proposed.
Experimental results on several commonly used datasets in video summarization
and action recognition demonstrate the effectiveness of the proposed module.
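To illustrate the kind of online selection the abstract refers to, here is a toy, non-learnable keyframe selector: a frame is kept whenever its feature is sufficiently dissimilar from the most recently kept keyframe, so decisions are made as frames stream in. The threshold, cosine distance, and per-frame features are all assumptions; the paper's module is learnable.

```python
import torch
import torch.nn.functional as F

def select_keyframes_online(frame_features, threshold=0.3):
    """Toy online keyframe selector (not the paper's learnable module): keep a
    frame when its cosine distance to the most recently kept keyframe exceeds
    a threshold, so frames can be filtered as they arrive."""
    keyframes = [0]                                       # always keep the first frame
    last = frame_features[0]
    for t in range(1, frame_features.size(0)):
        dist = 1.0 - F.cosine_similarity(frame_features[t], last, dim=0)
        if dist > threshold:
            keyframes.append(t)
            last = frame_features[t]
    return keyframes

feats = torch.randn(120, 512)                             # per-frame features (placeholder)
print(select_keyframes_online(feats))                     # indices of selected keyframes
```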
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention
Attentive video modeling is essential for action recognition in unconstrained
videos due to their rich yet redundant information over space and time.
However, introducing attention in a deep neural network for action recognition
is challenging for two reasons. First, an effective attention module needs to
learn what (objects and their local motion patterns), where (spatially), and
when (temporally) to focus on. Second, a video attention module must be
efficient because existing action recognition models already suffer from high
computational cost. To address both challenges, a novel What-Where-When (W3)
video attention module is proposed. Departing from existing alternatives, our
W3 module models all three facets of video attention jointly. Crucially, it is
extremely efficient by factorizing the high-dimensional video feature data into
low-dimensional meaningful spaces (1D channel vector for `what' and 2D spatial
tensors for `where'), followed by lightweight temporal attention reasoning.
Extensive experiments show that our attention model brings significant
improvements to existing action recognition models, achieving new
state-of-the-art performance on a number of benchmarks.
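The following is an illustrative PyTorch sketch of factorized video attention in the spirit of the what/where/when decomposition: a 1D channel gate per frame, a 2D spatial gate per frame, and a lightweight gate over frames. The specific layers, reduction ratio, and sigmoid gating are assumptions rather than the published W3 module.

```python
import torch
import torch.nn as nn

class FactorizedVideoAttention(nn.Module):
    """Illustrative factorized video attention: per-frame channel attention
    ('what'), per-frame spatial attention ('where'), then a lightweight gate
    over frames ('when'). The exact layers are assumptions."""
    def __init__(self, channels=1024):
        super().__init__()
        self.channel_gate = nn.Sequential(nn.Linear(channels, channels // 16),
                                          nn.ReLU(),
                                          nn.Linear(channels // 16, channels),
                                          nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                          nn.Sigmoid())
        self.temporal_gate = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, x):                               # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)
        what = self.channel_gate(frames.mean(dim=(2, 3)))      # (B*T, C) channel weights
        frames = frames * what.view(b * t, c, 1, 1)
        where = self.spatial_gate(frames)                      # (B*T, 1, H, W) spatial weights
        frames = frames * where
        clip = frames.view(b, t, c, h, w)
        when = self.temporal_gate(clip.mean(dim=(3, 4)))       # (B, T, 1) frame importance
        return clip * when.view(b, t, 1, 1, 1)

# usage: same shape in, attended features out
out = FactorizedVideoAttention()(torch.randn(2, 8, 1024, 14, 14))
```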
Gate-Shift Networks for Video Action Recognition
Deep 3D CNNs for video action recognition are designed to learn powerful
representations in the joint spatio-temporal feature space. In practice,
however, because of the large number of parameters and computations involved,
they may under-perform when sufficiently large datasets are not available for training
them at scale. In this paper we introduce spatial gating in spatial-temporal
decomposition of 3D kernels. We implement this concept with Gate-Shift Module
(GSM). GSM is lightweight and turns a 2D-CNN into a highly efficient
spatio-temporal feature extractor. With GSM plugged in, a 2D-CNN learns to
adaptively route features through time and combine them, at almost no
additional parameters and computational overhead. We perform an extensive
evaluation of the proposed module to study its effectiveness in video action
recognition, achieving state-of-the-art results on Something-Something-V1 and
Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far
less model complexity.
Comment: CVPR20 camera-ready version. Code and models available at
https://github.com/swathikirans/GS
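As a very simplified sketch of the gate-shift idea (not the published GSM), the code below learns a spatial gate, temporally shifts the gated part of the features (half of the channels forward, half backward in time), and adds back the ungated residual, giving a 2D block some temporal mixing at negligible parameter cost.

```python
import torch
import torch.nn as nn

class GateShiftSketch(nn.Module):
    """Very simplified gate-shift sketch: a learned spatial gate splits
    features into a gated part that is shifted along time and a residual part
    that stays in place. Gate design and grouping differ from the published
    GSM."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1,
                                            groups=channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        g = self.gate(x.reshape(b * t, c, h, w)).view(b, t, c, h, w)
        gated = x * g
        shifted = torch.zeros_like(gated)
        half = c // 2
        shifted[:, 1:, :half] = gated[:, :-1, :half]      # shift first half forward in time
        shifted[:, :-1, half:] = gated[:, 1:, half:]      # shift second half backward in time
        return x * (1 - g) + shifted                       # combine residual and shifted features

out = GateShiftSketch()(torch.randn(2, 8, 256, 14, 14))    # same shape, temporally mixed
```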
Training Interpretable Convolutional Neural Networks by Differentiating Class-specific Filters
Convolutional neural networks (CNNs) have been successfully used in a range
of tasks. However, CNNs are often viewed as "black boxes" and lack
interpretability. One main reason is filter-class entanglement -- an
intricate many-to-many correspondence between filters and classes. Most
existing works attempt post-hoc interpretation on a pre-trained model, while
neglecting to reduce the entanglement underlying the model. In contrast, we
focus on alleviating filter-class entanglement during training. Inspired by
cellular differentiation, we propose a novel strategy to train interpretable
CNNs by encouraging class-specific filters, each of which responds to only
one (or a few) classes. Concretely, we design a learnable sparse
Class-Specific Gate (CSG) structure to assign each filter to one (or a few)
classes in a flexible way. The gate allows a filter's activation to pass only
when the input samples come from the specific class. Extensive experiments
demonstrate the strong performance of our method in generating a sparse and
highly class-related representation of the input, which leads to stronger
interpretability. Moreover, compared with the standard training strategy, our
model shows benefits in applications such as object localization and
adversarial sample detection. Code link: https://github.com/hyliang96/CSGCNN
Comment: European Conference on Computer Vision (ECCV), 2020
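The code below is an illustrative gate in the spirit of CSG, assuming a learnable class-by-filter matrix squashed to (0, 1): during training, a sample's pooled filter activations are multiplied by its class's row, and a simple sparsity term is returned for the loss. The actual training schedule and regularization weights of the method are not reproduced here.

```python
import torch
import torch.nn as nn

class ClassSpecificGate(nn.Module):
    """Illustrative class-specific gate: one learnable row per class, one
    column per filter. A sample's pooled filter activations are masked by its
    class's row so a filter only 'passes' for the classes assigned to it.
    The sparsity term is a placeholder, not the paper's exact loss."""
    def __init__(self, num_filters=512, num_classes=10):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(num_classes, num_filters))

    def forward(self, pooled, labels):             # pooled: (B, F) GAP'd filter activations
        gates = torch.sigmoid(self.gate)           # squash to (0, 1)
        gated = pooled * gates[labels]             # (B, F) gated by each sample's class row
        return gated, gates.mean()                 # gated features, sparsity term for the loss

pooled = torch.randn(4, 512)                       # pooled activations of the last conv layer
labels = torch.randint(0, 10, (4,))
gated, sparsity = ClassSpecificGate()(pooled, labels)
# gated features feed the classifier; sparsity is added to the loss with some weight
```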
Universal-to-Specific Framework for Complex Action Recognition
Video-based action recognition has recently attracted much attention in the
field of computer vision. To solve more complex recognition tasks, it has
become necessary to distinguish different levels of interclass variations.
Inspired by a common flowchart based on the human decision-making process that
first narrows down the probable classes and then applies a "rethinking" process
for finer-level recognition, we propose an effective universal-to-specific
(U2S) framework for complex action recognition. The U2S framework is composed
of three subnetworks: a universal network, a category-specific network, and a
mask network. The universal network first learns universal feature
representations. The mask network then generates attention masks for confusing
classes through category regularization based on the output of the universal
network. The mask is further used to guide the category-specific network for
class-specific feature representations. The entire framework is optimized in an
end-to-end manner. Experiments on a variety of benchmark datasets, e.g., the
Something-Something, UCF101, and HMDB51 datasets, demonstrate the effectiveness
of the U2S framework; i.e., U2S can focus on discriminative spatiotemporal
regions for confusing categories. We further visualize the relationship between
different classes, showing that U2S indeed improves the discriminability of
learned features. Moreover, the proposed U2S model is a general framework and
may adopt any base recognition network.
Comment: 13 pages, 8 figures
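To show how the three subnetworks could be composed, here is a schematic forward pass with tiny placeholder networks standing in for the real backbones; masking by element-wise multiplication and fusing the two logits by averaging are assumptions, not necessarily the paper's choices.

```python
import torch
import torch.nn as nn

class U2SSketch(nn.Module):
    """Schematic universal-to-specific composition with toy layers: the
    universal network produces features and coarse logits, the mask network
    turns the features into a spatial attention mask, and the category-specific
    network classifies the masked features. Fusion by averaging is an
    assumption."""
    def __init__(self, channels=64, num_classes=174):
        super().__init__()
        self.universal = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.universal_head = nn.Linear(channels, num_classes)
        self.mask_net = nn.Sequential(nn.Conv3d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.specific = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.specific_head = nn.Linear(channels, num_classes)

    def forward(self, clip):                                # clip: (B, 3, T, H, W)
        feat = torch.relu(self.universal(clip))
        coarse = self.universal_head(feat.mean(dim=(2, 3, 4)))   # universal prediction
        mask = self.mask_net(feat)                                # attention on confusing regions
        refined = torch.relu(self.specific(feat * mask))          # class-specific features
        fine = self.specific_head(refined.mean(dim=(2, 3, 4)))
        return (coarse + fine) / 2

logits = U2SSketch()(torch.randn(2, 3, 8, 56, 56))          # -> (2, 174)
```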
Recent Progress in Appearance-based Action Recognition
Action recognition, which is formulated as a task to identify various human
actions in a video, has attracted increasing interest from computer vision
researchers due to its importance in various applications. Recently,
appearance-based methods have achieved promising progress towards accurate
action recognition. In general, these methods mainly fulfill the task by
applying various schemes to model spatial and temporal visual information
effectively. To better understand the current progress of appearance-based
action recognition, we provide a comprehensive review of recent achievements in
this area. In particular, we summarise and discuss several dozen related
research papers, which can be roughly divided into four categories according to
different appearance modelling strategies. The resulting categories are 2D
convolutional methods, 3D convolutional methods, motion representation-based
methods, and context representation-based methods. We comprehensively analyse
and discuss representative methods from each category. Empirical results
are also summarised to better illustrate cutting-edge algorithms. We conclude
by identifying important areas for future research gleaned from our
categorisation.