Slow Feature Analysis for Human Action Recognition
Slow Feature Analysis (SFA) extracts slowly varying features from a quickly
varying input signal. It has been successfully applied to modeling the visual
receptive fields of cortical neurons, and ample experimental evidence in
neuroscience suggests that the temporal slowness principle is a general learning
principle in visual perception. In this paper, we introduce the SFA framework
to the problem of human action recognition by incorporating the discriminative
information with SFA learning and considering the spatial relationship of body
parts. In particular, we consider four kinds of SFA learning strategies,
including the original unsupervised SFA (U-SFA), the supervised SFA (S-SFA),
the discriminative SFA (D-SFA), and the spatial discriminative SFA (SD-SFA), to
extract slow feature functions from a large number of training cuboids obtained
by random sampling within motion boundaries. Afterward, to represent action
sequences, the squared first-order temporal derivatives are accumulated over
all transformed cuboids into one feature vector, termed the Accumulated Squared
Derivative (ASD) feature. The ASD feature encodes the
statistical distribution of slow features in an action sequence. Finally, a
linear support vector machine (SVM) is trained to classify actions represented
by ASD features. We conduct extensive experiments, including two sets of
control experiments, two sets of large-scale experiments on the KTH and
Weizmann databases, and two sets of experiments on the CASIA and UT-interaction
databases, to demonstrate the effectiveness of SFA for human action
recognition.
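A minimal sketch of how the ASD representation could be computed, assuming the learned slow feature functions are a single linear projection matrix (in practice SFA is usually applied after a nonlinear expansion); the names, shapes, and the usage comment below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def asd_feature(cuboids, sfa_weights):
    """Accumulated Squared Derivative (ASD) feature, minimal sketch.

    cuboids:     list of arrays of shape (T, D), each a vectorized space-time
                 cuboid sampled from one action sequence.
    sfa_weights: (D, K) projection learned by SFA (assumed linear here).
    Returns a K-dim vector accumulating the squared first-order temporal
    derivatives of the slow features over all cuboids of the sequence.
    """
    asd = np.zeros(sfa_weights.shape[1])
    for cuboid in cuboids:
        slow = cuboid @ sfa_weights       # (T, K) slow feature outputs
        deriv = np.diff(slow, axis=0)     # first-order temporal derivatives
        asd += (deriv ** 2).sum(axis=0)   # accumulate squared derivatives
    return asd

# Hypothetical usage with a linear SVM on per-sequence ASD vectors:
#   from sklearn.svm import LinearSVC
#   X = np.stack([asd_feature(c, W) for c in cuboids_per_sequence])
#   clf = LinearSVC().fit(X, labels)
```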
Feature sampling and partitioning for visual vocabulary generation on large action classification datasets
The recent trend in action recognition is towards larger datasets, an
increasing number of action classes and larger visual vocabularies.
State-of-the-art human action classification in challenging video data is
currently based on a bag-of-visual-words pipeline in which space-time features
are aggregated globally to form a histogram. The strategies chosen to sample
features and construct a visual vocabulary are critical to, and often dominate,
performance. In this work, we provide a critical evaluation of various
approaches to building a vocabulary and show that good practices have a
significant impact. By subsampling and partitioning features strategically, we
achieve state-of-the-art results on 5 major action recognition datasets using
relatively small visual vocabularies.
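As a rough illustration of the vocabulary-building step, here is a generic sketch of subsampling local space-time descriptors, clustering them into visual words with k-means, and encoding a video as a normalized histogram; the sampling budget, vocabulary size, and function names are assumptions, not the configurations evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors, vocab_size=256, sample_size=100_000, seed=0):
    """Subsample descriptors (array of shape (N, D)) and cluster them
    into a visual vocabulary; a generic BoVW step, not the paper's recipe."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors),
                     size=min(sample_size, len(descriptors)), replace=False)
    kmeans = MiniBatchKMeans(n_clusters=vocab_size, random_state=seed)
    kmeans.fit(descriptors[idx])
    return kmeans

def encode_video(kmeans, video_descriptors):
    """Assign each descriptor to its nearest visual word and aggregate the
    assignments into a normalized histogram."""
    words = kmeans.predict(video_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```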
Unsupervised Representation Learning by Sorting Sequences
We present an unsupervised representation learning approach using videos
without semantic labels. We leverage the temporal coherence as a supervisory
signal by formulating representation learning as a sequence sorting task. We
take temporally shuffled frames (i.e., in non-chronological order) as inputs
and train a convolutional neural network to sort the shuffled sequences.
Similar to comparison-based sorting algorithms, we propose to extract features
from all frame pairs and aggregate them to predict the correct order. As
sorting a shuffled image sequence requires an understanding of the statistical
temporal structure of images, training with such a proxy task allows us to
learn rich and generalizable visual representations. We validate the
effectiveness of the learned representation using our method as pre-training on
high-level recognition problems. The experimental results show that our method
compares favorably against state-of-the-art methods on action recognition,
image classification and object detection tasks.
Comment: ICCV 2017. Project page: http://vllab1.ucmerced.edu/~hylee/OPN
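A condensed sketch of the sequence-sorting idea: per-frame features are extracted by a shared backbone, combined for every frame pair, and the aggregated pairwise features are classified into one of the possible orderings. The backbone interface, layer sizes, and the use of all n! permutations as classes are simplifying assumptions rather than the published architecture.

```python
import math
import torch
import torch.nn as nn

class OrderPredictionNet(nn.Module):
    """Minimal sketch of the sequence-sorting proxy task."""
    def __init__(self, backbone, feat_dim, n_frames=4):
        super().__init__()
        self.backbone = backbone                      # per-frame CNN encoder,
                                                      # assumed to output (B*N, feat_dim)
        self.pairwise = nn.Linear(2 * feat_dim, 512)  # pair-level features
        n_pairs = n_frames * (n_frames - 1) // 2
        self.classifier = nn.Linear(n_pairs * 512, math.factorial(n_frames))

    def forward(self, frames):                        # frames: (B, N, C, H, W)
        b, n = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, n, -1)
        pair_feats = []
        for i in range(n):                            # features from all frame pairs
            for j in range(i + 1, n):
                pair = torch.cat([feats[:, i], feats[:, j]], dim=-1)
                pair_feats.append(torch.relu(self.pairwise(pair)))
        # aggregate pairwise features and predict the permutation class
        return self.classifier(torch.cat(pair_feats, dim=-1))
```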
Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition
Research in human action recognition has accelerated significantly since the
introduction of powerful machine learning tools such as Convolutional Neural
Networks (CNNs). However, effective and efficient methods for incorporation of
temporal information into CNNs are still being actively explored in the recent
literature. Motivated by the popular recurrent attention models in the research
area of natural language processing, we propose the Attention-based Temporal
Weighted CNN (ATW), which embeds a visual attention model into a temporal
weighted multi-stream CNN. This attention model is implemented simply as
temporal weighting, yet it effectively boosts the recognition performance of
video representations. Moreover, each stream in the proposed ATW framework is
capable of end-to-end training, with both network parameters and temporal
weights optimized by stochastic gradient descent (SGD) with backpropagation.
Our experiments show that the proposed attention mechanism contributes
substantially to the performance gains by focusing on the more discriminative
snippets and more relevant video segments.
Comment: 14th International Conference on Artificial Intelligence Applications
and Innovations (AIAI 2018), May 25-27, 2018, Rhodes, Greece
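A minimal sketch of attention realized as temporal weighting: one learnable weight per snippet, softmax-normalized and used to fuse snippet-level predictions, trainable end to end by SGD alongside the stream networks. The shapes and names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalWeighting(nn.Module):
    """Attention as temporal weighting over a fixed number of snippets."""
    def __init__(self, n_snippets):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_snippets))  # temporal weights

    def forward(self, snippet_scores):          # (B, N, num_classes)
        w = torch.softmax(self.logits, dim=0)   # (N,) attention over snippets
        # weighted fusion of snippet-level predictions into a video-level score
        return (snippet_scores * w.view(1, -1, 1)).sum(dim=1)
```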
DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks
We propose an action recognition framework using Generative Adversarial
Networks. Our model involves training a deep convolutional generative
adversarial network (DCGAN) using a large video activity dataset without label
information. Then we use the trained discriminator from the GAN model as an
unsupervised pre-training step and fine-tune the trained discriminator model on
a labeled dataset to recognize human activities. We determine good network
architectural and hyperparameter settings for using the discriminator from
DCGAN as a trained model to learn useful representations for action
recognition. Our semi-supervised framework using only appearance information
achieves superior or comparable performance to the current state-of-the-art
semi-supervised action recognition methods on two challenging video activity
datasets: UCF101 and HMDB51.
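A rough sketch of the fine-tuning step, assuming the trained DCGAN discriminator exposes its convolutional trunk as a `features` module: the real/fake head is discarded and a new classification head is trained (optionally with the trunk frozen) on labeled action data. All names here are hypothetical, not the authors' code.

```python
import torch.nn as nn

def build_action_classifier(trained_discriminator, feat_dim, num_classes,
                            freeze_features=False):
    """Reuse the discriminator's convolutional trunk as a pre-trained
    feature extractor and attach a new head for action labels (sketch)."""
    features = trained_discriminator.features   # conv layers, real/fake head dropped
    if freeze_features:
        for p in features.parameters():
            p.requires_grad = False              # linear-probe style fine-tuning
    return nn.Sequential(
        features,
        nn.Flatten(),
        nn.Linear(feat_dim, num_classes),        # new classification head
    )
```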
Skeleton-based Action Recognition of People Handling Objects
In visual surveillance systems, it is necessary to recognize the behavior of
people handling objects such as a phone, a cup, or a plastic bag. In this
paper, to address this problem, we propose a new framework for recognizing
object-related human actions by graph convolutional networks using human and
object poses. In this framework, we construct skeletal graphs of reliable human
poses by selectively sampling the informative frames in a video, which include
human joints with high confidence scores obtained in pose estimation. The
skeletal graphs generated from the sampled frames represent human poses related
to the object position in both the spatial and temporal domains, and these
graphs are used as inputs to the graph convolutional networks. Through
experiments over an open benchmark and our own data sets, we verify the
validity of our framework in that our method outperforms the state-of-the-art
method for skeleton-based action recognition.
Comment: Accepted in WACV 201
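A small sketch of how such an object-aware skeletal graph might be assembled: frames with reliable poses are kept, and the joint adjacency is extended with an extra node for the detected object. The confidence thresholding and the joint-to-object edges are simplifying assumptions, not the paper's exact construction.

```python
import numpy as np

def build_skeletal_graph(joints, joint_conf, obj_center, bone_pairs,
                         conf_thresh=0.5):
    """joints: (T, J, 2) joint coordinates, joint_conf: (T, J) pose scores,
    obj_center: (T, 2) detected object position, bone_pairs: skeleton edges."""
    keep = joint_conf.mean(axis=1) >= conf_thresh            # informative frames only
    # node sequence: J human joints plus one object node per kept frame
    nodes = np.concatenate([joints[keep], obj_center[keep, None]], axis=1)

    J = joints.shape[1]
    adj = np.zeros((J + 1, J + 1))
    for i, j in bone_pairs:                                  # skeleton bones
        adj[i, j] = adj[j, i] = 1
    adj[:J, J] = adj[J, :J] = 1                              # joint-object edges
    return nodes, adj                                        # inputs to the GCN
```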
REPAIR: Removing Representation Bias by Dataset Resampling
Modern machine learning datasets can have biases for certain representations
that are leveraged by algorithms to achieve high performance without learning
to solve the underlying task. This problem is referred to as "representation
bias". The question of how to reduce the representation biases of a dataset is
investigated and a new dataset REPresentAtion bIas Removal (REPAIR) procedure
is proposed. This formulates bias minimization as an optimization problem,
seeking a weight distribution that penalizes examples easy for a classifier
built on a given feature representation. Bias reduction is then equated to
maximizing the ratio between the classification loss on the reweighted dataset
and the uncertainty of the ground-truth class labels. This is a minimax problem
that REPAIR solves by alternately updating classifier parameters and dataset
resampling weights using stochastic gradient descent. An experimental set-up
is also introduced to measure the bias of any dataset for a given
representation, and the impact of this bias on the performance of recognition
models. Experiments with synthetic and action recognition data show that
dataset REPAIR can significantly reduce representation bias, and lead to
improved generalization of models trained on REPAIRed datasets. The tools used
for characterizing representation bias, and the proposed dataset REPAIR
algorithm, are available at https://github.com/JerryYLi/Dataset-REPAIR/.
Comment: To appear in CVPR 201
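A compact sketch of the alternating optimization, with the entropy normalization of the published objective omitted for brevity: a linear classifier on the fixed representation minimizes the weighted loss while per-example weights ascend on it, so examples the representation handles easily get down-weighted. All names and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repair_reweight(features, labels, num_classes, steps=1000, lr=0.01):
    """features: (N, D) float tensor of a fixed representation,
    labels: (N,) long tensor. Returns per-example resampling weights."""
    clf = nn.Linear(features.shape[1], num_classes)
    w_logits = torch.zeros(len(features), requires_grad=True)  # example weights
    opt_clf = torch.optim.SGD(clf.parameters(), lr=lr)
    opt_w = torch.optim.SGD([w_logits], lr=lr)

    for _ in range(steps):
        weights = torch.sigmoid(w_logits)
        loss = (weights * F.cross_entropy(clf(features), labels,
                                          reduction="none")).mean()
        opt_clf.zero_grad(); opt_w.zero_grad()
        loss.backward()
        opt_clf.step()                       # classifier descends (minimizes loss)
        w_logits.grad.neg_()                 # flip gradient so weights ascend
        opt_w.step()                         # weights maximize the weighted loss
    return torch.sigmoid(w_logits).detach()  # weights for dataset resampling
```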
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Current video representations heavily rely on learning from manually
annotated video datasets which are time-consuming and expensive to acquire. We
observe videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge amount of pair instance classes and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video dataset
(Instagram-300k) to demonstrate their effectiveness. Without further fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
https://github.com/MCG-NJU/CPD-Video.
Comment: Technical Report
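A minimal sketch of cross-modal pair discrimination, using in-batch negatives as an InfoNCE-style stand-in for the noise-contrastive estimation described above; the temperature value and the symmetric two-way loss are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) embeddings of matched video-text pairs.
    Each video should score highest with its own text, and vice versa."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature               # (B, B) pair similarities
    targets = torch.arange(len(v), device=v.device)
    # symmetric cross-entropy: video-to-text and text-to-video discrimination
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```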
A novel learning-based frame pooling method for Event Detection
Detecting complex events in a large video collection crawled from video
websites is a challenging task. When directly applying good image-based feature
representations, e.g., HOG or SIFT, to videos, we face the problem of how to
pool multiple frame-level feature representations into a single representation.
In this paper, we propose a novel learning-based frame pooling method. We
formulate the pooling weight learning as an optimization problem and thus our
method can automatically learn the best pooling weight configuration for each
specific event category. Experimental results on TRECVID MED 2011 reveal that
our method outperforms the commonly used average pooling and max pooling
strategies on both high-level and low-level 2D image features.
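As an illustration of learnable frame pooling, the sketch below learns one weight per rank position over sorted frame-level feature values, so the layer can move between max pooling (all weight on the top rank) and average pooling (uniform weights); this parameterization and the fixed frame count are deliberately simple assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LearnedFramePooling(nn.Module):
    """Learnable pooling over frame-level features, one weight per rank."""
    def __init__(self, n_frames):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_frames))     # one weight per rank

    def forward(self, frame_features):                        # (B, T, D), fixed T
        w = torch.softmax(self.logits, dim=0)                 # (T,) pooling weights
        ranked, _ = frame_features.sort(dim=1, descending=True)
        # weighted sum over rank-ordered values gives the pooled video feature
        return (ranked * w.view(1, -1, 1)).sum(dim=1)         # (B, D)
```

The pooling weights can be optimized per event category, e.g. jointly with a linear classifier, so each event learns its own compromise between average and max pooling.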
Temporal Pyramid Pooling Based Convolutional Neural Networks for Action Recognition
Encouraged by the success of Convolutional Neural Networks (CNNs) in image
classification, much recent effort has been devoted to applying CNNs to
video-based action recognition problems. One challenge is that videos contain a
varying number of frames, which is incompatible with the standard input format
of CNNs.
Existing methods handle this either by directly sampling a fixed number of
frames or by introducing a 3D convolutional layer that performs convolution in
the spatio-temporal domain.
To address this issue, we propose a novel network structure that accepts an
arbitrary number of frames as the network input. The key to our solution is
to introduce a module consisting of an encoding layer and a temporal pyramid
pooling layer. The encoding layer maps the activation from previous layers to a
feature vector suitable for pooling while the temporal pyramid pooling layer
converts multiple frame-level activations into a fixed-length video-level
representation. In addition, we adopt a feature concatenation layer which
combines appearance information and motion information. Compared with the frame
sampling strategy, our method avoids the risk of missing any important frames.
Compared with the 3D convolutional method which requires a huge video dataset
for network training, our model can be learned on a small target dataset
because we can leverage the off-the-shelf image-level CNN for model parameter
initialization. Experiments on two challenging datasets, Hollywood2 and HMDB51,
demonstrate that our method achieves superior performance over state-of-the-art
methods while requiring far less training data.
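A small sketch of the temporal pyramid pooling step: the frame-level activations of a variable-length video are split into 1, 2, and 4 temporal segments, each segment is max-pooled, and the results are concatenated into a fixed-length video-level vector. The level configuration and the choice of max pooling are assumptions, and the sketch assumes at least as many frames as the finest pyramid level.

```python
import torch

def temporal_pyramid_pool(frame_features, levels=(1, 2, 4)):
    """frame_features: (T, D) activations from the encoding layer, any T >= max(levels).
    Returns a fixed-length vector of size D * sum(levels)."""
    pooled = []
    for n_bins in levels:
        # split the time axis into n_bins roughly equal segments
        for segment in torch.tensor_split(frame_features, n_bins, dim=0):
            pooled.append(segment.max(dim=0).values)   # max-pool each segment
    return torch.cat(pooled)                           # video-level representation
```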