5,769 research outputs found
Learning Spatiotemporal Features for Infrared Action Recognition with 3D Convolutional Neural Networks
Infrared (IR) imaging has the potential to enable more robust action
recognition systems compared to visible spectrum cameras due to lower
sensitivity to lighting conditions and appearance variability. While the action
recognition task on videos collected from visible spectrum imaging has received
much attention, action recognition in IR videos is significantly less explored.
Our objective is to exploit imaging data in this modality for the action
recognition task. In this work, we propose a novel two-stream 3D convolutional
neural network (CNN) architecture by introducing the discriminative code layer
and the corresponding discriminative code loss function. The proposed network
processes IR image and the IR-based optical flow field sequences. We pretrain
the 3D CNN model on the visible spectrum Sports-1M action dataset and finetune
it on the Infrared Action Recognition (InfAR) dataset. To our best knowledge,
this is the first application of the 3D CNN to action recognition in the IR
domain. We conduct an elaborate analysis of different fusion schemes (weighted
average, single and double-layer neural nets) applied to different 3D CNN
outputs. Experimental results demonstrate that our approach can achieve
state-of-the-art average precision (AP) performances on the InfAR dataset: (1)
the proposed two-stream 3D CNN achieves the best reported 77.5% AP, and (2) our
3D CNN model applied to the optical flow fields achieves the best reported
single stream 75.42% AP
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos
Deep learning has been demonstrated to achieve excellent results for image
classification and object detection. However, the impact of deep learning on
video analysis (e.g. action detection and recognition) has been limited due to
complexity of video data and lack of annotations. Previous convolutional neural
networks (CNN) based video action detection approaches usually consist of two
major steps: frame-level action proposal detection and association of proposals
across frames. Also, these methods employ two-stream CNN framework to handle
spatial and temporal feature separately. In this paper, we propose an
end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for
action detection in videos. The proposed architecture is a unified network that
is able to recognize and localize action based on 3D convolution features. A
video is first divided into equal length clips and for each clip a set of tube
proposals are generated next based on 3D Convolutional Network (ConvNet)
features. Finally, the tube proposals of different clips are linked together
employing network flow and spatio-temporal action detection is performed using
these linked video proposals. Extensive experiments on several video datasets
demonstrate the superior performance of T-CNN for classifying and localizing
actions in both trimmed and untrimmed videos compared to state-of-the-arts
Improved two-stream model for human action recognition
This paper addresses the recognitions of human actions in videos. Human action recognition can be seen as the automatic labeling of a video according to the actions occurring in it. It has become one of the most challenging and attractive problems in the pattern recognition and video classification fields. The problem itself is difficult to solve by traditional video processing methods because of several challenges such as the background noise, sizes of subjects in different videos, and the speed of actions. Derived from the progress of deep learning methods, several directions are developed to recognize a human action from a video, such as the long-short-term memory (LSTM)-based model, two-stream convolutional neural network (CNN) model, and the convolutional 3D model.In this paper, we focus on the two-stream structure. The traditional two-stream CNN network solves the problem that CNNs do not have satisfactory performance on temporal features. By training a temporal stream, which uses the optical flow as the input, a CNN can have the ability to extract temporal features. However, the optical flow only contains limited temporal information because it only records the movements of pixels on the x-axis and the y-axis. Therefore, we attempt to design and implement a new two-stream model by using an LSTM-based model in its spatial stream to extract both spatial and temporal features in RGB frames. In addition, we implement a DenseNet in the temporal stream to improve the recognition accuracy. This is in-contrast to traditional approaches which typically utilize the spatial stream for extracting only spatial features. The quantitative evaluation and experiments are conducted on the UCF-101 dataset, which is a well-developed public video dataset. For the temporal stream, we choose the optical flow of UCF-101. Images in the optical flow are provided by the Graz University of Technology. The experimental result shows that the proposed method outperforms the traditional two-stream CNN method with an accuracy of at least 3%. For both spatial and temporal streams, the proposed model also achieves higher recognition accuracies. In addition, compared with the state of the art methods, the new model can still have the best recognition performance
Going Deeper into Action Recognition: A Survey
Understanding human actions in visual data is tied to advances in
complementary research areas including object recognition, human dynamics,
domain adaptation and semantic segmentation. Over the last decade, human action
analysis evolved from earlier schemes that are often limited to controlled
environments to nowadays advanced solutions that can learn from millions of
videos and apply to almost all daily activities. Given the broad range of
applications from video surveillance to human-computer interaction, scientific
milestones in action recognition are achieved more rapidly, eventually leading
to the demise of what used to be good in a short time. This motivated us to
provide a comprehensive review of the notable steps taken towards recognizing
human actions. To this end, we start our discussion with the pioneering methods
that use handcrafted representations, and then, navigate into the realm of deep
learning based approaches. We aim to remain objective throughout this survey,
touching upon encouraging improvements as well as inevitable fallbacks, in the
hope of raising fresh questions and motivating new research directions for the
reader
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
The paucity of videos in current action classification datasets (UCF-101 and
HMDB-51) has made it difficult to identify good video architectures, as most
methods obtain similar performance on existing small-scale benchmarks. This
paper re-evaluates state-of-the-art architectures in light of the new Kinetics
Human Action Video dataset. Kinetics has two orders of magnitude more data,
with 400 human action classes and over 400 clips per class, and is collected
from realistic, challenging YouTube videos. We provide an analysis on how
current architectures fare on the task of action classification on this dataset
and how much performance improves on the smaller benchmark datasets after
pre-training on Kinetics.
We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on
2D ConvNet inflation: filters and pooling kernels of very deep image
classification ConvNets are expanded into 3D, making it possible to learn
seamless spatio-temporal feature extractors from video while leveraging
successful ImageNet architecture designs and even their parameters. We show
that, after pre-training on Kinetics, I3D models considerably improve upon the
state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0%
on UCF-101.Comment: Removed references to mini-kinetics dataset that was never made
publicly available and repeated all experiments on the full Kinetics datase
Action recognition based on efficient deep feature learning in the spatio-temporal domain
© 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Hand-crafted feature functions are usually designed based on the domain knowledge of a presumably controlled environment and often fail to generalize, as the statistics of real-world data cannot always be modeled correctly. Data-driven feature learning methods, on the other hand, have emerged as an alternative that often generalize better in uncontrolled environments. We present a simple, yet robust, 2D convolutional neural network extended to a concatenated 3D network that learns to extract features from the spatio-temporal domain of raw video data. The resulting network model is used for content-based recognition of videos. Relying on a 2D convolutional neural network allows us to exploit a pretrained network as a descriptor that yielded the best results on the largest and challenging ILSVRC-2014 dataset. Experimental results on commonly used benchmarking video datasets demonstrate that our results are state-of-the-art in terms of accuracy and computational time without requiring any preprocessing (e.g., optic flow) or a priori knowledge on data capture (e.g., camera motion estimation), which makes it more general and flexible than other approaches. Our implementation is made available.Peer ReviewedPostprint (author's final draft
Activity Recognition based on a Magnitude-Orientation Stream Network
The temporal component of videos provides an important clue for activity
recognition, as a number of activities can be reliably recognized based on the
motion information. In view of that, this work proposes a novel temporal stream
for two-stream convolutional networks based on images computed from the optical
flow magnitude and orientation, named Magnitude-Orientation Stream (MOS), to
learn the motion in a better and richer manner. Our method applies simple
nonlinear transformations on the vertical and horizontal components of the
optical flow to generate input images for the temporal stream. Experimental
results, carried on two well-known datasets (HMDB51 and UCF101), demonstrate
that using our proposed temporal stream as input to existing neural network
architectures can improve their performance for activity recognition. Results
demonstrate that our temporal stream provides complementary information able to
improve the classical two-stream methods, indicating the suitability of our
approach to be used as a temporal video representation.Comment: 8 pages, SIBGRAPI 201
- …