Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition
A major emerging challenge is how to protect people's privacy as cameras and
computer vision are increasingly integrated into our daily lives, including in
smart devices inside homes. A potential solution is to capture and record just
the minimum amount of information needed to perform a task of interest. In this
paper, we propose a fully-coupled two-stream spatiotemporal architecture for
reliable human action recognition on extremely low resolution (e.g., 12x16
pixel) videos. We provide an efficient method to extract spatial and temporal
features and to aggregate them into a robust feature representation for an
entire action video sequence. We also consider how to incorporate high
resolution videos during training in order to build better low resolution
action recognition models. We evaluate on two publicly available datasets,
showing significant improvements over the state-of-the-art.
Comment: 9 pages, 5 figures, published in WACV 201
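As a rough illustration of the two-stream idea at this input scale, the sketch below pairs a spatial (RGB) stream with a temporal (stacked optical-flow) stream over 12x16 inputs and fuses them by concatenation. The layer sizes, the flow-stack depth, and the simple late fusion are illustrative assumptions, not the paper's fully-coupled design.

import torch
import torch.nn as nn

class TinyStream(nn.Module):
    """Small CNN suited to 12x16-pixel inputs (sizes are assumptions)."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # 12x16 -> 6x8
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),              # global average pool
        )

    def forward(self, x):
        return self.features(x).flatten(1)        # (N, 64)

class TwoStreamClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.spatial = TinyStream(in_channels=3)    # one RGB frame
        self.temporal = TinyStream(in_channels=10)  # 5 flow frames x (dx, dy)
        self.classifier = nn.Linear(64 + 64, num_classes)

    def forward(self, rgb, flow):
        fused = torch.cat([self.spatial(rgb), self.temporal(flow)], dim=1)
        return self.classifier(fused)

model = TwoStreamClassifier(num_classes=10)
logits = model(torch.randn(2, 3, 12, 16), torch.randn(2, 10, 12, 16))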
Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition
Human action recognition remains an important yet challenging task. This work
proposes a novel action recognition system built on a Multi-View Region-Adaptive
Multi-temporal Depth Motion Map (MV-RAMDMM) formulation combined with
appearance information. Multiple-stream 3D
Convolutional Neural Networks (CNNs) are trained on the different views and
time resolutions of the region adaptive Depth Motion Maps. Multiple views are
synthesised to enhance the view invariance. The region adaptive weights, based
on localised motion, accentuate and differentiate parts of actions possessing
faster motion. Dedicated 3D CNN streams for multi-temporal-resolution appearance
(RGB) information are also included; these help to identify and differentiate
small object interactions. Each stream uses a pre-trained, fine-tuned 3D CNN
together with multi-class Support Vector Machines (SVMs), and the outputs are
combined by average score fusion. The developed approach is
capable of recognising both human actions and human-object interactions. Three
public-domain datasets are used to evaluate the proposed solution: MSR 3D Action,
Northwestern-UCLA Multi-view Actions, and MSR 3D Daily Activity.
The experimental results demonstrate the robustness of this approach compared
with state-of-the-art algorithms.
Comment: 14 pages, 6 figures, 13 tables. Submitted
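For reference, a Depth Motion Map is conventionally built by accumulating absolute differences between consecutive projected depth frames. The sketch below computes a single-view DMM at several temporal resolutions; the frame-subsampling scheme and the strides parameter are assumptions for illustration, not the paper's region-adaptive, multi-view formulation.

import numpy as np

def depth_motion_map(depth_seq):
    """depth_seq: (T, H, W) array of depth frames -> (H, W) DMM."""
    diffs = np.abs(np.diff(depth_seq.astype(np.float32), axis=0))
    return diffs.sum(axis=0)  # accumulate frame-to-frame motion energy

def multi_temporal_dmms(depth_seq, strides=(1, 2, 4)):
    """One DMM per temporal resolution, via frame subsampling (an assumption)."""
    return [depth_motion_map(depth_seq[::s]) for s in strides]

dmms = multi_temporal_dmms(np.random.rand(64, 240, 320))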
Identifying First-person Camera Wearers in Third-person Videos
We consider scenarios in which we wish to perform joint scene understanding,
object tracking, activity recognition, and other tasks in environments in which
multiple people are wearing body-worn cameras while a third-person static
camera also captures the scene. To do this, we need to establish person-level
correspondences across first- and third-person videos, which is challenging
because the camera wearer is not visible from his/her own egocentric video,
preventing the use of direct feature matching. In this paper, we propose a new
semi-Siamese Convolutional Neural Network architecture to address this novel
challenge. We formulate the problem as learning a joint embedding space for
first- and third-person videos that considers both spatial- and motion-domain
cues. A new triplet loss function is designed to minimize the distance between
correct first- and third-person matches while maximizing the distance between
incorrect ones. This end-to-end approach performs significantly better than
several baselines, in part by learning first- and third-person features that are
optimized for matching jointly with the distance measure itself.
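Because the paper designs its own triplet loss, the sketch below shows only the generic triplet-margin formulation it builds on: the distance from an anchor to a correct first-/third-person match is pushed below the distance to an incorrect match by at least a margin. The Euclidean distance and the margin value are assumptions.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull correct matches together; push incorrect ones at least
    `margin` further away than the correct ones."""
    d_pos = F.pairwise_distance(anchor, positive)  # correct match
    d_neg = F.pairwise_distance(anchor, negative)  # incorrect match
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = (torch.randn(8, 128) for _ in range(3))
loss = triplet_loss(a, p, n)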
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
The paucity of videos in current action classification datasets (UCF-101 and
HMDB-51) has made it difficult to identify good video architectures, as most
methods obtain similar performance on existing small-scale benchmarks. This
paper re-evaluates state-of-the-art architectures in light of the new Kinetics
Human Action Video dataset. Kinetics has two orders of magnitude more data,
with 400 human action classes and over 400 clips per class, and is collected
from realistic, challenging YouTube videos. We provide an analysis of how
current architectures fare on the task of action classification on this dataset
and how much performance improves on the smaller benchmark datasets after
pre-training on Kinetics.
We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on
2D ConvNet inflation: filters and pooling kernels of very deep image
classification ConvNets are expanded into 3D, making it possible to learn
seamless spatio-temporal feature extractors from video while leveraging
successful ImageNet architecture designs and even their parameters. We show
that, after pre-training on Kinetics, I3D models considerably improve upon the
state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0%
on UCF-101.
Comment: Removed references to the mini-Kinetics dataset, which was never made
publicly available, and repeated all experiments on the full Kinetics dataset.
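The inflation step can be stated concretely: each pretrained 2D kernel is replicated along a new temporal axis and rescaled by 1/T, so that on a "boring" video made of one repeated frame the inflated 3D network initially reproduces the 2D network's responses. A minimal sketch for a single convolution weight (tensor shapes are illustrative):

import torch

def inflate_conv_weight(w2d, time_dim=3):
    """w2d: (out_c, in_c, kH, kW) -> (out_c, in_c, T, kH, kW)."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w3d / time_dim  # rescale so activations are preserved

w2d = torch.randn(64, 3, 7, 7)   # e.g. a 2D conv1 kernel from ImageNet
w3d = inflate_conv_weight(w2d)   # (64, 3, 3, 7, 7)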