Resource Efficient 3D Convolutional Neural Networks
Recently, convolutional neural networks with 3D kernels (3D CNNs) have become
very popular in the computer vision community as a result of their superior
ability to extract spatio-temporal features from video frames compared to 2D
CNNs. Although there have been great advances recently in building
resource-efficient 2D CNN architectures that respect memory and power budgets,
there are hardly any similar resource-efficient architectures for 3D CNNs. In
this paper, we have converted various well-known resource-efficient 2D CNNs to
3D CNNs and
evaluated their performance on three major benchmarks in terms of
classification accuracy for different complexity levels. We have experimented
on (1) the Kinetics-600 dataset to inspect their capacity to learn, (2) the
Jester dataset to inspect their ability to capture motion patterns, and (3)
UCF-101 to
inspect the applicability of transfer learning. We have evaluated the run-time
performance of each model on a single Titan XP GPU and a Jetson TX2 embedded
system. The results of this study show that these models can be utilized for
different types of real-world applications since they provide real-time
performance with considerable accuracy and reasonable memory usage. Our
analysis at different complexity levels shows that resource-efficient 3D CNNs
should not be made too shallow or too narrow in order to save complexity. The code
and pretrained models used in this work are publicly available.
Comment: Accepted to the ICCV 2019 Neural Architects workshop.
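As a hedged illustration of what such a 2D-to-3D conversion involves, the
following PyTorch sketch shows a depthwise-separable block built with 3D
kernels, in the spirit of translating a MobileNet-style 2D block to 3D. The
channel sizes, strides, and block layout are illustrative assumptions, not the
exact architectures evaluated in the paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable3D(nn.Module):
    """Illustrative 3D depthwise-separable block (assumed layout)."""
    def __init__(self, in_ch, out_ch, stride=(1, 1, 1)):
        super().__init__()
        # Depthwise 3x3x3 convolution over (time, height, width).
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm3d(in_ch)
        # Pointwise 1x1x1 convolution mixes channels, as in 2D MobileNets.
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Example: a 16-frame 112x112 RGB clip with spatial stride 2.
clip = torch.randn(1, 3, 16, 112, 112)
block = DepthwiseSeparable3D(3, 32, stride=(1, 2, 2))
print(block(clip).shape)  # torch.Size([1, 32, 16, 56, 56])
```

Because the depthwise 3x3x3 kernel spans time as well as space, the block can
capture motion; swapping nn.Conv2d/BatchNorm2d for their 3D counterparts is
the essence of the conversion.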
Comparative Analysis of CNN-based Spatiotemporal Reasoning in Videos
Understanding actions and gestures in video streams requires temporal
reasoning of the spatial content from different time instants, i.e.,
spatiotemporal (ST) modeling. In this survey paper, we make a comparative
analysis of different ST modeling techniques for action and gesture recognition
tasks. Since Convolutional Neural Networks (CNNs) have proved to be effective
feature extractors for static images, we apply ST modeling techniques to the
features of static images from different time instants extracted by
CNNs. All techniques are trained end-to-end together with a CNN feature
extraction part and evaluated on two publicly available benchmarks: The Jester
and the Something-Something datasets. The Jester dataset contains various
dynamic and static hand gestures, whereas the Something-Something dataset
contains actions of human-object interactions. The common characteristic of
these two benchmarks is that the designed architectures need to capture the
full temporal content of videos in order to correctly classify
actions/gestures. Contrary to expectations, experimental results show that
Recurrent Neural Network (RNN) based ST modeling techniques yield inferior
results compared to other techniques such as fully convolutional architectures.
The code and pretrained models for this work are publicly available.
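To make the comparison concrete, here is a minimal PyTorch sketch of two
ST-modeling heads applied to per-frame CNN features: an RNN-based head and a
fully convolutional one. The feature dimension, hidden size, and class count
are assumed values, not the paper's exact settings.

```python
import torch
import torch.nn as nn

T, D, num_classes = 8, 512, 27  # frames, feature dim, classes: assumptions
feats = torch.randn(1, T, D)    # one D-dim CNN feature vector per frame

# (a) RNN-based head: classify from the final LSTM hidden state.
rnn = nn.LSTM(D, 256, batch_first=True)
fc_rnn = nn.Linear(256, num_classes)
_, (h, _) = rnn(feats)
logits_rnn = fc_rnn(h[-1])

# (b) Fully convolutional head: 1D convolution over the time axis,
# followed by global average pooling.
conv = nn.Sequential(
    nn.Conv1d(D, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
)
fc_conv = nn.Linear(256, num_classes)
logits_conv = fc_conv(conv(feats.transpose(1, 2)).squeeze(-1))

print(logits_rnn.shape, logits_conv.shape)  # both torch.Size([1, 27])
```

Only the head differs between techniques; the CNN feature extractor is shared
and trained end-to-end with it, which keeps the comparison fair.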
DriverMHG: A Multi-Modal Dataset for Dynamic Recognition of Driver Micro Hand Gestures and a Real-Time Recognition Framework
The use of hand gestures provides a natural alternative to cumbersome
interface devices for Human-Computer Interaction (HCI) systems. However,
real-time recognition of dynamic micro hand gestures from video streams is
challenging for in-vehicle scenarios since (i) the gestures should be performed
naturally without distracting the driver, (ii) micro hand gestures occur within
very short time intervals in spatially constrained areas, (iii) the performed
gesture should be recognized only once, and (iv) the entire architecture should
be lightweight, as it will be deployed on an embedded system. In this work, we
propose an HCI system for dynamic recognition of driver micro hand gestures,
which can have a crucial impact in the automotive sector, especially for
safety-related issues. For this purpose, we initially collected a dataset named
Driver Micro Hand Gestures (DriverMHG), which consists of RGB, depth and
infrared modalities. The challenges of dynamic micro hand gesture recognition
are addressed by proposing a lightweight convolutional neural network (CNN)
based architecture that operates online efficiently with a sliding-window
approach. For the CNN model, several resource-efficient 3D networks are
applied and their performance is analyzed. Online
recognition of gestures has been performed with 3D-MobileNetV2, which provided
the best offline accuracy among the applied networks with similar computational
complexities. The final architecture is deployed on a driver simulator
operating in real time. We make the DriverMHG dataset and our source code
publicly available.
Comment: Accepted to the IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020).
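A minimal sketch of such an online sliding-window loop is given below. The
window length, stride, and the `model` placeholder (standing in for a
lightweight 3D CNN such as 3D-MobileNetV2) are illustrative assumptions, and
frame capture and preprocessing are omitted.

```python
import collections
import torch

window, stride = 8, 4  # assumed window length and stride, in frames
buffer = collections.deque(maxlen=window)

def classify(model, frames):
    # frames: iterable of (C, H, W) tensors -> clip of shape (1, C, T, H, W)
    clip = torch.stack(list(frames), dim=1).unsqueeze(0)
    with torch.no_grad():
        return model(clip).softmax(dim=-1)

# Main loop (model and video_stream are placeholders, hence commented out):
# for t, frame in enumerate(video_stream):
#     buffer.append(frame)
#     if len(buffer) == window and t % stride == 0:
#         probs = classify(model, buffer)
#         # To recognize each gesture only once, report a class only when
#         # its probability first crosses a threshold, then suppress it
#         # until the detection resets.
```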
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
Spatiotemporal action localization requires the incorporation of two sources
of information into the designed architecture: (1) temporal information from
the previous frames and (2) spatial information from the key frame. Current
state-of-the-art approaches usually extract this information with separate
networks and use an extra fusion mechanism to obtain detections. In this work,
we present YOWO, a unified CNN architecture for real-time spatiotemporal action
localization in video streams. YOWO is a single-stage architecture with two
branches to extract temporal and spatial information concurrently and predict
bounding boxes and action probabilities directly from video clips in one
evaluation. Since the whole architecture is unified, it can be optimized
end-to-end. The YOWO architecture is fast, providing 34 frames per second on
16-frame input clips and 62 frames per second on 8-frame input clips, which
makes it currently the fastest state-of-the-art architecture for the
spatiotemporal action localization task. Remarkably, YOWO outperforms the
previous state-of-the-art results on J-HMDB-21 and UCF101-24 with impressive
improvements of ~3% and ~12%, respectively. Moreover, YOWO is the first and
only single-stage architecture that provides competitive results on the AVA
dataset. We make our code and pretrained models publicly available.
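To make the two-branch idea concrete, below is a minimal PyTorch sketch of a
unified clip encoder in the spirit of YOWO: a 3D branch over the clip, a 2D
branch over the key frame, and a single detection head on the fused features.
The stand-in single-layer backbones, the plain concatenation fusion (YOWO
itself fuses with a channel fusion and attention mechanism), and the
anchor/class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchDetector(nn.Module):
    """Illustrative two-branch, single-stage detector (assumed layout)."""
    def __init__(self, num_anchors=5, num_classes=24):
        super().__init__()
        # Stand-ins for real backbones (e.g. a 3D CNN and a 2D detector backbone).
        self.branch3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.branch2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        # Per grid cell: anchors x (4 box coords + 1 confidence + classes).
        self.head = nn.Conv2d(128, num_anchors * (5 + num_classes), kernel_size=1)

    def forward(self, clip):  # clip: (B, 3, T, H, W); last frame is the key frame
        f3d = self.branch3d(clip).mean(dim=2)   # collapse time -> (B, 64, H, W)
        f2d = self.branch2d(clip[:, :, -1])     # key frame     -> (B, 64, H, W)
        fused = torch.cat([f3d, f2d], dim=1)    # concat as a fusion stand-in
        return self.head(fused)                 # (B, A*(5+C), H, W)

clip = torch.randn(1, 3, 16, 224, 224)
print(TwoBranchDetector()(clip).shape)  # torch.Size([1, 145, 224, 224])
```

Because both branches feed one head in a single forward pass, boxes and action
probabilities come out in one evaluation and the whole network can be
optimized end-to-end, which is the core of the single-stage design.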