3,659 research outputs found
A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
Traditional architectures for solving computer vision problems and the degree
of success they enjoyed have been heavily reliant on hand-crafted features.
However, of late, deep learning techniques have offered a compelling
alternative -- that of automatically learning problem-specific features. With
this new paradigm, every problem in computer vision is now being re-examined
from a deep learning perspective. Therefore, it has become important to
understand what kind of deep networks are suitable for a given problem.
Although general surveys of this fast-moving paradigm (i.e. deep-networks)
exist, a survey specific to computer vision is missing. We specifically
consider one form of deep networks widely used in computer vision -
convolutional neural networks (CNNs). We start with "AlexNet" as our base CNN
and then examine the broad variations proposed over time to suit different
applications. We hope that our recipe-style survey will serve as a guide,
particularly for novice practitioners intending to use deep-learning techniques
for computer vision.Comment: Published in Frontiers in Robotics and AI (http://goo.gl/6691Bm
Convolutional Neural Network on Three Orthogonal Planes for Dynamic Texture Classification
Dynamic Textures (DTs) are sequences of images of moving scenes that exhibit
certain stationarity properties in time such as smoke, vegetation and fire. The
analysis of DT is important for recognition, segmentation, synthesis or
retrieval for a range of applications including surveillance, medical imaging
and remote sensing. Deep learning methods have shown impressive results and are
now the new state of the art for a wide range of computer vision tasks
including image and video recognition and segmentation. In particular,
Convolutional Neural Networks (CNNs) have recently proven to be well suited for
texture analysis with a design similar to a filter bank approach. In this
paper, we develop a new approach to DT analysis based on a CNN method applied
on three orthogonal planes x y , xt and y t . We train CNNs on spatial frames
and temporal slices extracted from the DT sequences and combine their outputs
to obtain a competitive DT classifier. Our results on a wide range of commonly
used DT classification benchmark datasets prove the robustness of our approach.
Significant improvement of the state of the art is shown on the larger
datasets.Comment: 19 pages, 10 figure
Facial Expression Recognition from World Wild Web
Recognizing facial expression in a wild setting has remained a challenging
task in computer vision. The World Wide Web is a good source of facial images
which most of them are captured in uncontrolled conditions. In fact, the
Internet is a Word Wild Web of facial images with expressions. This paper
presents the results of a new study on collecting, annotating, and analyzing
wild facial expressions from the web. Three search engines were queried using
1250 emotion related keywords in six different languages and the retrieved
images were mapped by two annotators to six basic expressions and neutral. Deep
neural networks and noise modeling were used in three different training
scenarios to find how accurately facial expressions can be recognized when
trained on noisy images collected from the web using query terms (e.g. happy
face, laughing man, etc)? The results of our experiments show that deep neural
networks can recognize wild facial expressions with an accuracy of 82.12%
Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn
This paper presents an image classification based approach for skeleton-based
video action recognition problem. Firstly, A dataset independent
translation-scale invariant image mapping method is proposed, which transformes
the skeleton videos to colour images, named skeleton-images. Secondly, A
multi-scale deep convolutional neural network (CNN) architecture is proposed
which could be built and fine-tuned on the powerful pre-trained CNNs, e.g.,
AlexNet, VGGNet, ResNet etal.. Even though the skeleton-images are very
different from natural images, the fine-tune strategy still works well. At
last, we prove that our method could also work well on 2D skeleton video data.
We achieve the state-of-the-art results on the popular benchmard datasets e.g.
NTU RGB+D, UTD-MHAD, MSRC-12, and G3D. Especially on the largest and challenge
NTU RGB+D, UTD-MHAD, and MSRC-12 dataset, our method outperforms other methods
by a large margion, which proves the efficacy of the proposed method
FEAFA: A Well-Annotated Dataset for Facial Expression Analysis and 3D Facial Animation
Facial expression analysis based on machine learning requires large number of
well-annotated data to reflect different changes in facial motion. Publicly
available datasets truly help to accelerate research in this area by providing
a benchmark resource, but all of these datasets, to the best of our knowledge,
are limited to rough annotations for action units, including only their
absence, presence, or a five-level intensity according to the Facial Action
Coding System. To meet the need for videos labeled in great detail, we present
a well-annotated dataset named FEAFA for Facial Expression Analysis and 3D
Facial Animation. One hundred and twenty-two participants, including children,
young adults and elderly people, were recorded in real-world conditions. In
addition, 99,356 frames were manually labeled using Expression Quantitative
Tool developed by us to quantify 9 symmetrical FACS action units, 10
asymmetrical (unilateral) FACS action units, 2 symmetrical FACS action
descriptors and 2 asymmetrical FACS action descriptors, and each action unit or
action descriptor is well-annotated with a floating point number between 0 and
1. To provide a baseline for use in future research, a benchmark for the
regression of action unit values based on Convolutional Neural Networks are
presented. We also demonstrate the potential of our FEAFA dataset for 3D facial
animation. Almost all state-of-the-art algorithms for facial animation are
achieved based on 3D face reconstruction. We hence propose a novel method that
drives virtual characters only based on action unit value regression of the 2D
video frames of source actors.Comment: 9 pages, 7 figure
Driver Distraction Identification with an Ensemble of Convolutional Neural Networks
The World Health Organization (WHO) reported 1.25 million deaths yearly due
to road traffic accidents worldwide and the number has been continuously
increasing over the last few years. Nearly fifth of these accidents are caused
by distracted drivers. Existing work of distracted driver detection is
concerned with a small set of distractions (mostly, cell phone usage).
Unreliable ad-hoc methods are often used.In this paper, we present the first
publicly available dataset for driver distraction identification with more
distraction postures than existing alternatives. In addition, we propose a
reliable deep learning-based solution that achieves a 90% accuracy. The system
consists of a genetically-weighted ensemble of convolutional neural networks,
we show that a weighted ensemble of classifiers using a genetic algorithm
yields in a better classification confidence. We also study the effect of
different visual elements in distraction detection by means of face and hand
localizations, and skin segmentation. Finally, we present a thinned version of
our ensemble that could achieve 84.64% classification accuracy and operate in a
real-time environment.Comment: arXiv admin note: substantial text overlap with arXiv:1706.0949
- …