Recurrent Regression for Face Recognition
To address the sequential changes of images, including poses, we propose a recurrent regression neural network (RRNN) framework that unifies two classic tasks: cross-pose face recognition on still images and video-based face recognition. To imitate the changes of images, we explicitly construct the potential dependencies between sequential images so as to regularize the final learning model. By performing progressive transforms on sequentially adjacent images, RRNN can adaptively memorize and forget the information that benefits the final classification. For face recognition on still images, given a single image in any pose, we recurrently predict the images in its sequential poses so as to capture useful information from other poses. For video-based face recognition, the recurrent regression takes an entire sequence, rather than a single image, as its input. We verify RRNN on the static face dataset MultiPIE and the face video dataset YouTube Celebrities (YTC). Comprehensive experimental results demonstrate the effectiveness of the proposed RRNN method.
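As a rough illustration of the recurrent regression idea, and not the paper's exact architecture, here is a minimal PyTorch sketch; the GRU cell, feature dimensions, and subject count are assumptions:

```python
import torch
import torch.nn as nn

class RecurrentRegressor(nn.Module):
    """Hypothetical sketch: recurrently regress next-pose embeddings,
    then classify identity from the final recurrent state."""
    def __init__(self, feat_dim=256, hidden_dim=512, num_classes=337):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)      # memorizes/forgets across poses
        self.regress = nn.Linear(hidden_dim, feat_dim)   # predicts the next-pose embedding
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, first_feat, num_steps):
        # first_feat: (batch, feat_dim) embedding of the single input image
        h = first_feat.new_zeros(first_feat.size(0), self.hidden_dim)
        x, preds = first_feat, []
        for _ in range(num_steps):       # progressive transforms over sequential poses
            h = self.rnn(x, h)
            x = self.regress(h)          # the regressed embedding feeds the next step
            preds.append(x)
        return torch.stack(preds, dim=1), self.classify(h)
```

For video-based recognition, the same cell would instead consume one real frame embedding per step rather than its own predictions.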
Recurrent Convolutional Neural Network Regression for Continuous Pain Intensity Estimation in Video
Automatic pain intensity estimation plays a significant role in the healthcare and medical fields. Traditional static methods extract features from each frame of a video separately, which results in unstable estimates and spurious peaks across adjacent frames. To overcome this problem, we propose a real-time regression framework based on a recurrent convolutional neural network for automatic frame-level pain intensity estimation. Given vector sequences of AAM-warped facial images, we use a sliding-window strategy to obtain fixed-length input samples for the recurrent network. We then carefully design the architecture of the recurrent network to output continuous-valued pain intensity. The proposed end-to-end pain intensity regression framework can predict the pain intensity of each frame by considering a sufficiently long history of frames while limiting the scale of the model's parameters. Our method achieves promising results in both accuracy and running speed on the published UNBC-McMaster Shoulder Pain Expression Archive Database.
Comment: This paper is the pre-print technical report of the paper accepted by the IEEE CVPR Workshop on Affect "in-the-wild". The final version will be available after the workshop.
Feature Extraction via Recurrent Random Deep Ensembles and Its Application in Group-level Happiness Estimation
This paper presents a novel ensemble framework for extracting highly discriminative feature representations of images, with an application to group-level happiness intensity prediction in the wild. In order to generate enough diversity of decisions, n convolutional neural networks are trained by bootstrapping the training set, and n features are extracted from them for each image. A recurrent neural network (RNN) is then used to remember which network extracts the better feature and to generate the final feature representation for each individual image. Several group emotion models (GEM) are used to aggregate the face features within a group, and a parameter-optimized support vector regressor (SVR) produces the final results. Through extensive experiments, the effectiveness of the proposed recurrent random deep ensembles (RRDE) is demonstrated in both structural and decisional ways. The best result yields a 0.55 root-mean-square error (RMSE) on the validation set of the HAPPEI dataset, significantly better than the baseline of 0.78.
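A rough sketch of the ensemble-then-RNN aggregation step; the feature dimensions, GRU choice, and use of scikit-learn's SVR are assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
from sklearn.svm import SVR

class EnsembleAggregator(nn.Module):
    """Hypothetical sketch: feed the n per-network features for one image
    through an RNN and keep the final state as the fused representation."""
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feats):          # feats: (batch, n_networks, feat_dim)
        _, h = self.rnn(feats)         # the RNN learns which network's feature to trust
        return h.squeeze(0)            # (batch, hidden_dim) per-image representation

# Hypothetical downstream step: per-face representations are pooled by a group
# emotion model, then regressed to a happiness intensity with a tuned SVR.
svr = SVR(C=1.0, epsilon=0.1)          # parameters would be optimized on validation data
```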
How Deep Neural Networks Can Improve Emotion Recognition on Video Data
We consider the task of dimensional emotion recognition on video data using
deep learning. While several previous methods have shown the benefits of
training temporal neural network models such as recurrent neural networks
(RNNs) on hand-crafted features, few works have considered combining
convolutional neural networks (CNNs) with RNNs. In this work, we present a
system that performs emotion recognition on video data using both CNNs and
RNNs, and we also analyze how much each neural network component contributes to
the system's overall performance. We present our findings on videos from the
Audio/Visual+Emotion Challenge (AV+EC2015). In our experiments, we analyze the
effects of several hyperparameters on overall performance while also achieving
superior performance to the baseline and other competing methods.
Comment: Accepted at ICIP 2016. Fixed typo in Experiments section.
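A minimal sketch of the CNN-plus-RNN combination the abstract describes; the tiny backbone, dimensions, and single-dimension output are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CnnRnnEmotion(nn.Module):
    """Hypothetical sketch: a CNN embeds each frame, an LSTM models the
    sequence, and a linear head regresses a per-frame emotion dimension."""
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                        # stand-in frame encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.LSTM(32, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)             # e.g. valence per frame

    def forward(self, clips):                            # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)                         # temporal modeling over frames
        return self.head(out)                            # (B, T, 1) predictions
```

Ablating either component (CNN features into a plain regressor, or an RNN over hand-crafted features) is how one would measure each part's contribution.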
Attended End-to-end Architecture for Age Estimation from Facial Expression Videos
The main challenges of age estimation from facial expression videos lie not
only in the modeling of the static facial appearance, but also in the capturing
of the temporal facial dynamics. Traditional approaches to this problem focus on constructing handcrafted features to explore the discriminative information contained in facial appearance and dynamics separately, which relies on sophisticated feature refinement and framework design. In this paper, we
present an end-to-end architecture for age estimation, called Spatially-Indexed
Attention Model (SIAM), which is able to simultaneously learn both the
appearance and dynamics of age from raw videos of facial expressions.
Specifically, we employ convolutional neural networks to extract effective
latent appearance representations and feed them into recurrent networks to
model the temporal dynamics. More importantly, we propose to leverage attention
models for salience detection in both the spatial domain for each single image
and the temporal domain for the whole video as well. We design a specific
spatially-indexed attention mechanism among the convolutional layers to extract
the salient facial regions in each individual image, and a temporal attention
layer to assign attention weights to each frame. This two-pronged approach not
only improves the performance by allowing the model to focus on informative
frames and facial areas, but it also offers an interpretable correspondence
between the spatial facial regions as well as temporal frames, and the task of
age estimation. We demonstrate the strong performance of our model in
experiments on a large, gender-balanced database with 400 subjects with ages
spanning from 8 to 76 years. Experiments reveal that our model exhibits
significant superiority over the state-of-the-art methods given sufficient
training data.
Comment: Accepted by IEEE Transactions on Image Processing (TIP).
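A rough sketch of the two attention ideas, spatial weighting within a feature map and temporal weighting across frames; the pooling scheme and dimensions are assumptions, not SIAM's exact design:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Score each spatial location of a conv feature map and reweight it."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap):                    # fmap: (B, C, H, W)
        w = torch.softmax(self.score(fmap).flatten(2), dim=-1)   # over H*W
        return (fmap.flatten(2) * w).sum(-1)    # (B, C) attended appearance

class TemporalAttention(nn.Module):
    """Assign a weight to each frame's recurrent output and pool them."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, states):                  # states: (B, T, D)
        w = torch.softmax(self.score(states), dim=1)
        return (states * w).sum(1)              # video-level feature for age
```

The learned weights are also what make the model interpretable: they expose which facial regions and which frames drove the age estimate.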
End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies
Steering a car through traffic is a complex task that is difficult to cast
into algorithms. Therefore, researchers turn to training artificial neural
networks from front-facing camera data stream along with the associated
steering angles. Nevertheless, most existing solutions consider only the visual
camera frames as input, thus ignoring the temporal relationship between frames.
In this work, we propose a Convolutional Long Short-Term Memory Recurrent
Neural Network (C-LSTM), that is end-to-end trainable, to learn both visual and
dynamic temporal dependencies of driving. Additionally, we propose posing the steering angle regression problem as a classification task while imposing a spatial relationship between the output-layer neurons. This method is based on learning a sinusoidal function that encodes steering angles. To train and validate our proposed methods, we used the publicly available Comma.ai dataset. Our solution reduced the steering root mean square error by 35% relative to recent methods and yielded 87% more stable steering.
Comment: 31st Conference on Neural Information Processing Systems (NIPS), Machine Learning for Intelligent Transportation Systems Workshop, Long Beach, CA, USA, 2017.
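A hedged sketch of one way to encode a steering angle as a sinusoid over output neurons, so regression becomes classification with spatial structure; the bin count, angle range, and falloff width are assumptions, not the paper's exact parameters:

```python
import numpy as np

def encode_angle(angle_deg, num_bins=181, lo=-90.0, hi=90.0, width=25.0):
    """Target vector peaking at the true angle's bin, falling off as a
    half-cosine over +/- width degrees so neighboring neurons co-activate."""
    centers = np.linspace(lo, hi, num_bins)
    phase = np.clip((centers - angle_deg) / width, -1.0, 1.0) * (np.pi / 2)
    return np.cos(phase)      # ~0 outside the window, 1 at the true angle

def decode_angle(outputs, lo=-90.0, hi=90.0):
    centers = np.linspace(lo, hi, outputs.shape[-1])
    return centers[int(np.argmax(outputs))]   # winning neuron's angle
```

Because adjacent bins share activation, misclassifying by one bin costs little, which is the spatial relationship the output layer is meant to exploit.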
Saliency Supervision: An Intuitive and Effective Approach for Pain Intensity Regression
Estimating pain intensity from face images is an important problem in autonomous nursing systems. However, due to limitations in data sources and the subjectiveness of pain intensity values, it is hard to adopt modern deep neural networks for this problem without domain-specific auxiliary design. Inspired by human vision priors, we propose a novel approach called saliency supervision, where we directly regularize deep networks to focus on the facial areas that are discriminative for pain regression. Through alternating training between saliency supervision and a global loss, our method can learn sparse and robust features, which proves helpful for pain intensity regression. We verified saliency supervision with a face-verification network backbone on a widely used dataset and achieved state-of-the-art performance without bells and whistles. Our saliency supervision is intuitive in spirit, yet effective in performance. We believe such saliency supervision is essential when dealing with ill-posed datasets, and has potential in a wide range of vision tasks.
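A minimal sketch of one possible saliency-supervision term, penalizing attention mass that falls outside a facial mask; the mask source and the exact form of the regularizer are assumptions:

```python
import torch

def saliency_loss(attn_map, face_mask):
    """attn_map, face_mask: (B, H, W); mask is 1 on the discriminative
    facial area. Returns the mean attention mass landing off the face."""
    attn = attn_map.flatten(1).softmax(-1).view_as(attn_map)   # normalize to a distribution
    off_face = (attn * (1.0 - face_mask)).sum(dim=(1, 2))      # mass off the face
    return off_face.mean()   # alternate this term with the global regression loss
```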
Long short-term memory networks in memristor crossbars
Recent breakthroughs in recurrent deep neural networks with long short-term memory (LSTM) units have led to major advances in artificial intelligence. State-of-the-art LSTM models with significantly increased complexity and a large number of parameters, however, face a computing-power bottleneck resulting from limited memory capacity and data communication bandwidth. Here we demonstrate experimentally that LSTM can be implemented with a memristor crossbar, which has a small circuit footprint for storing a large number of parameters and an in-memory computing capability that circumvents the 'von Neumann bottleneck'. We illustrate the capability of our system by solving real-world regression and classification problems, showing that memristor LSTM is a promising low-power, low-latency hardware platform for edge inference.
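A toy numerical illustration of the central idea: an LSTM's weight matrices stored as crossbar conductances, with the matrix-vector product computed "in memory" by Ohm's and Kirchhoff's laws. The differential-pair encoding of signed weights shown here is a common scheme and an assumption, not necessarily this paper's circuit:

```python
import numpy as np

def crossbar_matvec(G_pos, G_neg, v):
    """Column currents sum conductance * voltage; a differential pair of
    non-negative conductances encodes signed weights W = G_pos - G_neg."""
    return v @ (G_pos - G_neg)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                         # one signed LSTM weight block
G_pos, G_neg = np.clip(W, 0, None), np.clip(-W, 0, None)
x = rng.normal(size=8)                              # input/hidden voltage vector
assert np.allclose(crossbar_matvec(G_pos, G_neg, x), x @ W)
```

The point is that the multiply-accumulate happens where the weights are stored, so no parameter traffic crosses a memory bus.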
Tensor Fusion Network for Multimodal Sentiment Analysis
Multimodal sentiment analysis is an increasingly popular research area, which
extends the conventional language-based definition of sentiment analysis to a
multimodal setup where other relevant modalities accompany language. In this
paper, we pose the problem of multimodal sentiment analysis as modeling
intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both dynamics end-to-end. The proposed
approach is tailored for the volatile nature of spoken language in online
videos as well as accompanying gestures and voice. In the experiments, our
model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
Comment: Accepted as a full paper at EMNLP 2017.
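A minimal sketch of the tensor fusion operation at the heart of this approach: a three-way outer product over the modality embeddings, each with a 1 appended so unimodal and bimodal interaction terms survive in the product. The embedding dimensions are illustrative assumptions:

```python
import torch

def tensor_fusion(z_lang, z_visual, z_audio):
    """Each z_*: (B, d_*). Returns the flattened 3-way interaction tensor."""
    one = z_lang.new_ones(z_lang.size(0), 1)
    zl = torch.cat([z_lang, one], dim=1)
    zv = torch.cat([z_visual, one], dim=1)
    za = torch.cat([z_audio, one], dim=1)
    fused = torch.einsum('bi,bj,bk->bijk', zl, zv, za)   # all cross-modal products
    return fused.flatten(1)     # (B, (dl+1)(dv+1)(da+1)) -> inference layers
```

Slices of the tensor where one or two of the appended 1s are selected recover the unimodal and bimodal dynamics, which is how a single product models both intra- and inter-modality interactions.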
Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey
Deep learning has recently achieved very promising results in a wide range of
areas such as computer vision, speech recognition and natural language
processing. It aims to learn hierarchical representations of data by using deep
architecture models. In a smart city, a lot of data (e.g. videos captured from
many distributed sensors) need to be automatically processed and analyzed. In
this paper, we review the deep learning algorithms applied to the video analytics of smart cities in terms of different research topics: object detection, object tracking, face recognition, image classification, and scene labeling.
Comment: 8 pages, 18 figures.