Supervised COSMOS Autoencoder: Learning Beyond the Euclidean Loss!
Autoencoders are unsupervised deep learning models used for learning
representations. In the literature, autoencoders have been shown to perform
well on a variety of tasks across multiple domains, establishing their
widespread applicability. Typically, an autoencoder is trained to generate a
model that minimizes the reconstruction error between the input and the
reconstructed output, computed in terms of the Euclidean distance. While this
can be useful for applications related to unsupervised reconstruction, it may
not be optimal for classification. In this paper, we propose a novel Supervised
COSMOS Autoencoder which utilizes a multi-objective loss function to learn
representations that simultaneously encode the (i) "similarity" between the
input and reconstructed vectors in terms of their direction, (ii)
"distribution" of pixel values of the reconstruction with respect to the input
sample, while also incorporating (iii) "discriminability" in the feature
learning pipeline. The proposed autoencoder model incorporates a Cosine
similarity and Mahalanobis distance based loss function, along with supervision
via Mutual Information based loss. Detailed analysis of each component of the
proposed model motivates its applicability for feature learning in different
classification tasks. The efficacy of the Supervised COSMOS autoencoder is
demonstrated via extensive experimental evaluations on different image
datasets. The proposed model outperforms existing algorithms on the MNIST,
CIFAR-10, and SVHN databases. It also yields state-of-the-art results on the
CelebA, LFWA, Adience, and IJB-A databases for attribute prediction and face
recognition, respectively.
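The abstract's multi-objective loss can be sketched as a weighted sum of a cosine (direction) term, a Mahalanobis (distribution) term, and a supervision term. The weights `alpha`, `beta`, `gamma` and the function names below are illustrative assumptions, not the paper's implementation; the mutual-information supervision term is passed in precomputed for simplicity.

```python
import numpy as np

def cosine_term(x, x_hat):
    # (i) "similarity": penalize the angular gap between input and reconstruction
    return 1.0 - np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))

def mahalanobis_term(x, x_hat, cov_inv):
    # (ii) "distribution": covariance-aware distance between the two vectors
    d = x - x_hat
    return float(d @ cov_inv @ d)

def cosmos_loss(x, x_hat, cov_inv, supervision_term,
                alpha=1.0, beta=1.0, gamma=1.0):
    # weighted combination of the three objectives; the weights are hypothetical
    return (alpha * cosine_term(x, x_hat)
            + beta * mahalanobis_term(x, x_hat, cov_inv)
            + gamma * supervision_term)
```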
Face Recognition: From Traditional to Deep Learning Methods
Starting in the seventies, face recognition has become one of the most
researched topics in computer vision and biometrics. Traditional methods based
on hand-crafted features and traditional machine learning techniques have
recently been superseded by deep neural networks trained with very large
datasets. In this paper, we provide a comprehensive and up-to-date literature
review of popular face recognition methods, including both traditional
approaches (geometry-based, holistic, feature-based, and hybrid methods) and
deep learning methods.
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
Recurrent Regression for Face Recognition
To address sequential changes in images, including pose variations, in this
paper we propose a recurrent regression neural network (RRNN) framework to
unify two classic tasks: cross-pose face recognition on still images and
video-based face recognition. To model the changes across images, we
explicitly construct the potential dependencies among sequential images so as
to regularize the final learning model. By performing progressive transforms
on sequentially adjacent images, RRNN can adaptively memorize and forget the
information that benefits the final classification. For face recognition on
still images, given an image in any pose, we recurrently predict the images in
its sequential poses in order to capture useful information from other poses.
For video-based face recognition, the recurrent regression takes an entire
sequence rather than a single image as its input. We verify RRNN on the static
face dataset MultiPIE and the face video dataset YouTube Celebrities (YTC).
The comprehensive experimental results demonstrate the effectiveness of the
proposed RRNN method.
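The "adaptively memorize and forget" behaviour described above can be sketched as a gated recurrent update applied to sequentially adjacent image features. This is a minimal stand-in, not the paper's architecture: the single forget gate, the weight-matrix names, and the zero initial state are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rrnn_step(h, x, W_f, W_h, W_x):
    # forget gate decides how much past pose information to keep
    f = sigmoid(W_f @ np.concatenate([h, x]))
    # progressive transform blending memory with the current image feature
    return f * h + (1.0 - f) * np.tanh(W_h @ h + W_x @ x)

def rrnn_sequence(x_seq, W_f, W_h, W_x):
    # run over sequentially adjacent images (poses or video frames);
    # the final state serves as the representation for classification
    h = np.zeros(W_h.shape[0])
    for x in x_seq:
        h = rrnn_step(h, x, W_f, W_h, W_x)
    return h
```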
From BoW to CNN: Two Decades of Texture Representation for Texture Classification
Texture is a fundamental characteristic of many types of images, and texture
representation is one of the essential and challenging problems in computer
vision and pattern recognition which has attracted extensive research
attention. Since 2000, texture representations based on Bag of Words (BoW) and
on Convolutional Neural Networks (CNNs) have been extensively studied with
impressive performance. Given this period of remarkable evolution, this paper
aims to present a comprehensive survey of advances in texture representation
over the last two decades. More than 200 major publications are cited in this
survey covering different aspects of the research, which includes (i) problem
description; (ii) recent advances in the broad categories of BoW-based,
CNN-based and attribute-based methods; and (iii) evaluation issues,
specifically benchmark datasets and state of the art results. In retrospect of
what has been achieved so far, the survey discusses open challenges and
directions for future research.Comment: Accepted by IJC
Pose-adaptive Hierarchical Attention Network for Facial Expression Recognition
Multi-view facial expression recognition (FER) is a challenging task because
the appearance of an expression varies with pose. To alleviate the influence
of pose, recent methods either perform pose normalization or learn separate
FER classifiers for each pose. However, these methods usually require two
stages and rely on the good performance of pose estimators. Different from existing methods,
we propose a pose-adaptive hierarchical attention network (PhaNet) that can
jointly recognize facial expressions and poses in unconstrained environments.
Specifically, PhaNet discovers the regions most relevant to the facial
expression via an attention mechanism at hierarchical scales, and the
most informative scales are then selected to learn the pose-invariant and
expression-discriminative representations. PhaNet is end-to-end trainable by
minimizing the hierarchical attention losses, the FER loss and pose loss with
dynamically learned loss weights. We validate the effectiveness of the proposed
PhaNet on three multi-view datasets (BU-3DFE, Multi-pie, and KDEF) and two
in-the-wild FER datasets (AffectNet and SFEW). Extensive experiments
demonstrate that our framework outperforms state-of-the-art methods under both
within-dataset and cross-dataset settings, achieving average accuracies of
84.92\%, 93.53\%, 88.5\%, 54.82\%, and 31.25\%, respectively. Comment: 12 pages, 15 figures
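The "dynamically learned loss weights" combining the hierarchical attention, FER, and pose losses can be illustrated with an uncertainty-style weighting in which each loss is scaled by a learned log-variance. This particular scheme is an assumption chosen for illustration; the abstract does not specify PhaNet's exact weighting rule.

```python
import math

def weighted_joint_loss(losses, log_vars):
    # each task loss is down-weighted by its learned uncertainty
    # (exp(-s)) and regularized by s itself, so the weights cannot
    # collapse to zero; log_vars would be trainable parameters
    total = 0.0
    for task_loss, s in zip(losses, log_vars):
        total += math.exp(-s) * task_loss + s
    return total
```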
Adversarial Discriminative Heterogeneous Face Recognition
The gap between sensing patterns of different face modalities remains a
challenging problem in heterogeneous face recognition (HFR). This paper
proposes an adversarial discriminative feature learning framework to close the
sensing gap via adversarial learning on both raw-pixel space and compact
feature space. This framework integrates cross-spectral face hallucination and
discriminative feature learning into an end-to-end adversarial network. In the
pixel space, we make use of generative adversarial networks to perform
cross-spectral face hallucination. An elaborate two-path model is introduced to
alleviate the lack of paired images, which gives consideration to both global
structures and local textures. In the feature space, an adversarial loss and a
high-order variance discrepancy loss are employed to measure the global and
local discrepancy between two heterogeneous distributions respectively. These
two losses enhance domain-invariant feature learning and modality-independent
noise removal. Experimental results on three NIR-VIS databases show that our
proposed approach outperforms state-of-the-art HFR methods without requiring
complex network structures or large-scale training datasets.
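The variance discrepancy loss in the feature space can be illustrated, in deliberately simplified scalar form, as a penalty on the gap between the second-order statistics of the two modality distributions. The real loss operates on feature tensors and higher-order moments; the function name here is hypothetical.

```python
def variance_discrepancy(feats_nir, feats_vis):
    # penalize the mismatch between the variances of the NIR and VIS
    # feature samples; a simplified 1-D stand-in for the paper's
    # high-order variance discrepancy loss
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    return abs(var(feats_nir) - var(feats_vis))
```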
Occlusion-guided compact template learning for ensemble deep network-based pose-invariant face recognition
Concatenation of the deep network representations extracted from different
facial patches helps to improve face recognition performance. However, the
concatenated facial template increases in size and contains redundant
information. Previous solutions aim to reduce the dimensionality of the facial
template without considering the occlusion pattern of the facial patches. In
this paper, we propose an occlusion-guided compact template learning (OGCTL)
approach that only uses the information from visible patches to construct the
compact template. The compact face representation is not sensitive to the
number of patches that are used to construct the facial template and is more
suitable for incorporating the information from different view angles for
image-set based face recognition. Instead of using occlusion masks in face
matching (e.g., DPRFS [38]), the proposed method uses occlusion masks in
template construction and achieves significantly better image-set based face
verification performance on a challenging database with a template size that is
an order of magnitude smaller than DPRFS. Comment: Accepted by the
International Conference on Biometrics (ICB 2019) as an oral presentation
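Occlusion-guided template construction can be sketched as pooling only the features of visible patches, guided by a binary occlusion mask. Average pooling and the function name are illustrative assumptions; OGCTL learns the compact template rather than averaging.

```python
def compact_template(patch_feats, visibility):
    # keep only patches flagged visible (mask value 1); occluded
    # patches contribute nothing to the template
    visible = [f for f, v in zip(patch_feats, visibility) if v]
    if not visible:
        raise ValueError("no visible patch available")
    dim = len(visible[0])
    # average-pool the visible patch features into one compact vector
    return [sum(f[i] for f in visible) / len(visible) for i in range(dim)]
```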
Wasserstein CNN: Learning Invariant Features for NIR-VIS Face Recognition
Heterogeneous face recognition (HFR) aims to match facial images acquired
from different sensing modalities with mission-critical applications in
forensics, security and commercial sectors. However, HFR is a much more
challenging problem than traditional face recognition because of large
intra-class variations of heterogeneous face images and limited training
samples of cross-modality face image pairs. This paper proposes a novel
approach, termed Wasserstein CNN (convolutional neural network, or WCNN for
short), to learn invariant features between near-infrared and visual face images
(i.e. NIR-VIS face recognition). The low-level layers of WCNN are trained with
widely available face images in visual spectrum. The high-level layer is
divided into three parts, i.e., NIR layer, VIS layer and NIR-VIS shared layer.
The first two parts aim to learn modality-specific features, while the NIR-VIS
shared layer is designed to learn a modality-invariant feature subspace. The
Wasserstein distance is introduced into the NIR-VIS shared layer to measure
the dissimilarity between heterogeneous feature distributions, so WCNN
learning aims to minimize the Wasserstein distance between the NIR and VIS
distributions to obtain an invariant deep feature representation of
heterogeneous face images. To avoid the over-fitting problem on small-scale
heterogeneous face data, a correlation prior is introduced on the
fully-connected layers of WCNN network to reduce parameter space. This prior is
implemented by a low-rank constraint in an end-to-end network. The joint
formulation leads to an alternating minimization for deep feature
representation at the training stage and efficient computation for
heterogeneous data at the testing stage. Extensive experiments on three
challenging NIR-VIS face recognition databases demonstrate the significant
superiority of Wasserstein CNN over state-of-the-art methods.
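The role of the Wasserstein distance in the shared layer can be illustrated with its empirical one-dimensional form, the average gap between sorted samples. The paper applies the distance to multivariate feature distributions inside the network, so this closed-form 1-D version is a deliberately simplified sketch.

```python
def wasserstein_1d(samples_a, samples_b):
    # empirical 1-D Wasserstein-1 distance between two equal-size
    # sample sets: mean absolute gap between order statistics
    assert len(samples_a) == len(samples_b)
    return sum(abs(x - y)
               for x, y in zip(sorted(samples_a), sorted(samples_b))) / len(samples_a)
```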
LOAD: Local Orientation Adaptive Descriptor for Texture and Material Classification
In this paper, we propose a novel local feature, called the Local Orientation
Adaptive Descriptor (LOAD), to capture regional texture in an image. In LOAD,
we propose to define point descriptions on an Adaptive Coordinate System
(ACS), adopt a binary sequence descriptor to capture the relationships between
a point and its neighbors, and use a multi-scale strategy to enhance the
discriminative power of the descriptor. The proposed LOAD enjoys not only
discriminative power to capture texture information but also strong robustness
to illumination variation and image rotation. Extensive experiments on benchmark
data sets of texture classification and real-world material recognition show
that the proposed LOAD yields state-of-the-art performance. It is worth
mentioning that we achieve a 65.4\% classification accuracy on the Flickr
Material Database using a single feature, which is, to the best of our
knowledge, the highest record by far. Moreover, by combining LOAD with
features extracted by Convolutional Neural Networks (CNNs), we obtain
significantly better performance than either LOAD or the CNN alone. This
result confirms that LOAD is complementary to learning-based features. Comment: 13 pages, 7 figures
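The binary sequence descriptor can be sketched as an LBP-style code that compares each sampled neighbor with the center point and packs the comparisons into an integer. The adaptive coordinate system and multi-scale pooling that actually distinguish LOAD are omitted, so this is only a simplified illustration.

```python
def binary_descriptor(center, neighbors):
    # bit i is set when neighbor i is at least as bright as the
    # center pixel; the resulting integer is the local texture code
    code = 0
    for i, n in enumerate(neighbors):
        if n >= center:
            code |= 1 << i
    return code
```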