Comparator Networks
The objective of this work is set-based verification, e.g. to decide if two
sets of images of a face are of the same person or not. The traditional
approach to this problem is to learn to generate a feature vector per image,
aggregate them into one vector to represent the set, and then compute the
cosine similarity between sets. Instead, we design a neural network
architecture that can directly learn set-wise verification. Our contributions
are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of
sets (each may contain a variable number of images) as inputs, and compute a
similarity between the pair--this involves attending to multiple discriminative
local regions (landmarks), and comparing local descriptors between pairs of
faces; (ii) To encourage high-quality representations for each set, internal
competition is introduced for recalibration based on the landmark score; (iii)
Inspired by image retrieval, a novel hard sample mining regime is proposed to
control the sampling process, such that the DCN is complementary to the
standard image classification models. Evaluations on the IARPA Janus face
recognition benchmarks show that the comparator networks outperform the
previous state-of-the-art results by a large margin. Comment: To appear in ECCV 201
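
For reference, the traditional pipeline described in the abstract above (one feature vector per image, aggregation into a single set vector, cosine similarity between sets) might be sketched as follows. This is a minimal illustration, not the DCN itself. Mean pooling as the aggregation step, the toy 2-D features, and all function names are assumptions introduced here.

```python
import math


def mean_pool(features):
    """Aggregate per-image feature vectors into one set-level vector.

    Averaging is an assumed aggregation; the abstract only says 'aggregate'.
    """
    count = len(features)
    dim = len(features[0])
    return [sum(f[i] for f in features) / count for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(set_a, set_b):
    """Compare two image sets, each of which may hold a different number of vectors."""
    return cosine_similarity(mean_pool(set_a), mean_pool(set_b))


# Hypothetical per-image features; a real system would take these from a CNN.
set_a = [[1.0, 0.0], [0.8, 0.2]]
set_b = [[0.9, 0.1]]
set_c = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]

same_person_score = set_similarity(set_a, set_b)
different_person_score = set_similarity(set_a, set_c)
```

A verification decision would then threshold the score. The comparator network in the abstract replaces this fixed aggregate-then-compare step with a learned set-wise comparison.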
LOMo: Latent Ordinal Model for Facial Analysis in Videos
We study the problem of facial analysis in videos. We propose a novel weakly
supervised learning method that models the video event (expression, pain etc.)
as a sequence of automatically mined, discriminative sub-events (e.g. onset and
offset phase for smile, brow lower and cheek raise for pain). The proposed
model is inspired by the recent works on Multiple Instance Learning and latent
SVM/HCRF; it extends such frameworks to model the ordinal or temporal aspect in
the videos, approximately. We obtain consistent improvements over relevant
competitive baselines on four challenging and publicly available video based
facial analysis datasets for prediction of expression, clinical pain and intent
in dyadic conversations. In combination with complementary features, we report
state-of-the-art results on these datasets. Comment: 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR
One-to-many face recognition with bilinear CNNs
The recent explosive growth in convolutional neural network (CNN) research
has produced a variety of new architectures for deep learning. One intriguing
new architecture is the bilinear CNN (B-CNN), which has shown dramatic
performance gains on certain fine-grained recognition problems [15]. We apply
this new CNN to the challenging new face recognition benchmark, the IARPA Janus
Benchmark A (IJB-A) [12]. It features faces from a large number of identities
in challenging real-world conditions. Because the face images were not
identified automatically using a computerized face detection system, it does
not have the bias inherent in such a database. We demonstrate the performance
of the B-CNN model beginning from an AlexNet-style network pre-trained on
ImageNet. We then show results for fine-tuning using a moderate-sized and
public external database, FaceScrub [17]. We also present results with
additional fine-tuning on the limited training data provided by the protocol.
In each case, the fine-tuned bilinear model shows substantial improvements over
the standard CNN. Finally, we demonstrate how a standard CNN pre-trained on a
large face database, the recently released VGG-Face model [20], can be
converted into a B-CNN without any additional feature training. This B-CNN
improves upon the CNN performance on the IJB-A benchmark, achieving 89.5%
rank-1 recall. Comment: Published version at WACV 201
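
The abstract above does not spell out the bilinear pooling step. Below is a minimal sketch of the commonly described formulation: the outer product of two feature streams at each spatial location, summed over locations, followed by a signed square root and L2 normalisation. The function name, the toy feature maps, and the normalisation details are assumptions for illustration and are not taken from the paper.

```python
import math


def bilinear_pool(feats_a, feats_b):
    """Pool two aligned feature maps into one bilinear descriptor.

    feats_a: list of L location vectors, each of length Ca.
    feats_b: list of L location vectors, each of length Cb.
    Returns a flattened descriptor of length Ca * Cb.
    """
    ca, cb = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (ca * cb)
    # Sum the outer product of the two streams over all spatial locations.
    for fa, fb in zip(feats_a, feats_b):
        for i in range(ca):
            for j in range(cb):
                pooled[i * cb + j] += fa[i] * fb[j]
    # Signed square root dampens large entries (an assumed normalisation).
    pooled = [math.copysign(math.sqrt(abs(x)), x) for x in pooled]
    # L2-normalise so descriptors can be compared with a dot product.
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


# Toy feature map: two spatial locations, two channels.
# Passing the same map twice gives a symmetric bilinear descriptor.
feature_map = [[1.0, 0.0], [0.0, 2.0]]
descriptor = bilinear_pool(feature_map, feature_map)
```

Because this pooling has no trainable parameters, it can be applied to the outputs of an already trained network, which is consistent with the abstract's remark about converting a standard CNN without additional feature training.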
Discriminatively Trained Latent Ordinal Model for Video Classification
We study the problem of video classification for facial analysis and human
action recognition. We propose a novel weakly supervised learning method that
models the video as a sequence of automatically mined, discriminative
sub-events (e.g. onset and offset phase for "smile", running and jumping for
"highjump"). The proposed model is inspired by the recent works on Multiple
Instance Learning and latent SVM/HCRF -- it extends such frameworks to model
the ordinal aspect in the videos, approximately. We obtain consistent
improvements over relevant competitive baselines on four challenging and
publicly available video based facial analysis datasets for prediction of
expression, clinical pain and intent in dyadic conversations and on three
challenging human action datasets. We also validate the method with qualitative
results and show that they largely support the intuitions behind the method. Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text
overlap with arXiv:1604.0150
Improving Landmark Localization with Semi-Supervised Learning
We present two techniques to improve landmark localization in images from
partially annotated datasets. Our primary goal is to leverage the common
situation where precise landmark locations are only provided for a small data
subset, but where class labels for classification or regression tasks related
to the landmarks are more abundantly available. First, we propose the framework
of sequential multitasking and explore it here through an architecture for
landmark localization where training with class labels acts as an auxiliary
signal to guide the landmark localization on unlabeled data. A key aspect of
our approach is that errors can be backpropagated through a complete landmark
localization model. Second, we propose and explore an unsupervised learning
technique for landmark localization based on having a model predict equivariant
landmarks with respect to transformations applied to the image. We show that
these techniques improve landmark prediction considerably and can learn
effective detectors even when only a small fraction of the dataset has landmark
labels. We present results on two toy datasets and four real datasets, with
hands and faces, and report new state-of-the-art on two datasets in the wild,
e.g. with only 5% of labeled images we outperform previous state-of-the-art
trained on the AFLW dataset. Comment: Published as a conference paper in CVPR 201
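
The unsupervised technique in the abstract above asks the model to predict landmarks that are equivariant to image transformations. One way to read this as a training signal is sketched below, assuming an affine transformation and a mean squared distance penalty. The function names and the choice of penalty are illustrative assumptions, not the paper's exact loss.

```python
def apply_affine(points, matrix, offset):
    """Apply a 2-D affine map (2x2 matrix plus offset) to a list of (x, y) points."""
    return [
        (
            matrix[0][0] * x + matrix[0][1] * y + offset[0],
            matrix[1][0] * x + matrix[1][1] * y + offset[1],
        )
        for x, y in points
    ]


def equivariance_loss(landmarks_original, landmarks_transformed, matrix, offset):
    """Mean squared distance between T(landmarks(image)) and landmarks(T(image)).

    landmarks_original: predictions on the untransformed image.
    landmarks_transformed: predictions on the transformed image.
    The loss is zero when the predictions move exactly with the image.
    """
    expected = apply_affine(landmarks_original, matrix, offset)
    total = sum(
        (ex - px) ** 2 + (ey - py) ** 2
        for (ex, ey), (px, py) in zip(expected, landmarks_transformed)
    )
    return total / len(expected)


# Hypothetical predictions under a 90-degree rotation about the origin.
rotation = [[0.0, -1.0], [1.0, 0.0]]
no_shift = (0.0, 0.0)
predicted_on_original = [(1.0, 0.0), (0.0, 2.0)]

consistent = [(0.0, 1.0), (-2.0, 0.0)]    # landmarks moved with the image
inconsistent = [(0.0, 1.0), (-2.0, 1.0)]  # second landmark is off by one unit

loss_consistent = equivariance_loss(predicted_on_original, consistent, rotation, no_shift)
loss_inconsistent = equivariance_loss(predicted_on_original, inconsistent, rotation, no_shift)
```

No landmark labels appear anywhere in this loss, which is why it can be evaluated on the unlabeled portion of a partially annotated dataset.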
Multilinear methods for disentangling variations with applications to facial analysis
Several factors contribute to the appearance of an object in a visual scene, including pose,
illumination, and deformation, among others. Each factor accounts for a source of variability
in the data. It is assumed that the multiplicative interactions of these factors emulate the
entangled variability, giving rise to the rich structure of visual object appearance. Disentangling
such unobserved factors from visual data is a challenging task, especially when the data have
been captured in uncontrolled recording conditions (also referred to as “in-the-wild”) and label
information is not available. The work presented in this thesis focuses on disentangling the
variations contained in visual data, in particular applied to 2D and 3D faces. The motivation
behind this work lies in recent developments in the field, such as (i) the creation of large, visual
databases for face analysis, with (ii) the need of extracting information without the use of labels
and (iii) the need to deploy systems under demanding, real-world conditions.
In the first part of this thesis, we present a method to synthesise plausible 3D expressions
that preserve the identity of a target subject. This method is supervised as the model uses
labels, in this case 3D facial meshes of people performing a defined set of facial expressions, to
learn. The ability to synthesise an entire facial rig from a single neutral expression has a large
range of applications both in computer graphics and computer vision, ranging from the efficient
and cost-effective creation of CG characters to scalable data generation for machine learning
purposes. Unlike previous methods based on multilinear models, the proposed approach is
capable of extrapolating well outside the sample pool, which allows it to accurately reproduce
the identity of the target subject and create artefact-free expression shapes while requiring
only a small input dataset. We introduce global-local multilinear models that leverage the
strengths of expression-specific and identity-specific local models combined with coarse motion
estimations from a global model. The expression-specific and identity-specific local models
are built from different slices of the patch-wise local multilinear model. Experimental results
show that we achieve high-quality, identity-preserving facial expression synthesis results that
outperform existing methods both quantitatively and qualitatively.
In the second part of this thesis, we investigate how the modes of variations from visual data
can be extracted. Our assumption is that visual data has an underlying structure consisting of
factors of variation and their interactions. Finding this structure and the factors is important
as it would not only help us better understand visual data but, once obtained, would also let us
edit the factors for use in various applications. Shape from Shading and expression transfer are just two
of the potential applications. To extract the factors of variation, several supervised methods
have been proposed but they require both labels regarding the modes of variations and the same
number of samples under all modes of variations. Therefore, their applicability is limited to
well-organised data, usually captured in well-controlled conditions. We propose a novel general
multilinear matrix decomposition method that discovers the multilinear structure of possibly
incomplete sets of visual data in an unsupervised setting. We demonstrate the applicability of the
proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the
wild and with occlusion removal), expression transfer, and estimation of surface normals from
images captured in the wild.
Finally, leveraging the unsupervised multilinear method proposed as well as recent advances in
deep learning, we propose a weakly supervised deep learning method for disentangling multiple
latent factors of variation in face images captured in-the-wild. To this end, we propose a deep
latent variable model, where we model the multiplicative interactions of multiple latent factors
of variation explicitly as a multilinear structure. We demonstrate that the proposed approach
indeed learns disentangled representations of facial expressions and pose, which can be used in
various applications, including face editing, as well as 3D face reconstruction and classification
of facial expression, identity and pose.
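
The multiplicative interaction of factors that this abstract refers to can be illustrated with a two-factor (bilinear) model, in which each output coordinate is a weighted sum of products of one identity coefficient and one expression coefficient. The sketch below is a generic illustration under that assumption. The core tensor values, the factor names, and the function name are hypothetical and are not taken from the thesis.

```python
def bilinear_decode(core, identity, expression):
    """Combine two latent factors multiplicatively through a core tensor.

    core[d][i][j] weights the product identity[i] * expression[j]
    for output coordinate d, so
    x[d] = sum over i, j of core[d][i][j] * identity[i] * expression[j].
    """
    return [
        sum(
            core_d[i][j] * identity[i] * expression[j]
            for i in range(len(identity))
            for j in range(len(expression))
        )
        for core_d in core
    ]


# Hypothetical core tensor with 2 outputs, 2 identity and 2 expression coefficients.
core = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
identity = [1.0, 2.0]
expression = [3.0, 4.0]

shape = bilinear_decode(core, identity, expression)

# Holding expression fixed, the output is linear in identity (and vice versa).
doubled_identity = bilinear_decode(core, [2.0 * v for v in identity], expression)
```

Editing one factor while holding the other fixed, as in the last line, is the kind of operation that expression transfer or face editing would build on once the factors have been disentangled.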
Human metrology for person classification and recognition
Human metrological features generally refer to geometric measurements extracted from humans, such as height, chest circumference or foot length. Human metrology provides an important soft biometric that can be used in challenging situations, such as person classification and recognition at a distance, where hard biometric traits such as fingerprints and iris information cannot easily be acquired. In this work, we first study the question of predictability and correlation in human metrology. We show that partial or available measurements can be used to predict other missing measurements. We then investigate the use of human metrology for the prediction of other soft biometrics, viz. gender and weight. The experimental results based on our proposed copula-based model suggest that human body metrology contains enough information for reliable prediction of gender and weight. Also, the proposed copula-based technique is observed to reduce the impact of noise on prediction performance. We then study the question of whether face metrology can be exploited for reliable gender prediction. A new method based solely on metrological information from facial landmarks is developed. The performance of the proposed metrology-based method is compared with that of a state-of-the-art appearance-based method for gender classification. Results on several face databases show that the metrology-based approach achieved accuracy comparable to that of the appearance-based method. Furthermore, we study the question of person recognition (classification and identification) via whole body metrology. Using the CAESAR 1D database as a baseline, we simulate intra-class variation with various noise models. The experimental results indicate that, given a sufficient number of features, our metrology-based recognition system can have promising performance that is comparable to several recent state-of-the-art recognition systems.
We propose a non-parametric feature selection methodology, called adapted k-nearest neighbor estimator, which does not rely on intra-class distribution of the query set. This leads to improved results over other nearest neighbor estimators (as feature selection criteria) for a moderate number of features. Finally, we quantify the discrimination capability of human metrology, from both individuality and capacity perspectives. Generally, a biometric-based recognition technique relies on an assumption that the given biometric is unique to an individual. However, the validity of this assumption is not yet generally confirmed for most soft biometrics, such as human metrology. In this work, we first develop two schemes that can be used to quantify the individuality of a given soft-biometric system. Then, a Poisson channel model is proposed to analyze the recognition capacity of human metrology. Our study suggests that the performance of such a system depends more on the accuracy of the ground truth or training set
Evaluating soft biometrics in the context of face recognition
2013 Summer. Includes bibliographical references. Soft biometrics typically refer to attributes of people such as their gender, the shape of their head, the color of their hair, etc. There is growing interest in soft biometrics as a means of improving automated face recognition since they hold the promise of significantly reducing recognition errors, in part by ruling out illogical choices. Here four experiments quantify performance gains on a difficult face recognition task when standard face recognition algorithms are augmented using information associated with soft biometrics. These experiments include a best-case analysis using perfect knowledge of gender and race, support vector machine-based soft biometric classifiers, face shape expressed through an active shape model, and finally appearance information from the image region directly surrounding the face. All four experiments indicate small improvements may be made when soft biometrics augment an existing algorithm. However, in all cases, the gains were modest. In the context of face recognition, empirical evidence suggests that significant gains using soft biometrics are hard to come by