LOMo: Latent Ordinal Model for Facial Analysis in Videos
We study the problem of facial analysis in videos. We propose a novel weakly
supervised learning method that models the video event (expression, pain etc.)
as a sequence of automatically mined, discriminative sub-events (e.g., onset and
offset phases for smile; brow lowering and cheek raising for pain). The proposed
model is inspired by recent work on Multiple Instance Learning and latent
SVM/HCRF; it extends such frameworks to approximately model the ordinal or
temporal aspect of the videos. We obtain consistent improvements over relevant
competitive baselines on four challenging and publicly available video based
facial analysis datasets for prediction of expression, clinical pain and intent
in dyadic conversations. In combination with complementary features, we report
state-of-the-art results on these datasets.
Comment: 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR
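The idea of scoring a video by an ordered sequence of sub-events can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the LOMo model itself: it assumes each frame is a plain feature vector, each sub-event is a linear template, and the ordering is enforced as a hard constraint (template k must fire strictly before template k+1). The names `dot` and `best_ordered_placement` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_ordered_placement(
    frames: Sequence[Sequence[float]],
    templates: Sequence[Sequence[float]],
) -> Tuple[float, List[int]]:
    """Maximise sum_k <templates[k], frames[t_k]> over t_1 < t_2 < ... < t_K.

    Hypothetical illustration: the hard ordering constraint and linear
    templates are assumptions made here, not details from the abstract.
    """
    T, K = len(frames), len(templates)
    if K == 0 or K > T:
        raise ValueError("need 1 <= number of templates <= number of frames")

    # resp[k][t]: response of template k on frame t.
    resp = [[dot(w, x) for x in frames] for w in templates]

    neg_inf = float("-inf")
    # score[k][t]: best total with templates 0..k, template k placed at frame t.
    score = [[neg_inf] * T for _ in range(K)]
    back = [[-1] * T for _ in range(K)]
    score[0] = resp[0][:]

    for k in range(1, K):
        best_prev, best_idx = neg_inf, -1
        for t in range(1, T):
            # Running maximum of score[k-1][s] over s < t.
            if score[k - 1][t - 1] > best_prev:
                best_prev, best_idx = score[k - 1][t - 1], t - 1
            if best_prev > neg_inf:
                score[k][t] = resp[k][t] + best_prev
                back[k][t] = best_idx

    # Recover the latent frame positions by backtracking.
    t = max(range(T), key=lambda i: score[K - 1][i])
    total = score[K - 1][t]
    positions = [t]
    for k in range(K - 1, 0, -1):
        t = back[k][t]
        positions.append(t)
    positions.reverse()
    return total, positions
```

In a weakly supervised setting, the recovered positions would play the role of latent variables: only the video-level label is given, and the frames at which the sub-events occur are inferred.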
Discriminatively Trained Latent Ordinal Model for Video Classification
We study the problem of video classification for facial analysis and human
action recognition. We propose a novel weakly supervised learning method that
models the video as a sequence of automatically mined, discriminative
sub-events (e.g., onset and offset phases for "smile", running and jumping for
"highjump"). The proposed model is inspired by recent work on Multiple
Instance Learning and latent SVM/HCRF; it extends such frameworks to
approximately model the ordinal aspect of the videos. We obtain consistent
improvements over relevant competitive baselines on four challenging and
publicly available video based facial analysis datasets for prediction of
expression, clinical pain and intent in dyadic conversations and on three
challenging human action datasets. We also validate the method with qualitative
results and show that they largely support the intuitions behind the method.
Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text
overlap with arXiv:1604.0150
Human-controllable and structured deep generative models
Deep generative models are a class of probabilistic models that attempt to learn the underlying data distribution. These models are usually trained in an unsupervised way and thus do not require any labels. Generative models such as Variational Autoencoders and Generative Adversarial Networks have made astounding progress in recent years. These models have several benefits: eased sampling and evaluation, efficient learning of low-dimensional representations for downstream tasks, and better understanding through interpretable representations. However, even though the quality of these models has improved immensely, the ability to control their style and structure is limited. Structured and human-controllable representations of generative models are essential for human-machine interaction and other applications, including fairness, creativity, and entertainment. This thesis investigates learning human-controllable and structured representations with deep generative models. In particular, we focus on generative modelling of 2D images.
For the first part, we focus on learning clustered representations. We propose semi-parametric hierarchical variational autoencoders to estimate the intensity of facial action units. The semi-parametric model forms a hybrid generative-discriminative model and leverages both a parametric Variational Autoencoder and a non-parametric Gaussian Process autoencoder. We show superior performance in comparison with existing facial action unit estimation approaches. Based on the results and analysis of the learned representation, we focus on learning Mixture-of-Gaussians representations in an autoencoding framework. We deviate from the conventional autoencoding framework and consider a regularized objective with the Cauchy-Schwarz divergence. The Cauchy-Schwarz divergence admits a closed-form solution for Mixture-of-Gaussians distributions and thus allows the autoencoding objective to be optimized efficiently. We show that our model outperforms existing Variational Autoencoders in density estimation, clustering, and semi-supervised facial action detection.
For the second part, we focus on learning disentangled representations for conditional generation and fair facial attribute classification. Conditional image generation relies on access to large-scale annotated datasets. Nevertheless, the geometry of visual objects, such as faces, cannot be learned implicitly, and this deteriorates image fidelity. We propose incorporating facial landmarks with a statistical shape model and a differentiable piecewise affine transformation to separate the representations of appearance and shape. The goal of incorporating facial landmarks is that generation is controlled and can separate different appearances and geometries. In our last work, we use weak supervision for disentangling groups of variations. Work on learning disentangled representations has been done in an unsupervised fashion. However, recent works have shown that learning disentangled representations is not identifiable without any inductive biases. Since then, there has been a shift towards weakly-supervised disentanglement learning. We investigate using regularization based on the Kullback-Leibler divergence to disentangle groups of variations. The goal is to have consistent and separated subspaces for different groups, e.g., for content-style learning. Our evaluation shows increased disentanglement abilities and competitive performance for image clustering and fair facial attribute classification with weak supervision compared to supervised and semi-supervised approaches.
Open Acces
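The closed-form property of the Cauchy-Schwarz divergence mentioned above can be made concrete with a minimal sketch. It uses the standard definition D_CS(p, q) = -log( ∫pq / sqrt(∫p² ∫q²) ) together with the Gaussian product identity ∫ N(x; m1, v1) N(x; m2, v2) dx = N(m1; m2, v1 + v2). The restriction to one-dimensional mixtures, the `(weight, mean, variance)` tuple layout, and the function names are assumptions made for illustration; the thesis's actual objective is not given in the abstract.

```python
import math
from typing import List, Tuple

# A 1-D Gaussian mixture as a list of (weight, mean, variance) components.
Mixture = List[Tuple[float, float, float]]


def gauss_pdf(x: float, mean: float, var: float) -> float:
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_overlap(p: Mixture, q: Mixture) -> float:
    """Closed-form integral of p(x) * q(x) over the real line.

    Each pair of components contributes w_p * w_q * N(m_p; m_q, v_p + v_q).
    """
    return sum(
        wp * wq * gauss_pdf(mp, mq, vp + vq)
        for wp, mp, vp in p
        for wq, mq, vq in q
    )


def cauchy_schwarz_divergence(p: Mixture, q: Mixture) -> float:
    """D_CS(p, q) = -log( int(pq) / sqrt(int(p^2) * int(q^2)) )."""
    pq = mixture_overlap(p, q)
    pp = mixture_overlap(p, p)
    qq = mixture_overlap(q, q)
    return -math.log(pq / math.sqrt(pp * qq))
```

No sampling or numerical integration is needed: every term is a Gaussian density evaluation, which is what makes this divergence convenient for mixture-shaped latent distributions. The Kullback-Leibler divergence between two mixtures, by contrast, has no closed form in general.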
DeepCoder: Semi-parametric Variational Autoencoders for Automatic Facial Action Coding
The human face exhibits an inherent hierarchy in its representations (i.e.,
holistic facial expressions can be encoded via a set of facial action units
(AUs) and their intensity). Variational (deep) auto-encoders (VAE) have shown
great results in unsupervised extraction of hierarchical latent representations
from large amounts of image data, while being robust to noise and other
undesired artifacts. Potentially, this makes VAEs a suitable approach for
learning facial features for AU intensity estimation. Yet, most existing
VAE-based methods apply classifiers learned separately from the encoded
features. By contrast, the non-parametric (probabilistic) approaches, such as
Gaussian Processes (GPs), typically outperform their parametric counterparts,
but cannot deal easily with large amounts of data. To this end, we propose a
novel VAE semi-parametric modeling framework, named DeepCoder, which combines
the modeling power of parametric (convolutional) and nonparametric (ordinal
GPs) VAEs, for joint learning of (1) latent representations at multiple levels
in a task hierarchy, and (2) classification of multiple ordinal outputs. We
show on benchmark datasets for AU intensity estimation that the proposed
DeepCoder outperforms the state-of-the-art approaches, and related VAEs and
deep learning models.
Comment: ICCV 2017 - accepte
Most Likely Separation of Intensity and Warping Effects in Image Registration
This paper introduces a class of mixed-effects models for joint modeling of
spatially correlated intensity variation and warping variation in 2D images.
Spatially correlated intensity variation and warp variation are modeled as
random effects, resulting in a nonlinear mixed-effects model that enables
simultaneous estimation of template and model parameters by optimization of the
likelihood function. We propose an algorithm for fitting the model which
alternates estimation of variance parameters and image registration. This
approach avoids the potential estimation bias in the template estimate that
arises when treating registration as a preprocessing step. We apply the model
to datasets of facial images and 2D brain magnetic resonance images to
illustrate the simultaneous estimation and prediction of intensity and warp
effects.
LibreFace: An Open-Source Toolkit for Deep Facial Expression Analysis
Facial expression analysis is an important tool for human-computer
interaction. In this paper, we introduce LibreFace, an open-source toolkit for
facial expression analysis. This open-source toolbox offers real-time and
offline analysis of facial behavior through deep learning models, including
facial action unit (AU) detection, AU intensity estimation, and facial
expression recognition. To accomplish this, we employ several techniques,
including the utilization of a large-scale pre-trained network, feature-wise
knowledge distillation, and task-specific fine-tuning. These approaches are
designed to effectively and accurately analyze facial expressions by leveraging
visual information, thereby facilitating the implementation of real-time
interactive applications. In terms of Action Unit (AU) intensity estimation, we
achieve a Pearson Correlation Coefficient (PCC) of 0.63 on DISFA, which is 7%
higher than that of OpenFace 2.0, while maintaining highly efficient
inference that runs two times faster than OpenFace 2.0. Despite being compact,
our model also performs competitively with state-of-the-art facial
expression analysis methods on AffectNet, FFHQ, and RAF-DB. Our code will be
released at https://github.com/ihp-lab/LibreFace
Comment: 10 pages, 5 figures. Accepted by WACV 2024 Round 1. (Application
Track
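For readers unfamiliar with the metric quoted above, the Pearson Correlation Coefficient between predicted and annotated AU intensities can be computed as in the sketch below. This is a generic implementation of the standard formula, not code from LibreFace; the function name `pearson_correlation` and the list-based inputs are assumptions for illustration.

```python
import math
from typing import Sequence


def pearson_correlation(pred: Sequence[float], target: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences.

    Computed as cov(pred, target) / (std(pred) * std(target)); the 1/n
    factors cancel, so plain sums of centred products are used.
    """
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")

    mean_p = sum(pred) / n
    mean_t = sum(target) / n

    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(pred, target))
    var_p = sum((a - mean_p) ** 2 for a in pred)
    var_t = sum((b - mean_t) ** 2 for b in target)

    if var_p == 0.0 or var_t == 0.0:
        # The coefficient is undefined when either sequence is constant.
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_p * var_t)
```

The coefficient is invariant to shifting and positive rescaling of either sequence, so it rewards predictions that track the trend of the annotated intensities even when their absolute scale is off.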