Conditional Adversarial Synthesis of 3D Facial Action Units
Employing deep learning-based approaches for fine-grained facial expression
analysis, such as those involving the estimation of Action Unit (AU)
intensities, is difficult due to the lack of a large-scale dataset of real
faces with sufficiently diverse AU labels for training. In this paper, we
consider how AU-level facial image synthesis can be used to substantially
augment such a dataset. We propose an AU synthesis framework that combines the
well-known 3D Morphable Model (3DMM), which intrinsically disentangles
expression parameters from other face attributes, with models that
adversarially generate 3DMM expression parameters conditioned on given target
AU labels, in contrast to the more conventional approach of generating facial
images directly. In this way, we are able to synthesize new combinations of
expression parameters and facial images from desired AU labels. Extensive
quantitative and qualitative results on the benchmark DISFA dataset demonstrate
the effectiveness of our method on 3DMM facial expression parameter synthesis
and data augmentation for deep learning-based AU intensity estimation.
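The conditioning idea above, generating 3DMM expression parameters from target AU labels rather than generating pixels, can be sketched with a toy model. The latent-noise size, the 29-dimensional expression-parameter space, and the two-layer MLP generator are illustrative assumptions, not the paper's architecture; DISFA does annotate 12 AUs.

```python
import numpy as np

NOISE_DIM = 16      # latent noise size (assumed)
NUM_AUS = 12        # DISFA annotates 12 AUs
EXPR_DIM = 29       # a common 3DMM expression-parameter count (assumed)

rng = np.random.default_rng(0)
# Toy generator weights; a real model would train these adversarially
# against a discriminator conditioned on the same AU labels.
W1 = rng.standard_normal((NOISE_DIM + NUM_AUS, 64)) * 0.1
W2 = rng.standard_normal((64, EXPR_DIM)) * 0.1

def generate_expression_params(au_intensities, noise):
    """Map target AU intensities (0-5 scale) plus noise to 3DMM
    expression parameters via a toy two-layer MLP generator."""
    # Conditioning by concatenating normalized AU labels with the noise.
    x = np.concatenate([noise, au_intensities / 5.0])
    h = np.tanh(x @ W1)
    return h @ W2

target_aus = np.zeros(NUM_AUS)
target_aus[3] = 4.0                      # request a strong AU at index 3
z = rng.standard_normal(NOISE_DIM)
params = generate_expression_params(target_aus, z)
print(params.shape)                      # (29,)
```

The synthesized parameters would then drive the 3DMM to render a face image carrying the requested AU configuration, giving new (parameter, image, label) triples for augmentation.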
Unsupervised Learning Facial Parameter Regressor for Action Unit Intensity Estimation via Differentiable Renderer
Facial action unit (AU) intensity is an index that quantifies visually
discernible facial movements. Most existing methods learn an intensity
estimator from limited AU data and therefore generalize poorly beyond the
training dataset. In this paper, we present a framework to predict the facial parameters
(including identity parameters and AU parameters) based on a bone-driven face
model (BDFM) under different views. The proposed framework consists of a
feature extractor, a generator, and a facial parameter regressor. The regressor
can fit the physical meaning parameters of the BDFM from a single face image
with the help of the generator, which maps the facial parameters to the
game-face images as a differentiable renderer. In addition, identity,
loopback, and adversarial losses further improve the regression results.
Quantitative evaluations on two public databases, BP4D and DISFA,
demonstrate that the proposed method achieves performance comparable to or
better than state-of-the-art methods. Moreover, the qualitative
results also demonstrate the validity of our method in the wild.
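The combined training objective of identity, loopback, and adversarial losses can be assembled as a minimal sketch. The squared-error forms, the discriminator-score interface, and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def identity_loss(id_params_a, id_params_b):
    # Two images of the same subject should map to the same identity parameters.
    return np.mean((id_params_a - id_params_b) ** 2)

def loopback_loss(params, params_reregressed):
    # Regressing the generator's rendered game-face image should
    # recover the parameters it was rendered from.
    return np.mean((params - params_reregressed) ** 2)

def adversarial_loss(d_score):
    # Non-saturating GAN loss: push rendered images toward realism.
    return -np.log(d_score + 1e-8)

def total_loss(p_a, p_b, p_loop, d_score, w_id=1.0, w_loop=1.0, w_adv=0.1):
    # Weighted sum of the three losses; the weights are assumptions.
    return (w_id * identity_loss(p_a, p_b)
            + w_loop * loopback_loss(p_a, p_loop)
            + w_adv * adversarial_loss(d_score))

p = np.zeros(40)                      # toy facial-parameter vector
loss = total_loss(p, p, p, d_score=0.9)
```

Because the generator acts as a differentiable renderer, gradients of this objective can flow from rendered images back into the parameter regressor, which is what allows training without dense AU supervision.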
Feature Extraction via Deep Learning for Kansei (Affective State) Estimation
Hiroshima University, Doctor of Engineering
Image-set, Temporal and Spatiotemporal Representations of Videos for Recognizing, Localizing and Quantifying Actions
This dissertation addresses the problem of learning video representations, defined here as transforming a video so that its essential structure becomes more visible or accessible for action recognition and quantification. In the literature, a video can be represented as a set of images, by modeling motion or temporal dynamics, or as a 3D graph with pixels as nodes. This dissertation contributes a set of models to localize, track, segment, recognize and assess actions: (1) image-set models that aggregate subset features given by regularizing normalized CNNs; (2) image-set models via inter-frame principal recovery and sparse coding of residual actions; (3) temporally local models with spatially global motion estimated by robust feature matching and local motion estimated by action detection augmented with a motion model; (4) spatiotemporal models, a 3D graph and a 3D CNN, that model time as a space dimension; and (5) supervised hashing that jointly learns embedding and quantization. State-of-the-art performance is achieved on tasks such as quantifying facial pain and human diving.
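As a concrete illustration of modeling time as a space dimension, a video can be sliced into overlapping fixed-length clips that a 3D CNN consumes as volumes. The clip length and stride below are common 3D-CNN defaults assumed for illustration, not the dissertation's settings.

```python
import numpy as np

def extract_clips(video, clip_len=16, stride=8):
    """Slice a video of shape (T, H, W, C) into overlapping clips of
    shape (N, clip_len, H, W, C), suitable as 3D-CNN input where the
    temporal axis is treated like a third spatial dimension."""
    T = video.shape[0]
    starts = range(0, T - clip_len + 1, stride)
    return np.stack([video[s:s + clip_len] for s in starts])

video = np.zeros((40, 32, 32, 3))   # toy 40-frame video
clips = extract_clips(video)
print(clips.shape)                  # (4, 16, 32, 32, 3)
```

This is consistent with the abstract's point that 3D CNNs work effectively when their inputs are temporally meaningful clips rather than arbitrary frame stacks.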
The primary conclusions of this dissertation are as follows: (i) image sets can capture facial actions as a collective representation; (ii) sparse and low-rank representations can untangle expression, identity and pose cues, and can be learned via an image-set model as well as a linear model; (iii) norm is related to recognizability, and similarity metrics and loss functions matter; (iv) combining the MIL-based boosting tracker with the Particle Filter motion model yields a good trade-off between appearance similarity and motion consistency; (v) segmenting objects locally makes it amenable to assign shape priors, and it is feasible to learn knowledge such as shape priors online from Web data with weak supervision; (vi) representing videos as 3D graphs works locally in both space and time, and 3D CNNs work effectively when fed temporally meaningful clips; (vii) richly labeled images or videos help learn better hash functions, after binary embedding codes are learned, than random projections do. In addition, the models proposed for videos can be adapted to other sequential images, such as volumetric medical images, which are not covered in this dissertation.