52 research outputs found
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
The key idea behind the unsupervised learning of disentangled representations
is that real-world data is generated by a few explanatory factors of variation
which can be recovered by unsupervised learning algorithms. In this paper, we
provide a sober look at recent progress in the field and challenge some common
assumptions. We first theoretically show that the unsupervised learning of
disentangled representations is fundamentally impossible without inductive
biases on both the models and the data. Then, we train more than 12000 models
covering most prominent methods and evaluation metrics in a reproducible
large-scale experimental study on seven different data sets. We observe that
while the different methods successfully enforce properties "encouraged" by
the corresponding losses, well-disentangled models seemingly cannot be
identified without supervision. Furthermore, increased disentanglement does not
seem to lead to a decreased sample complexity of learning for downstream tasks.
Our results suggest that future work on disentanglement learning should be
explicit about the role of inductive biases and (implicit) supervision,
investigate concrete benefits of enforcing disentanglement of the learned
representations, and consider a reproducible experimental setup covering
several data sets.
3D Representation Learning for Shape Reconstruction and Understanding
The real world we live in is inherently composed of multiple 3D objects. However, most existing work in computer vision focuses on either images or videos, where the 3D information is inevitably lost due to camera projection. Traditional methods typically rely on hand-crafted algorithms and features with many constraints and geometric priors to understand the real world. Following the trend of deep learning, however, there has been exponential growth in the number of research works that use deep neural networks to learn 3D representations for complex shapes and scenes. This has led to many cutting-edge applications in augmented reality (AR), virtual reality (VR) and robotics, making it one of the most important directions for computer vision and computer graphics.
This thesis aims to build an intelligent system with dynamic 3D representations that can change over time to understand and recover the real world with semantic, instance and geometric information, and eventually bridge the gap between the real world and the digital world. As a first step towards these challenges, this thesis explores both explicit and implicit representations by addressing existing open problems in these areas. It starts from neural implicit representation learning for 3D scene representation and understanding, and then moves to a parametric-model-based explicit 3D reconstruction method. Extensive experimentation over various benchmarks in various domains demonstrates the superiority of our method over previous state-of-the-art approaches, enabling many applications in the real world. Based on the proposed methods and current observations of open problems, this thesis finally presents a comprehensive conclusion with potential future research directions.
Modeling Caricature Expressions by 3D Blendshape and Dynamic Texture
The problem of deforming an artist-drawn caricature according to a given
normal face expression is of interest in applications such as social media,
animation and entertainment. This paper presents a solution to the problem,
with an emphasis on enhancing the ability to create desired expressions and
meanwhile preserve the identity exaggeration style of the caricature, which
imposes challenges due to the complicated nature of caricatures. The key to our
solution is a novel method to model caricature expression, which extends the
traditional 3DMM representation to the caricature domain. The method consists of
shape modelling and texture generation for caricatures. Geometric optimization
is developed to create identity-preserving blendshapes for reconstructing
accurate and stable geometric shape, and a conditional generative adversarial
network (cGAN) is designed for generating dynamic textures under target
expressions. The combination of the shape and texture components allows the
non-trivial expressions of a caricature to be effectively defined by the extension
of the popular 3DMM representation, so a caricature can be flexibly
deformed into arbitrary expressions with visually good results in both shape
and color spaces. The experiments demonstrate the effectiveness of the proposed
method.
Comment: Accepted by the 28th ACM International Conference on Multimedia (ACM MM 2020)
Model-based occlusion disentanglement for image-to-image translation
Image-to-image translation is affected by entanglement phenomena, which may
occur in case of target data encompassing occlusions such as raindrops, dirt,
etc. Our unsupervised model-based learning disentangles scene and occlusions,
while benefiting from an adversarial pipeline to regress physical parameters of
the occlusion model. The experiments demonstrate our method is able to handle
varying types of occlusions and generate highly realistic translations,
qualitatively and quantitatively outperforming the state-of-the-art on multiple
datasets.
Comment: ECCV 202
Improving Facial Analysis and Performance Driven Animation through Disentangling Identity and Expression
We present techniques for improving performance driven facial animation,
emotion recognition, and facial key-point or landmark prediction using learned
identity invariant representations. Established approaches to these problems
can work well if sufficient examples and labels for a particular identity are
available and factors of variation are highly controlled. However, labeled
examples of facial expressions, emotions and key-points for new individuals are
difficult and costly to obtain. In this paper we improve the ability of
techniques to generalize to new and unseen individuals by explicitly modeling
previously seen variations related to identity and expression. We use a
weakly-supervised approach in which identity labels are used to learn the
different factors of variation linked to identity separately from factors
related to expression. We show how probabilistic modeling of these sources of
variation allows one to learn identity-invariant representations for
expressions which can then be used to identity-normalize various procedures for
facial expression analysis and animation control. We also show how to extend
the widely used techniques of active appearance models and constrained local
models through replacing the underlying point distribution models which are
typically constructed using principal component analysis with
identity-expression factorized representations. We present a wide variety of
experiments in which we consistently improve performance on emotion
recognition, markerless performance-driven facial animation and facial
key-point tracking.
Comment: to appear in Image and Vision Computing Journal (IMAVIS)