130 research outputs found
Recovering joint and individual components in facial data
A set of images depicting faces with different expressions or at various ages consists of components that are shared across all images (i.e., joint components), which impart to the depicted object the properties of human faces, and individual components that are related to different expressions or age groups. Discovering the common (joint) and individual components in facial images is crucial for applications such as facial expression transfer. The problem is rather challenging when dealing with images that are captured in unconstrained conditions and are thus possibly contaminated by sparse non-Gaussian errors of large magnitude (i.e., sparse gross errors) and contain missing data. In this paper, we investigate the use of a method recently introduced in statistics, the so-called Joint and Individual Variance Explained (JIVE) method, for the robust recovery of joint and individual components in visual facial data consisting of an arbitrary number of views. Since JIVE is not robust to sparse gross errors, we propose alternatives that 1) are robust to sparse gross, non-Gaussian noise, 2) can automatically find the rank of the individual components, and 3) can handle missing data. We demonstrate the effectiveness of the proposed methods in several computer vision applications, namely facial expression synthesis and 2D and 3D face age progression in-the-wild.
Recovering joint and individual components in facial data
A set of images depicting faces with different expressions or at various ages consists of components that are shared across all images (i.e., joint components), imparting to the depicted object the properties of human faces, as well as individual components that are related to different expressions or age groups. Discovering the common (joint) and individual components in facial images is crucial for applications such as facial expression transfer and age progression. The problem is rather challenging when dealing with images that are captured in unconstrained conditions, are contaminated by sparse non-Gaussian errors of large magnitude (i.e., sparse gross errors or outliers), and contain missing data. In this paper, we investigate the use of a method recently introduced in statistics, the so-called Joint and Individual Variance Explained (JIVE) method, for the robust recovery of joint and individual components in visual facial data consisting of an arbitrary number of views. Since JIVE is not robust to sparse gross errors, we propose alternatives that (1) are robust to sparse gross, non-Gaussian noise, (2) can automatically find the rank of the individual components, and (3) can handle missing data. We demonstrate the effectiveness of the proposed methods in several computer vision applications, namely facial expression synthesis and 2D and 3D face age progression ‘in-the-wild’.
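For orientation, the decomposition that the abstract refers to can be written compactly. The following is a sketch of the standard JIVE model in generic notation; the symbols ($X^{(i)}$ for the data matrix of view $i$ with samples as columns, $J^{(i)}$ and $A^{(i)}$ for its joint and individual parts, $E^{(i)}$ for the residual, $r$ and $r^{(i)}$ for the ranks) are assumptions chosen for illustration and do not come from the listing.

```latex
\begin{aligned}
X^{(i)} &= J^{(i)} + A^{(i)} + E^{(i)}, \qquad i = 1, \dots, n, \\
J &= \begin{bmatrix} J^{(1)} \\ \vdots \\ J^{(n)} \end{bmatrix}, \qquad
\operatorname{rank}(J) = r, \qquad
\operatorname{rank}\bigl(A^{(i)}\bigr) = r^{(i)}, \\
J \, {A^{(i)}}^{\top} &= 0 .
\end{aligned}
```

Classical JIVE fits $J$ and $A^{(i)}$ by minimizing the squared residual, which is why a few large outliers can distort the estimate. A robust variant would typically replace the squared error with a sparsity-promoting penalty such as the $\ell_1$ norm of the residual; whether that matches the exact formulation of the proposed methods is not stated in the abstract.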
Multilinear methods for disentangling variations with applications to facial analysis
Several factors contribute to the appearance of an object in a visual scene, including pose,
illumination, and deformation, among others. Each factor accounts for a source of variability
in the data. It is assumed that the multiplicative interactions of these factors emulate the
entangled variability, giving rise to the rich structure of visual object appearance. Disentangling
such unobserved factors from visual data is a challenging task, especially when the data have
been captured in uncontrolled recording conditions (also referred to as “in-the-wild”) and label
information is not available. The work presented in this thesis focuses on disentangling the
variations contained in visual data, in particular applied to 2D and 3D faces. The motivation
behind this work lies in recent developments in the field, such as (i) the creation of large visual
databases for face analysis, (ii) the need to extract information without the use of labels,
and (iii) the need to deploy systems under demanding, real-world conditions.
In the first part of this thesis, we present a method to synthesise plausible 3D expressions
that preserve the identity of a target subject. This method is supervised as the model uses
labels, in this case 3D facial meshes of people performing a defined set of facial expressions, to
learn. The ability to synthesise an entire facial rig from a single neutral expression has a large
range of applications both in computer graphics and computer vision, ranging from the efficient
and cost-effective creation of CG characters to scalable data generation for machine learning
purposes. Unlike previous methods based on multilinear models, the proposed approach is
able to extrapolate well outside the sample pool, which allows it to accurately reproduce
the identity of the target subject and create artefact-free expression shapes while requiring
only a small input dataset. We introduce global-local multilinear models that leverage the
strengths of expression-specific and identity-specific local models combined with coarse motion
estimations from a global model. The expression-specific and identity-specific local models
are built from different slices of the patch-wise local multilinear model. Experimental results
show that we achieve high-quality, identity-preserving facial expression synthesis results that
outperform existing methods both quantitatively and qualitatively.
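To make the notion of a multilinear face model concrete, here is a minimal NumPy sketch of the classic global multilinear synthesis step that such methods build on: a core tensor is contracted with an identity weight vector and an expression weight vector. It does not reproduce the global-local model proposed in the thesis. The names `mode_n_product` and `synthesize_face`, and the tensor layout (vertex coordinates x identity x expression), are assumptions made for illustration.

```python
import numpy as np


def mode_n_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode` (the mode-n product).

    `matrix` has shape (J, tensor.shape[mode]); the result has size J on that axis.
    """
    # tensordot contracts matrix axis 1 with the chosen tensor axis and puts
    # the new axis first; moveaxis returns it to position `mode`.
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)


def synthesize_face(core, w_id, w_exp):
    """Return flattened vertex coordinates for given identity/expression weights.

    core  : array of shape (n_coords, n_id, n_exp)  (assumed layout)
    w_id  : array of shape (n_id,)
    w_exp : array of shape (n_exp,)
    """
    out = mode_n_product(core, w_id[np.newaxis, :], mode=1)   # (n_coords, 1, n_exp)
    out = mode_n_product(out, w_exp[np.newaxis, :], mode=2)   # (n_coords, 1, 1)
    return out[:, 0, 0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy sizes: 3 vertices (9 coordinates), 3 identities, 4 expressions.
    core = rng.standard_normal((9, 3, 4))
    face = synthesize_face(core, np.eye(3)[1], np.eye(4)[0])
    print(face.shape)  # (9,)
```

In this toy layout, one-hot weights simply select a slice of the tensor, while non-trivial weights blend identities and expressions. Because the output is linear in each weight vector separately, the model is called multilinear.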
In the second part of this thesis, we investigate how the modes of variations from visual data
can be extracted. Our assumption is that visual data has an underlying structure consisting of
factors of variation and their interactions. Finding this structure and the factors is important
as it would not only help us to better understand visual data but also, once obtained, allow us to edit the factors for use in various applications. Shape from Shading and expression transfer are just two
of the potential applications. To extract the factors of variation, several supervised methods
have been proposed but they require both labels regarding the modes of variations and the same
number of samples under all modes of variations. Therefore, their applicability is limited to
well-organised data, usually captured in well-controlled conditions. We propose a novel general
multilinear matrix decomposition method that discovers the multilinear structure of possibly
incomplete sets of visual data in an unsupervised setting. We demonstrate the applicability of the
proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the
wild and with occlusion removal), expression transfer, and estimation of surface normals from
images captured in the wild.
Finally, leveraging the unsupervised multilinear method proposed as well as recent advances in
deep learning, we propose a weakly supervised deep learning method for disentangling multiple
latent factors of variation in face images captured in-the-wild. To this end, we propose a deep
latent variable model, where we model the multiplicative interactions of multiple latent factors
of variation explicitly as a multilinear structure. We demonstrate that the proposed approach
indeed learns disentangled representations of facial expressions and pose, which can be used in
various applications, including face editing, as well as 3D face reconstruction and classification
of facial expression, identity and pose.
Open Access
Talking Head(?) Anime from a Single Image 4: Improved Model and Its Distillation
We study the problem of creating a character model that can be controlled in
real time from a single image of an anime character. A solution to this problem
would greatly reduce the cost of creating avatars, computer games, and other
interactive applications.
Talking Head Anime 3 (THA3) is an open source project that attempts to
directly address the problem. It takes as input (1) an image of an anime
character's upper body and (2) a 45-dimensional pose vector and outputs a new
image of the same character taking the specified pose. The range of possible
movements is expressive enough for personal avatars and certain types of game
characters. However, the system is too slow to generate animations in real time
on common PCs, and its image quality can be improved.
In this paper, we improve THA3 in two ways. First, we propose new
architectures for constituent networks that rotate the character's head and
body based on U-Nets with attention that are widely used in modern generative
models. The new architectures consistently yield better image quality than the
THA3 baseline. Nevertheless, they also make the whole system much slower: it
takes up to 150 milliseconds to generate a frame. Second, we propose a
technique to distill the system into a small network (less than 2 MB) that can
generate 512x512 animation frames in real time (under 30 FPS) using consumer
gaming GPUs while keeping the image quality close to that of the full system.
This improvement makes the whole system practical for real-time applications.
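As a quick sanity check on the timing figures quoted in the abstract, the conversion between per-frame latency and frame rate can be worked out directly. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def fps_from_frame_time(frame_time_ms: float) -> float:
    """Frames per second achievable when each frame takes `frame_time_ms`."""
    return 1000.0 / frame_time_ms


def frame_time_from_fps(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Full system: up to 150 ms per frame.
    print(f"{fps_from_frame_time(150.0):.2f} FPS")  # 6.67 FPS
    # Frame rate quoted in the abstract: 30 FPS.
    print(f"{frame_time_from_fps(30.0):.2f} ms")    # 33.33 ms
```

At 150 ms per frame the full system manages fewer than 7 frames per second, whereas 30 FPS leaves roughly 33 ms per frame, which is the gap the distilled network is meant to close.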
Calipso: Physics-based Image and Video Editing through CAD Model Proxies
We present Calipso, an interactive method for editing images and videos in a
physically-coherent manner. Our main idea is to realize physics-based
manipulations by running a full physics simulation on proxy geometries given by
non-rigidly aligned CAD models. Running these simulations allows us to apply
new, unseen forces to move or deform selected objects, change physical
parameters such as mass or elasticity, or even add entire new objects that
interact with the rest of the underlying scene. In Calipso, the user makes
edits directly in 3D; these edits are processed by the simulation and then
transferred to the target 2D content using shape-to-image correspondences in a
photo-realistic rendering process. To align the CAD models, we introduce an
efficient CAD-to-image alignment procedure that jointly minimizes for rigid and
non-rigid alignment while preserving the high-level structure of the input
shape. Moreover, the user can choose to exploit image flow to estimate scene
motion, producing coherent physical behavior with ambient dynamics. We
demonstrate Calipso's physics-based editing on a wide range of examples
producing myriad physical behavior while preserving geometric and visual
consistency.
Comment: 11 pages
Side information in robust principal component analysis: algorithms and applications
Dimensionality reduction and noise removal are fundamental machine learning tasks that are vital to artificial intelligence applications. Principal component analysis has long been utilised in computer vision to achieve the aforementioned goals. Recently, it has been enhanced in terms of robustness to outliers in robust principal component analysis. Both convex and non-convex programs have been developed to solve this new formulation, some with exact convergence guarantees. Its effectiveness can be witnessed in image and video applications ranging from image denoising and alignment to background separation and face recognition. However, robust principal component analysis is by no means perfect. This dissertation identifies its limitations, explores various promising options for improvement and validates the proposed algorithms on both synthetic and real-world datasets.
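For readers unfamiliar with robust principal component analysis, the sketch below shows the widely used convex baseline (principal component pursuit), which splits a matrix into a low-rank part and a sparse part by minimizing a nuclear norm plus a weighted l1 norm with an augmented Lagrangian scheme. It is not the dissertation's algorithm and uses no side information. The function names, the default weight `lam = 1/sqrt(max(rows, cols))`, and the step-size heuristic for `mu` are common choices assumed here for illustration.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise shrinkage: the proximal operator of tau * l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def singular_value_threshold(x, tau):
    """Shrink singular values: the proximal operator of tau * nuclear norm."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt


def robust_pca(m, lam=None, mu=None, n_iter=1000, tol=1e-7):
    """Split `m` into low_rank + sparse (assumes `m` is a nonzero 2-D array)."""
    m = np.asarray(m, dtype=float)
    rows, cols = m.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(rows, cols))       # assumed default weight
    if mu is None:
        mu = rows * cols / (4.0 * np.abs(m).sum())  # assumed step-size heuristic
    low_rank = np.zeros_like(m)
    sparse = np.zeros_like(m)
    dual = np.zeros_like(m)
    norm_m = np.linalg.norm(m)
    for _ in range(n_iter):
        # Alternate the two proximal updates, then take a dual ascent step.
        low_rank = singular_value_threshold(m - sparse + dual / mu, 1.0 / mu)
        sparse = soft_threshold(m - low_rank + dual / mu, lam / mu)
        residual = m - low_rank - sparse
        dual = dual + mu * residual
        if np.linalg.norm(residual) <= tol * norm_m:
            break
    return low_rank, sparse
```

A typical use is background separation: stack video frames as columns of `m`, and the low-rank part approximates the static background while the sparse part captures moving objects.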
Common algorithms approximate the NP-hard formulation of robust principal component analysis with convex envelopes. Though under certain assumptions exact recovery can be guaranteed, the relaxation margin is too big to be squandered. In this work, we propose to apply gradient descent on the Burer-Monteiro bilinear matrix factorisation to squeeze this margin given available subspaces. This non-convex approach improves upon conventional convex approaches both in terms of accuracy and speed. On the other hand, oftentimes there is accompanying side information when an observation is made. The ability to assimilate such auxiliary sources of data can ameliorate the recovery process. In this work, we investigate in depth such possibilities for incorporating side information in restoring the true underlying low-rank component from gross sparse noise. Lastly, tensors, also known as multi-dimensional arrays, represent real-world data more naturally than matrices. It is thus advantageous to adapt robust principal component analysis to tensors. Since there is no exact equivalence between tensor rank and matrix rank, we employ the notions of Tucker rank and CP rank as our optimisation objectives. Overall, this dissertation carefully defines the problems when facing real-world computer vision challenges, extensively and impartially evaluates the state-of-the-art approaches, proposes novel solutions and provides sufficient validations on both simulated data and popular real-world datasets for various mainstream computer vision tasks.
Open Access