
    CNN-based Real-time Dense Face Reconstruction with Inverse-rendered Photo-realistic Face Images

    With the power of convolutional neural networks (CNNs), CNN-based face reconstruction has recently shown promising performance in reconstructing detailed face shape from 2D face images. The success of CNN-based methods relies on a large amount of labeled data. The state of the art synthesizes such data using a coarse morphable face model, which, however, has difficulty generating detailed, photo-realistic face images (with wrinkles). This paper presents a novel face data generation method. Specifically, we render a large number of photo-realistic face images with different attributes based on inverse rendering. Furthermore, we construct a fine-detailed face image dataset by transferring different scales of details from one image to another. We also construct a large number of video-type adjacent frame pairs by simulating the distribution of real video data. With these carefully constructed datasets, we propose a coarse-to-fine learning framework consisting of three convolutional networks. The networks are trained for real-time detailed 3D face reconstruction from monocular video as well as from a single image. Extensive experimental results demonstrate that our framework can produce high-quality reconstructions with much less computation time than the state of the art. Moreover, our method is robust to pose, expression and lighting due to the diversity of the data. Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence, 201
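
As a rough illustration of the kind of coarse-to-fine cascade the abstract describes, the sketch below pairs a coarse network that regresses morphable-model coefficients with a fine network that predicts a per-pixel displacement map; all layer sizes, names, and the coefficient count are assumptions, and the paper's third, video-oriented network is omitted.

```python
# Minimal sketch (assumptions only, not the paper's architecture): a coarse
# regressor for 3DMM coefficients plus a fine network that refines the coarse
# depth with a per-pixel displacement (wrinkle) map.
import torch
import torch.nn as nn

class CoarseNet(nn.Module):
    """Regress coarse 3DMM shape/expression/pose coefficients from an RGB crop."""
    def __init__(self, n_coeffs=185):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_coeffs)

    def forward(self, img):                       # img: (B, 3, H, W)
        return self.head(self.backbone(img))

class FineNet(nn.Module):
    """Refine the coarse depth with a per-pixel displacement map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),   # RGB + coarse depth
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, img, coarse_depth):         # coarse_depth: (B, 1, H, W)
        return coarse_depth + self.net(torch.cat([img, coarse_depth], dim=1))
```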

    Neural Face Editing with Intrinsic Image Disentangling

    Traditional face editing methods often require a number of sophisticated and task-specific algorithms to be applied one after the other, a process that is tedious, fragile, and computationally intensive. In this paper, we propose an end-to-end generative adversarial network that infers a face-specific disentangled representation of intrinsic face properties, including shape (i.e. normals), albedo, lighting, and an alpha matte. We show that this network can be trained on "in-the-wild" images by incorporating an in-network physically-based image formation module and appropriate loss functions. Our disentangled latent representation allows for semantically relevant edits, where one aspect of facial appearance can be manipulated while keeping orthogonal properties fixed, and we demonstrate its use for a number of facial editing applications. Comment: CVPR 2017 oral
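
To make the idea of an in-network, physically-based image formation module concrete, here is a minimal, hedged sketch of Lambertian shading under second-order spherical-harmonic lighting with alpha-matte compositing; the array shapes, function names, and SH convention are assumptions rather than the paper's exact implementation.

```python
# Minimal sketch: shading from per-pixel normals and order-2 SH lighting,
# multiplied by albedo and composited over a background with an alpha matte.
import numpy as np

def sh_basis(normals):
    """Order-2 real SH basis (9 terms, normalization constants omitted); normals: (H, W, 3)."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        np.ones_like(x), x, y, z,
        x * y, x * z, y * z,
        x**2 - y**2, 3 * z**2 - 1,
    ], axis=-1)

def render_face(albedo, normals, light_coeffs, matte, background):
    """albedo, background: (H, W, 3); normals: (H, W, 3); light_coeffs: (9,); matte: (H, W, 1)."""
    shading = sh_basis(normals) @ light_coeffs          # per-pixel irradiance, (H, W)
    face = albedo * shading[..., None]                  # Lambertian shading
    return matte * face + (1.0 - matte) * background    # alpha compositing
```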

    From 3D Point Clouds to Pose-Normalised Depth Maps

    We consider the problem of generating either pairwise-aligned or pose-normalised depth maps from noisy 3D point clouds in relatively unrestricted poses. Our system is deployed in a 3D face alignment application and consists of the following four stages: (i) data filtering, (ii) nose tip identification and sub-vertex localisation, (iii) computation of the (relative) face orientation, and (iv) generation of either a pose-aligned or a pose-normalised depth map. We generate an implicit radial basis function (RBF) model of the facial surface, and this is employed within all four stages of the process. For example, in stage (ii), construction of novel invariant features is based on sampling this RBF over a set of concentric spheres to give a spherically-sampled RBF (SSR) shape histogram. In stage (iii), a second novel descriptor, called an isoradius contour curvature signal, is defined, which allows rotational alignment to be determined using a simple process of 1D correlation. We test our system on both the University of York (UoY) 3D face dataset and the Face Recognition Grand Challenge (FRGC) 3D data. For the more challenging UoY data, our SSR descriptors significantly outperform three variants of spin images, successfully identifying nose vertices at a rate of 99.6%. Nose localisation performance on the higher-quality FRGC data, which has only small pose variations, is 99.9%. Our best system successfully normalises the pose of 3D faces at rates of 99.1% (UoY data) and 99.6% (FRGC data).
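
The SSR descriptor idea can be illustrated with a small sketch: sample a precomputed implicit RBF surface function on concentric spheres around a candidate vertex and histogram how much of each sphere lies inside the surface. This is an assumption-level illustration, not the authors' exact construction; `rbf_value`, the radii, and the sampling scheme are placeholders.

```python
# Minimal sketch of an SSR-style descriptor: one inside-fraction bin per sphere radius.
import numpy as np

def ssr_histogram(rbf_value, center, radii, n_samples=256, rng=None):
    """rbf_value: callable mapping (N, 3) points to implicit values (<0 inside)."""
    rng = rng or np.random.default_rng(0)
    dirs = rng.normal(size=(n_samples, 3))            # uniform directions on the sphere
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    hist = []
    for r in radii:
        samples = center + r * dirs                   # points on a sphere of radius r
        values = rbf_value(samples)
        hist.append(np.mean(values < 0.0))            # fraction of this sphere inside the surface
    return np.asarray(hist)
```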

    Phenomenological modeling of image irradiance for non-Lambertian surfaces under natural illumination.

    Various vision tasks are usually confronted by appearance variations due to changes of illumination. For instance, in a recognition system, it has been shown that the variability in human face appearance owes more to changes in lighting conditions than to the person's identity. Theoretically, due to the arbitrariness of the lighting function, the space of all possible images of a fixed-pose object under all possible illumination conditions is infinite-dimensional. Nonetheless, it has been proven that the set of images of a convex Lambertian surface under distant illumination lies near a low-dimensional linear subspace. This result was also extended to include non-Lambertian objects with non-convex geometry. As such, vision applications concerned with the recovery of illumination, reflectance or surface geometry from images would benefit from a low-dimensional generative model which captures appearance variations w.r.t. illumination conditions and surface reflectance properties. This enables the formulation of such inverse problems as parameter estimation. Typically, subspace construction boils down to performing a dimensionality reduction scheme, e.g. Principal Component Analysis (PCA), on a large set of (real/synthesized) images of the object(s) of interest with fixed pose but different illumination conditions. However, this approach has two major problems. First, the acquired/rendered image ensemble should be statistically significant vis-a-vis capturing the full behavior of the sources of variation that are of interest, in particular illumination and reflectance. Second, the curse of dimensionality hinders numerical methods such as Singular Value Decomposition (SVD), which become intractable especially with a large number of large-sized realizations in the image ensemble. One way to bypass the need for a large image ensemble is to construct appearance subspaces using phenomenological models which capture appearance variations through mathematical abstraction of the reflection process. In particular, the harmonic expansion of the image irradiance equation can be used to derive an analytic subspace to represent images under fixed pose but different illumination conditions, where the image irradiance equation has been formulated in a convolution framework. Due to their low-frequency nature, irradiance signals can be represented using low-order basis functions, where Spherical Harmonics (SH) have been extensively adopted. Typically, an ideal solution to the image irradiance (appearance) modeling problem should be able to incorporate complex illumination and cast shadows as well as realistic surface reflectance properties, while moving away from the simplifying assumptions of Lambertian reflectance and single-source distant illumination. By handling arbitrary complex illumination and non-Lambertian reflectance, the appearance model proposed in this dissertation moves the state of the art closer to the ideal solution. This work primarily addresses the geometrical compliance of the hemispherical basis for representing surface reflectance while presenting a compact, yet accurate representation for arbitrary materials. To maintain the plausibility of the resulting appearance, the proposed basis is constructed in a manner that satisfies the Helmholtz reciprocity property while avoiding high computational complexity.
It is believed that representing the illumination and the surface reflectance in the spherical and hemispherical domains, respectively, while complying with the physical properties of surface reflectance, provides better approximation accuracy of image irradiance than a representation purely in the spherical domain. Discounting subsurface scattering and surface emittance, this work proposes a surface reflectance basis, based on hemispherical harmonics (HSH), defined on the Cartesian product of the incoming and outgoing local hemispheres (i.e. w.r.t. surface points). This basis obeys the physical properties of surface reflectance involving reciprocity and energy conservation. The basis functions are validated using analytical reflectance models as well as scattered reflectance measurements, which might violate the Helmholtz reciprocity property (this can be filtered out by projecting them onto the subspace spanned by the proposed basis, where the reciprocity property is preserved in the least-squares sense). The image formation process of isotropic surfaces under arbitrary distant illumination is also formulated in frequency space, where the orthogonality relation between illumination and reflectance bases is encoded in what is termed irradiance harmonics. Such harmonics decouple the effect of illumination and reflectance from the underlying pose and geometry. Further, a bilinear approach to analytically construct the irradiance subspace is proposed in order to tackle the inherent problems of small sample size and the curse of dimensionality. The process of finding the analytic subspace is posed as establishing a relation between its principal components and those of the irradiance harmonics basis functions. It is also shown how to incorporate prior information about natural illumination and real-world surface reflectance characteristics in order to capture the full behavior of complex illumination and non-Lambertian reflectance. The use of the presented theoretical framework to develop practical algorithms for shape recovery is further presented, where the hitherto assumed Lambertian assumption is relaxed. With a single image under unknown general illumination, the underlying geometrical structure can be recovered while accounting explicitly for object reflectance characteristics (e.g. human skin types for facial images and teeth reflectance for human jaw reconstruction) as well as complex illumination conditions. Experiments on synthetic and real images illustrate the robustness of the proposed appearance model vis-a-vis illumination variation. Keywords: computer vision, computer graphics, shading, illumination modeling, reflectance representation, image irradiance, frequency space representations, (hemi)spherical harmonics, analytic bilinear PCA, model-based bilinear PCA, 3D shape reconstruction, statistical shape from shading
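
A minimal sketch of the "image irradiance as a convolution" formulation referred to above: for a convex Lambertian surface, the irradiance's SH coefficients are the illumination's SH coefficients attenuated per band by the analytic clamped-cosine kernel (the Ramamoorthi and Hanrahan result). The dissertation's hemispherical-harmonics reflectance basis is not reproduced here; this only illustrates the frequency-space view.

```python
# Minimal sketch: Lambertian irradiance in frequency space, bands l = 0..2.
import numpy as np

def lambertian_kernel(l_max=2):
    """Per-band attenuation A_l of the clamped-cosine (Lambertian) kernel."""
    a = {0: np.pi, 1: 2.0 * np.pi / 3.0, 2: np.pi / 4.0}
    return np.array([a[l] for l in range(l_max + 1)])

def irradiance_coeffs(light_coeffs_per_band):
    """light_coeffs_per_band: list of arrays; band l holds 2l+1 SH lighting coefficients."""
    A = lambertian_kernel(len(light_coeffs_per_band) - 1)
    # Convolution on the sphere becomes per-band scaling of the coefficients.
    return [A[l] * np.asarray(c) for l, c in enumerate(light_coeffs_per_band)]
```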

    A model of brightness variations due to illumination changes and non-rigid motion using spherical harmonics

    Pixel brightness variations in an image sequence depend both on the object's surface reflectance and on the motion of the camera and object. In the case of rigid shapes, some proposed models have been very successful in explaining the relation among these strongly coupled components. On the other hand, shapes which deform pose new challenges, since the relation between pixel brightness variation and non-rigid motion is not yet clear. In this paper, we introduce a new model which describes brightness variations with two independent components represented as linear basis shapes. Lighting influence is represented in terms of Spherical Harmonics, and non-rigid motion as a linear model of image-coordinate displacements. We then propose an efficient procedure for the estimation of this image model in two distinct steps. First, shape normals and albedo are estimated using standard photometric stereo on a sequence with varying lighting and no deformable motion. Then, given the object's shape normals and albedo, we efficiently compute the 2D coordinate bases by minimizing image pixel residuals over an image sequence with constant lighting and only non-rigid motion. Experiments on real data show the effectiveness of our approach in a face modelling context.
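
The photometric-stereo step mentioned above has a standard least-squares form; the sketch below assumes calibrated distant point-light directions (an assumption; the paper may obtain lighting differently) and recovers per-pixel normals and albedo.

```python
# Minimal sketch of classical photometric stereo with known light directions.
import numpy as np

def photometric_stereo(images, light_dirs):
    """images: (F, H, W) grayscale frames; light_dirs: (F, 3) unit light directions."""
    F, H, W = images.shape
    I = images.reshape(F, -1)                                 # (F, P) pixel intensities
    G = np.linalg.lstsq(light_dirs, I, rcond=None)[0]         # (3, P), G = albedo * normal
    albedo = np.linalg.norm(G, axis=0)                        # (P,)
    normals = G / np.maximum(albedo, 1e-8)                    # unit normals, (3, P)
    return normals.T.reshape(H, W, 3), albedo.reshape(H, W)
```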

    Multilinear methods for disentangling variations with applications to facial analysis

    Several factors contribute to the appearance of an object in a visual scene, including pose, illumination, and deformation, among others. Each factor accounts for a source of variability in the data. It is assumed that the multiplicative interactions of these factors emulate the entangled variability, giving rise to the rich structure of visual object appearance. Disentangling such unobserved factors from visual data is a challenging task, especially when the data have been captured in uncontrolled recording conditions (also referred to as "in-the-wild") and label information is not available. The work presented in this thesis focuses on disentangling the variations contained in visual data, in particular applied to 2D and 3D faces. The motivation behind this work lies in recent developments in the field, such as (i) the creation of large visual databases for face analysis, (ii) the need to extract information without the use of labels, and (iii) the need to deploy systems under demanding, real-world conditions. In the first part of this thesis, we present a method to synthesise plausible 3D expressions that preserve the identity of a target subject. This method is supervised, as the model learns from labels, in this case 3D facial meshes of people performing a defined set of facial expressions. The ability to synthesise an entire facial rig from a single neutral expression has a large range of applications in both computer graphics and computer vision, ranging from the efficient and cost-effective creation of CG characters to scalable data generation for machine learning purposes. Unlike previous methods based on multilinear models, the proposed approach is capable of extrapolating well outside the sample pool, which allows it to accurately reproduce the identity of the target subject and create artefact-free expression shapes while requiring only a small input dataset. We introduce global-local multilinear models that leverage the strengths of expression-specific and identity-specific local models combined with coarse motion estimations from a global model. The expression-specific and identity-specific local models are built from different slices of the patch-wise local multilinear model. Experimental results show that we achieve high-quality, identity-preserving facial expression synthesis results that outperform existing methods both quantitatively and qualitatively. In the second part of this thesis, we investigate how the modes of variation in visual data can be extracted. Our assumption is that visual data has an underlying structure consisting of factors of variation and their interactions. Finding this structure and the factors is important, as it would not only help us to better understand visual data, but once obtained the factors can be edited for use in various applications; Shape from Shading and expression transfer are just two of the potential applications. To extract the factors of variation, several supervised methods have been proposed, but they require both labels regarding the modes of variation and the same number of samples under all modes of variation. Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. We propose a novel general multilinear matrix decomposition method that discovers the multilinear structure of possibly incomplete sets of visual data in an unsupervised setting.
We demonstrate the applicability of the proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the wild and with occlusion removal), expression transfer, and estimation of surface normals from images captured in the wild. Finally, leveraging the proposed unsupervised multilinear method as well as recent advances in deep learning, we propose a weakly supervised deep learning method for disentangling multiple latent factors of variation in face images captured in-the-wild. To this end, we propose a deep latent variable model, where we model the multiplicative interactions of multiple latent factors of variation explicitly as a multilinear structure. We demonstrate that the proposed approach indeed learns disentangled representations of facial expressions and pose, which can be used in various applications, including face editing, as well as 3D face reconstruction and classification of facial expression, identity and pose.
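
For readers unfamiliar with the multilinear machinery this line of work builds on, the sketch below shows generic mode-n unfolding and a truncated HOSVD (Tucker) decomposition in NumPy; it is not the thesis's decomposition for incomplete, unlabelled data, only the standard building blocks.

```python
# Minimal sketch: mode-n unfolding and truncated HOSVD with NumPy.
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor, ranks):
    """Truncated HOSVD: one factor matrix per mode plus a core tensor."""
    factors = []
    for mode, rank in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(U[:, :rank])
    core = tensor
    for mode, U in enumerate(factors):
        # Mode-n product with U^T: contract the mode axis against the factor.
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors
```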

    View Synthesis from Image and Video for Object Recognition Applications

    Object recognition is one of the most important and successful applications in the computer vision community. The varying appearance of a test object due to different poses or illumination conditions can make the object recognition problem very challenging. Using view synthesis techniques to generate pose-invariant or illumination-invariant images or videos of the test object is an appealing approach to alleviate the recognition performance degradation caused by non-canonical views or lighting conditions. In this thesis, we first present a complete framework for better synthesis and understanding of the human pose from a limited number of available silhouette images. Pose-normalized silhouette images are generated using an active virtual camera and an image-based visual hull technique, with the silhouette turning-function distance used as the pose similarity measurement. In order to overcome the inability of the shape-from-silhouettes method to reconstruct concave regions of human postures, a view synthesis algorithm is proposed for articulated humans using the visual hull and contour-based body part segmentation. These two components improve each other's performance through the correspondence across viewpoints built via the inner-distance shape context measurement. Face recognition under varying pose is a challenging problem, especially when illumination variations are also present. We propose two algorithms to address this scenario. For a single light source, we demonstrate a pose-normalized face synthesis approach on a pixel-by-pixel basis from a single view by exploiting the bilateral symmetry of the human face. For more complicated illumination conditions, the spherical harmonic representation is extended to encode pose information. An efficient method is proposed for robust face synthesis and recognition with a very compact training set. Finally, we present an end-to-end moving object verification system for airborne video, wherein a homography-based view synthesis algorithm is used to simultaneously handle the object's changes in aspect angle, depression angle, and resolution. Efficient integration of spatial and temporal model matching assures the robustness of the verification step. As a byproduct, a robust two-camera tracking method using homography is also proposed and demonstrated on challenging surveillance video sequences.
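
A hedged sketch of the homography-based view synthesis step mentioned above, using only standard OpenCV calls; the source of the point correspondences and the choice of reference view are assumptions, not the thesis's pipeline.

```python
# Minimal sketch: fit a homography robustly and warp the source image into the target view.
import cv2
import numpy as np

def synthesize_view(src_img, src_pts, dst_pts, out_size):
    """src_pts, dst_pts: matched (N, 2) point arrays; out_size: (width, height)."""
    H, inliers = cv2.findHomography(
        np.asarray(src_pts, np.float32),
        np.asarray(dst_pts, np.float32),
        cv2.RANSAC, 3.0)                              # robust fit with RANSAC
    warped = cv2.warpPerspective(src_img, H, out_size)
    return warped, inliers
```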

    VISUAL TRACKING AND ILLUMINATION RECOVERY VIA SPARSE REPRESENTATION

    Compressive sensing, or sparse representation, has played a fundamental role in many fields of science. It shows that signals and images can be reconstructed from far fewer measurements than what is usually considered necessary. Sparsity leads to efficient estimation, efficient compression, dimensionality reduction, and efficient modeling. Recently, there has been a growing interest in compressive sensing in computer vision, and it has been successfully applied to face recognition, background subtraction, object tracking and other problems. Sparsity can be achieved by solving the compressive sensing problem using L1 minimization. In this dissertation, we present the results of a study of applying sparse representation to illumination recovery, object tracking, and simultaneous tracking and recognition. Illumination recovery, also known as inverse lighting, is the problem of recovering the illumination distribution in a scene from the appearance of objects located in the scene. It is used in augmented reality, where the virtual objects must match the existing image and cast convincing shadows on the real scene rendered with the recovered illumination. Shadows in a scene are caused by the occlusion of incoming light, and thus contain information about the lighting of the scene. Although shadows have been used in determining the 3D shape of the object that casts them onto the scene, few studies have focused on the illumination information provided by the shadows. In this dissertation, we recover the illumination of a scene from a single image with cast shadows, given the geometry of the scene. Images with cast shadows can be quite complex and therefore cannot be well approximated by low-dimensional linear subspaces. However, in this study we show that the set of images produced by a Lambertian scene with cast shadows can be efficiently represented by a sparse set of images generated by directional light sources. We first model an image with cast shadows as composed of a diffusive part (without cast shadows) and a residual part that captures the cast shadows. Then, we express the problem in an L1-regularized least squares formulation with nonnegativity constraints (as light has to be nonnegative at any point in space). This sparse representation enjoys an effective and fast solution, thanks to recent advances in compressive sensing. In experiments on both synthetic and real data, our approach performs favorably in comparison to several previously proposed methods. Visual tracking, which consistently infers the motion of a desired target in a video sequence, has been an active and fruitful research topic in computer vision for decades. It has many practical applications such as surveillance, human-computer interaction, medical imaging and so on. Many of the challenges in designing a robust tracking algorithm come from the enormous unpredictable variations in the target, such as deformations, fast motion, occlusions, background clutter, and lighting changes. To tackle these challenges, we propose a robust visual tracking method by casting tracking as a sparse approximation problem in a particle filter framework. In this framework, occlusion, noise and other challenging issues are addressed seamlessly through a set of trivial templates. Specifically, to find the tracking target in a new frame, each target candidate is sparsely represented in the space spanned by target templates and trivial templates. The sparsity is achieved by solving an L1-regularized least squares problem.
The candidate with the smallest projection error is then taken as the tracking target. After that, tracking is continued using a Bayesian state inference framework in which a particle filter is used for propagating sample distributions over time. Three additional components further improve the robustness of our approach: 1) a velocity-incorporated motion model that helps concentrate the samples on the true target location in the next frame, 2) nonnegativity constraints that help filter out clutter that is similar to tracked targets in reversed intensity patterns, and 3) a dynamic template update scheme that keeps track of the most representative templates throughout the tracking procedure. We test the proposed approach on many challenging sequences involving heavy occlusions, drastic illumination changes, large scale changes, non-rigid object movement, out-of-plane rotation, and large pose variations. The proposed approach shows excellent performance in comparison with four previously proposed trackers. We also extend the work to simultaneous tracking and recognition for vehicle classification in IR video sequences. We attempt to resolve the uncertainties in tracking and recognition at the same time by introducing a static template set that stores target images under various conditions such as different poses, lighting, and so on. The recognition results at each frame are propagated to produce the final result for the whole video. The tracking result is evaluated at each frame, and low confidence in tracking performance initiates a new cycle of tracking and classification. We demonstrate the robustness of the proposed method on vehicle tracking and classification using outdoor IR video sequences.
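
A minimal sketch of the sparse-coding step described above: a candidate patch is coded over target templates plus trivial (identity) templates with an L1 penalty and nonnegativity, and the reconstruction error over the target templates alone scores the candidate. scikit-learn's Lasso is used here as a stand-in solver, and the template handling is an assumption rather than the dissertation's exact implementation.

```python
# Minimal sketch: score one target candidate against target + trivial templates.
import numpy as np
from sklearn.linear_model import Lasso

def score_candidate(candidate, target_templates, lam=0.01):
    """candidate: (d,) patch vector; target_templates: (d, k) columns of target templates."""
    d, k = target_templates.shape
    trivial = np.eye(d)                               # one trivial template per pixel
    dictionary = np.hstack([target_templates, trivial])
    coder = Lasso(alpha=lam, positive=True, fit_intercept=False, max_iter=5000)
    coder.fit(dictionary, candidate)                  # L1-regularized, nonnegative coding
    coeffs = coder.coef_[:k]                          # keep only target-template coefficients
    error = np.linalg.norm(candidate - target_templates @ coeffs)
    return error                                      # smaller error = more target-like candidate
```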