    USING CNNS TO UNDERSTAND LIGHTING WITHOUT REAL LABELED TRAINING DATA

    The task of computer vision is to make computers understand the physical world through images. Lighting is the medium through which we capture images of the physical world: without lighting there is no image, and different lighting leads to different images of the same physical world. In this dissertation, we study how to understand lighting from images. With the emergence of large datasets and deep learning in recent years, learning-based methods play an increasingly important role in computer vision, and deep Convolutional Neural Networks (CNNs) now dominate most problems in the field. Despite their success, deep CNNs are notorious for their data-hungry nature compared with traditional learning-based methods. While collecting images from the internet is easy and fast, labeling those images is time-consuming, expensive, and sometimes impossible. In this work, we focus on understanding lighting from faces and natural scenes, for which ground-truth lighting labels are impossible to obtain.

    As a preliminary topic, we first study the capacity of deep CNNs. Designing deep CNNs with less capacity and good generalization is one way to reduce the amount of labeled data needed for training, and understanding their capacity is the first step towards that goal. We empirically study the capacity of deep CNNs through the redundancy of their parameters. More specifically, we aim to optimize the number of neurons in a network, and thus the number of parameters. To achieve that goal, we incorporate sparse constraints into the objective function and apply a forward-backward splitting method to solve this sparse constrained optimization problem efficiently. The proposed method can significantly reduce the number of parameters, showing that networks with small capacity can work well.

    We then study an important problem in computer vision: inverse lighting from a single face image. Lacking massive ground-truth lighting labels, we generate a large amount of synthetic data with ground-truth lighting to train a deep network. However, due to the large domain gap between real and synthetic data, a network trained on synthetic data alone does not generalize well to real data. We therefore propose to train the deep CNN with real data together with synthetic data. We apply an existing method to estimate the lighting conditions of real face images, but these lighting labels are noisy. We propose a Label Denoising Adversarial Network (LDAN) that uses the synthetic data to help train a deep CNN to regress lighting from real face images while denoising the labels of the real images. We show that the proposed method produces more consistent lighting estimates for faces taken under the same lighting condition.

    Third, we study how to relight a face image using deep CNNs. We formulate this as a supervised image-to-image translation problem. Due to the lack of an "in the wild" face dataset suitable for this task, we apply a physically-based face relighting method to generate a large-scale, high-resolution, "in the wild" portrait relighting dataset (DPR). A deep CNN is then trained on this dataset to generate a relighted portrait image from a source image and a target lighting. We show that our training procedure regularizes the generated results, removing the artifacts caused by physically-based relighting methods.

    Fourth, we study how to understand lighting in a natural scene from a single RGB image. We propose a Global-Local Spherical Harmonics (GLoSH) lighting model to improve the lighting representation, and jointly predict reflectance and surface normals. The global SH models the holistic lighting, while local SHs account for the spatial variation of lighting. A novel non-negative lighting constraint is proposed to encourage the estimated SHs to be physically meaningful. To make seamless use of the GLoSH model, we design a coarse-to-fine network structure. Lacking labels for reflectance and lighting, we pre-train the model on synthetic data and fine-tune it on real data in a self-supervised way. We show that the proposed method outperforms state-of-the-art methods in understanding the lighting, reflectance, and shading of a natural scene.
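
    The sparse-constraint pruning described in the first part of the abstract relies on forward-backward (proximal) splitting: alternate a gradient step on the data loss with a proximal step on an l1 penalty over per-neuron scale factors, so that some scales become exactly zero and the corresponding neurons can be removed. The following is a minimal, self-contained sketch of that optimization step on a toy least-squares problem; the toy setup and variable names are illustrative assumptions, not the dissertation's actual network or objective.

        # Toy forward-backward splitting: gradient step on the data loss, then a
        # proximal (soft-thresholding) step on an l1 penalty over per-neuron scales.
        # Scales driven exactly to zero mark neurons that can be pruned.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 32))        # activations of 32 hidden neurons
        y = X[:, :8] @ rng.normal(size=8)     # targets depend on only 8 of them
        scales = np.ones(32)                  # per-neuron scale factors to sparsify
        lr, lam = 0.1, 0.05                   # step size and l1 strength

        def soft_threshold(v, t):
            """Proximal operator of t * ||v||_1."""
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        for _ in range(500):
            grad = X.T @ (X @ scales - y) / len(y)                  # forward: gradient of 0.5*||Xs - y||^2 / N
            scales = soft_threshold(scales - lr * grad, lr * lam)   # backward: proximal step

        print("neurons kept:", int(np.count_nonzero(scales)), "of", scales.size)

    Because the proximal step sets small scales exactly to zero, rather than merely shrinking them as plain weight decay would, the number of surviving neurons, and hence parameters, falls directly out of the optimization.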

    Object-based Illumination Estimation with Rendering-aware Neural Networks

    We present a scheme for fast environment light estimation from the RGBD appearance of individual objects and their local image areas. Conventional inverse rendering is too computationally demanding for real-time applications, while the performance of purely learning-based techniques may be limited by the meager input data available from individual objects. To address these issues, we propose an approach that takes advantage of physical principles from inverse rendering to constrain the solution, while utilizing neural networks to expedite the more computationally expensive portions of the processing, increase robustness to noisy input data, and improve temporal and spatial stability. The result is a rendering-aware system that estimates the local illumination distribution at an object with high accuracy and in real time. With the estimated lighting, virtual objects can be rendered in AR scenarios with shading that is consistent with the real scene, leading to improved realism.
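
    One way to see the inverse-rendering constraint the abstract refers to: for a roughly Lambertian object whose normals are available from the RGBD input, the observed intensities are linear in a low-order spherical-harmonics (SH) lighting representation, so the lighting can be constrained by ordinary least squares. The sketch below is a generic illustration of that principle under a 9-coefficient second-order SH model with known albedo; it is not the paper's rendering-aware network, and the function names are our own.

        # Generic inverse-rendering illustration: recover 2nd-order SH lighting from
        # the shading of a diffuse object with known normals and albedo.
        import numpy as np

        def sh_basis(normals):
            """Standard real SH basis up to order 2 at unit normals: (N, 3) -> (N, 9)."""
            x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
            return np.stack([
                0.282095 * np.ones_like(x),
                0.488603 * y, 0.488603 * z, 0.488603 * x,
                1.092548 * x * y, 1.092548 * y * z,
                0.315392 * (3.0 * z ** 2 - 1.0),
                1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
            ], axis=1)

        def estimate_sh_lighting(normals, albedo, intensities):
            """Solve intensities ~= albedo * (sh_basis(normals) @ L) for the 9 SH coefficients L."""
            A = albedo[:, None] * sh_basis(normals)             # (N, 9) design matrix
            L, *_ = np.linalg.lstsq(A, intensities, rcond=None)
            return L

        # Synthetic check: lighting recovered from intensities rendered with a known lighting.
        rng = np.random.default_rng(0)
        n = rng.normal(size=(500, 3))
        n /= np.linalg.norm(n, axis=1, keepdims=True)
        true_L = rng.normal(size=9)
        obs = 0.8 * (sh_basis(n) @ true_L)                      # albedo 0.8, no noise
        print(np.allclose(estimate_sh_lighting(n, np.full(500, 0.8), obs), true_L))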

    Artificial Intelligence Tools for Facial Expression Analysis.

    Inner emotions show visibly on the human face and are understood as a basic guide to an individual's inner world. It is therefore possible to infer a person's attitudes, and the effect of others' behaviour on their deeper feelings, by examining facial expressions. In real-world applications, machines that interact with people need strong facial expression recognition, which benefits a wide range of applications in affective computing, advanced human-computer interaction, security, stress and depression analysis, robotic systems, and machine learning.

    This thesis starts by proposing a benchmark of dynamic versus static methods for facial Action Unit (AU) detection. An AU activation is the movement of a set of local, individual facial muscles that act in unison to constitute a natural facial expression event. Detecting AUs automatically is beneficial because it considers both static and dynamic facial features. In this research, AU occurrence detection was conducted by extracting static and dynamic features, using both hand-crafted and deep learning representations, from each frame of a video; this confirmed the clearly superior performance of pretrained deep models. Next, temporal modelling was investigated to detect the underlying temporal variation phases in dynamic sequences using supervised and unsupervised methods. These experiments showed the importance of stacking dynamic features on top of static ones when encoding deep features that combine the spatial and temporal schemes, and that fusing spatial and temporal features provides more long-term temporal pattern information. We further hypothesised that an unsupervised method would enable invariant information to be learned from dynamic textures.

    Recently, approaches based on Generative Adversarial Networks (GANs) have produced cutting-edge results. In the second part of this thesis, we propose a model that adopts an unsupervised DCGAN for facial feature extraction and classification, aiming both to generate facial expression images under different arbitrary poses (frontal, multi-view, and in the wild) and to recognise emotion categories and AUs, in an attempt to solve the problem of recognising the seven static emotion classes in the wild. Thorough cross-database experiments demonstrate that this approach can improve generalization. We also show that the features learnt by the DCGAN are poorly suited to encoding facial expressions observed under multiple views, or when trained from a limited number of positive examples.

    Finally, this research focuses on disentangling identity from expression for facial expression recognition. A novel technique was implemented for emotion recognition from a single monocular image. A large-scale dataset (Face vid) was created from facial videos rich in variations of facial dynamics, appearance, identities, expressions, and 3D poses. This dataset was used to train a DCNN (ResNet) to regress the expression parameters of a 3D Morphable Model jointly with a back-end classifier.
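
    As a rough illustration of the final stage described above, the sketch below pairs a ResNet backbone that regresses 3D Morphable Model expression parameters with a small back-end classifier over those parameters. The parameter count (29), the classifier width, and the seven emotion classes are illustrative assumptions, not the thesis's actual configuration.

        # Hypothetical sketch: a ResNet regresses 3DMM expression coefficients, and a
        # small back-end MLP classifies emotions from those coefficients.
        import torch
        import torch.nn as nn
        from torchvision.models import resnet18

        class ExpressionRegressor(nn.Module):
            def __init__(self, num_exp_params=29, num_emotions=7):
                super().__init__()
                backbone = resnet18(weights=None)                 # train from scratch here
                backbone.fc = nn.Linear(backbone.fc.in_features, num_exp_params)
                self.backbone = backbone
                self.classifier = nn.Sequential(                  # back-end emotion classifier
                    nn.Linear(num_exp_params, 64), nn.ReLU(), nn.Linear(64, num_emotions))

            def forward(self, images):
                exp_params = self.backbone(images)                # 3DMM expression coefficients
                return exp_params, self.classifier(exp_params)

        model = ExpressionRegressor()
        exp, logits = model(torch.randn(2, 3, 224, 224))          # batch of two face crops
        print(exp.shape, logits.shape)                            # (2, 29) and (2, 7)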

    Multimedia Forensics

    This book is open access. Media forensics has never been more relevant to societal life. Not only does media content represent an ever-increasing share of the data traveling on the net and the preferred means of communication for most users, it has also become an integral part of most innovative applications in the digital information ecosystem that serves various sectors of society, from entertainment to journalism to politics. Undoubtedly, advances in deep learning and computational imaging have contributed significantly to this outcome. The underlying technologies that drive this trend, however, also pose a profound challenge to establishing trust in what we see, hear, and read, and make media content the preferred target of malicious attacks. In this new threat landscape, powered by innovative imaging technologies and sophisticated tools based on autoencoders and generative adversarial networks, this book fills an important gap. It presents a comprehensive review of state-of-the-art forensic capabilities for media attribution, integrity and authenticity verification, and counter-forensics. Its content is developed to give practitioners, researchers, photo and video enthusiasts, and students a holistic view of the field.
