224 research outputs found

    FACIAL EXPRESSION RECOGNITION AND EDITING WITH LIMITED DATA

    Over the past five years, methods based on deep features have taken over the computer vision field. While dramatic performance improvements have been achieved for tasks such as face detection and verification, these methods usually need large amounts of annotated data. In practice, not all computer vision tasks have access to large amounts of annotated data; facial expression analysis is one such task. In this dissertation, we focus on facial expression recognition and editing with small datasets. In addition, to cope with challenging conditions such as pose variation and occlusion, we also study unaligned facial attribute detection and occluded expression recognition. This dissertation is divided into four parts.

    In the first part, we present FaceNet2ExpNet, a novel method for training a lightweight, high-accuracy classification model for expression recognition with small datasets. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed. In the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully-connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics. Evaluations on four public expression databases demonstrate that our method outperforms the state of the art.

    In the second part, we focus on robust facial expression recognition under occlusion and propose a landmark-guided attention branch that finds corrupted feature elements and discards them from recognition. An attention map is first generated to indicate whether a specific facial part is occluded and to guide our model to attend to non-occluded regions. To further increase robustness, we propose a facial region branch that partitions the feature maps into non-overlapping facial blocks and enforces each block to predict the expression independently. Owing to the synergy of the two branches, our occlusion-adaptive deep network significantly outperforms state-of-the-art methods on two challenging in-the-wild benchmark datasets and three real-world occluded expression datasets.

    In the third part, we propose a cascade network that simultaneously learns to localize face regions specific to attributes and performs attribute classification without alignment. First, a weakly-supervised face region localization network is designed to automatically detect regions (or parts) specific to attributes. Then multiple part-based networks and a whole-image-based network are separately constructed and combined by a region switch layer and an attribute relation layer for final attribute classification. A multi-net learning method and hint-based model compression are further proposed to obtain an effective localization model and a compact classification model, respectively. Our approach achieves significantly better performance than state-of-the-art methods on the unaligned CelebA dataset, reducing the classification error by 30.9%.

    In the final part of this dissertation, we propose an Expression Generative Adversarial Network (ExprGAN) for photo-realistic facial expression editing with controllable expression intensity. An expression controller module is specially designed to learn an expressive and compact expression code in addition to the encoder-decoder network. This novel architecture enables the expression intensity to be continuously adjusted from low to high. We further show that ExprGAN can be applied to other tasks, such as expression transfer, image retrieval, and data augmentation for training improved facial expression recognition models. To tackle the small size of the training database, an effective incremental learning scheme is proposed. Quantitative and qualitative evaluations on the widely used Oulu-CASIA dataset demonstrate the effectiveness of ExprGAN.
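
    As a rough, hypothetical sketch of the two-stage training idea described in the first part, the following PyTorch-style code pre-trains the expression net's convolutional layers to regress the face net's high-level features and then appends fully-connected layers for joint fine-tuning. The module and loader names (expr_conv, face_net, expr_loader) are illustrative, and the sketch assumes the two networks produce features of matching shape; it is not the dissertation's actual implementation.

        # Illustrative two-stage training loop (assumed names and shapes).
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def pretrain_conv_layers(expr_conv, face_net, expr_loader,
                                 epochs=10, lr=1e-3):
            """Stage 1: regularize expression-net conv features with a frozen face net."""
            face_net.eval()  # the face net only supplies target features
            opt = torch.optim.SGD(expr_conv.parameters(), lr=lr, momentum=0.9)
            for _ in range(epochs):
                for images, _ in expr_loader:
                    with torch.no_grad():
                        target = face_net(images)       # high-level face features
                    pred = expr_conv(images)            # expression-net conv features
                    loss = F.mse_loss(pred, target)     # feature-regression loss
                    opt.zero_grad()
                    loss.backward()
                    opt.step()

        def refine_whole_network(expr_conv, num_classes, expr_loader,
                                 epochs=20, lr=1e-4):
            """Stage 2: append fully-connected layers and train the whole net jointly."""
            model = nn.Sequential(expr_conv, nn.Flatten(), nn.LazyLinear(num_classes))
            opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
            for _ in range(epochs):
                for images, labels in expr_loader:
                    loss = F.cross_entropy(model(images), labels)
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
            return model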

    Face Hallucination via Deep Neural Networks.

    We first address aligned low-resolution (LR) face images (i.e. 16×16 pixels) by designing a discriminative generative network, named URDGN. URDGN is composed of two networks: a generative model and a discriminative model. We introduce a pixel-wise L2 regularization term to the generative model and exploit the feedback of the discriminative network to make the upsampled face images more similar to real ones.

    We then present an end-to-end transformative discriminative neural network (TDN) devised for super-resolving unaligned tiny face images. TDN embeds spatial transformation layers to enforce local receptive fields to line up with similar spatial supports. To upsample noisy unaligned LR face images, we propose decoder-encoder-decoder networks: a transformative discriminative decoder network is employed to upsample and denoise LR inputs simultaneously; the intermediate HR faces are then projected to aligned and noise-free LR faces by a transformative encoder network; finally, high-quality hallucinated HR images are generated by our second decoder. Furthermore, we present an end-to-end multiscale transformative discriminative neural network (MTDN) to super-resolve unaligned LR face images of different resolutions in a unified framework.

    We also propose a method that explicitly incorporates structural information of faces into the face super-resolution process by using a multi-task convolutional neural network (CNN). Our method uses not only low-level information (i.e. intensity similarity) but also middle-level information (i.e. face structure) to further explore spatial constraints of facial components from LR input images. We demonstrate that supplementing residual images or feature maps with additional facial attribute information can significantly reduce the ambiguity in face super-resolution. To explore this idea, we develop an attribute-embedded upsampling network. In this manner, our method is able to super-resolve LR faces by a large upscaling factor while remarkably reducing the uncertainty of one-to-many mappings.

    We further push the boundaries of hallucinating a tiny, non-frontal face image to understand how much is possible by leveraging the availability of large datasets and deep networks. To this end, we introduce a novel Transformative Adversarial Neural Network (TANN) to jointly frontalize very LR out-of-plane rotated face images (including profile views) and aggressively super-resolve them by 8×, regardless of their original poses and without using any 3D information. Besides recovering an HR face image from an LR version, this thesis also addresses the task of restoring realistic faces from stylized portrait images, which can also be regarded as face hallucination.
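
    A minimal, hypothetical PyTorch-style sketch of the URDGN-style training signal described above follows: the upsampling generator is driven by a pixel-wise L2 term plus adversarial feedback from a discriminator that separates real HR faces from upsampled ones. The names (generator, discriminator, lambda_adv) and the loss weighting are assumptions for illustration, not the thesis implementation.

        # Illustrative GAN-based face super-resolution steps (assumed names).
        import torch
        import torch.nn.functional as F

        def generator_step(generator, discriminator, lr_faces, hr_faces,
                           g_opt, lambda_adv=0.01):
            sr_faces = generator(lr_faces)                  # upsampled faces
            pixel_loss = F.mse_loss(sr_faces, hr_faces)     # pixel-wise L2 term
            fake_logits = discriminator(sr_faces)
            # adversarial feedback: push outputs toward the "real" decision
            adv_loss = F.binary_cross_entropy_with_logits(
                fake_logits, torch.ones_like(fake_logits))
            loss = pixel_loss + lambda_adv * adv_loss
            g_opt.zero_grad()
            loss.backward()
            g_opt.step()
            return loss.item()

        def discriminator_step(generator, discriminator, lr_faces, hr_faces, d_opt):
            with torch.no_grad():
                sr_faces = generator(lr_faces)              # fake HR samples
            real_logits = discriminator(hr_faces)
            fake_logits = discriminator(sr_faces)
            loss = (F.binary_cross_entropy_with_logits(
                        real_logits, torch.ones_like(real_logits)) +
                    F.binary_cross_entropy_with_logits(
                        fake_logits, torch.zeros_like(fake_logits)))
            d_opt.zero_grad()
            loss.backward()
            d_opt.step()
            return loss.item()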

    Landmarks Augmentation with Manifold-Barycentric Oversampling

    The training of Generative Adversarial Networks (GANs) requires a large amount of data, stimulating the development of new augmentation methods to alleviate the challenge. Oftentimes, these methods either fail to produce enough new data or expand the dataset beyond the original manifold. In this paper, we propose a new augmentation method that is guaranteed to keep the new data within the original data manifold thanks to optimal transport theory. The proposed algorithm finds cliques in the nearest-neighbors graph and, at each sampling iteration, randomly draws one clique to compute the Wasserstein barycenter with random uniform weights. These barycenters then become the new natural-looking elements that one could add to the dataset. We apply this approach to the problem of landmark detection and augment the available annotation in both unpaired and semi-supervised scenarios. Additionally, the idea is validated on cardiac data for the task of medical segmentation. Our approach reduces overfitting and improves the quality metrics beyond the original data outcome and beyond the result obtained with popular modern augmentation methods.
    Comment: 11 pages, 4 figures, 3 tables. I.B. and N.B. contributed equally. D.V.D. is the corresponding author.
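
    A simplified sketch of the sampling loop described above, under stated assumptions: it builds a k-nearest-neighbors graph over the annotated landmark sets, enumerates cliques, and repeatedly draws one clique to combine with random uniform weights. The weighted Euclidean mean of corresponding landmarks is used here as a stand-in for the Wasserstein barycenter computed in the paper, and the parameters (k, n_new) are illustrative.

        # Illustrative manifold-barycentric oversampling (simplified stand-in).
        import numpy as np
        import networkx as nx
        from sklearn.neighbors import kneighbors_graph

        def barycentric_oversample(landmarks, k=5, n_new=100, seed=0):
            """landmarks: (n_samples, n_points, 2) array of annotated landmark sets."""
            rng = np.random.default_rng(seed)
            flat = landmarks.reshape(len(landmarks), -1)
            adj = kneighbors_graph(flat, n_neighbors=k, mode="connectivity")
            graph = nx.from_scipy_sparse_array(adj)
            cliques = [c for c in nx.find_cliques(graph) if len(c) >= 2]

            new_samples = []
            for _ in range(n_new):
                clique = cliques[rng.integers(len(cliques))]    # draw one clique
                weights = rng.dirichlet(np.ones(len(clique)))   # random convex weights
                # a convex combination of clique members stays close to the
                # original data manifold (here: plain weighted mean of points)
                new_samples.append(np.tensordot(weights, landmarks[clique], axes=1))
            return np.stack(new_samples)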

    Multimodal Sentiment Analysis Based on Deep Learning: Recent Progress

    Multimodal sentiment analysis is an important research topic in the field of NLP, aiming to analyze speakers' sentiment tendencies through features extracted from the textual, visual, and acoustic modalities. Its main methods are based on machine learning and deep learning. Machine learning-based methods rely heavily on labeled data, whereas deep learning-based methods can overcome this shortcoming and capture the in-depth semantic information and modal characteristics of the data, as well as the interactive information between multimodal data. In this paper, we survey the deep learning-based methods, including fusion of text and image and fusion of text, image, audio, and video. Specifically, we discuss the main problems of these methods and future directions. Finally, we review work on multimodal sentiment analysis in conversation.
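
    For illustration only, a minimal PyTorch-style sketch of a simple late-fusion baseline for this setting: per-modality encoders produce fixed-size feature vectors for text, vision, and audio, which are concatenated and classified. The dimensions and module names are assumptions and are not taken from any specific surveyed model.

        # Illustrative late-fusion sentiment classifier (assumed dimensions).
        import torch
        import torch.nn as nn

        class LateFusionSentiment(nn.Module):
            def __init__(self, text_dim=768, image_dim=512, audio_dim=128,
                         hidden=256, num_classes=3):
                super().__init__()
                self.fuse = nn.Sequential(
                    nn.Linear(text_dim + image_dim + audio_dim, hidden),
                    nn.ReLU(),
                    nn.Dropout(0.3),
                    nn.Linear(hidden, num_classes),  # e.g. negative/neutral/positive
                )

            def forward(self, text_feat, image_feat, audio_feat):
                fused = torch.cat([text_feat, image_feat, audio_feat], dim=-1)
                return self.fuse(fused)

        # usage with random features for a batch of 4 utterances:
        # logits = LateFusionSentiment()(torch.randn(4, 768),
        #                                torch.randn(4, 512),
        #                                torch.randn(4, 128))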