18 research outputs found

    Building Deep Networks on Grassmann Manifolds

    Full text link
    Learning representations on Grassmann manifolds is popular in quite a few visual recognition tasks. In order to enable deep learning on Grassmann manifolds, this paper proposes a deep network architecture by generalizing the Euclidean network paradigm to Grassmann manifolds. In particular, we design full rank mapping layers to transform input Grassmannian data to more desirable ones, exploit re-orthonormalization layers to normalize the resulting matrices, study projection pooling layers to reduce the model complexity in the Grassmannian context, and devise projection mapping layers to respect Grassmannian geometry and meanwhile achieve Euclidean forms for regular output layers. To train the Grassmann networks, we exploit a stochastic gradient descent setting on manifolds of the connection weights, and study a matrix generalization of backpropagation to update the structured data. The evaluations on three visual recognition tasks show that our Grassmann networks have clear advantages over existing Grassmann learning methods, and achieve results comparable with state-of-the-art approaches. Comment: AAAI'18 paper.
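The layer stack described above can be sketched in a few lines. The layer names follow the abstract, but the shapes and the NumPy realization below are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def frmap(X, W):
    # Full-rank mapping layer: transform the D x q Grassmannian
    # representative X with the connection-weight matrix W (d x D).
    return W @ X

def reorth(Y):
    # Re-orthonormalization layer: a QR decomposition restores
    # orthonormal columns, so the output is again a valid subspace basis.
    Q, _ = np.linalg.qr(Y)
    return Q

def projmap(X):
    # Projection mapping layer: the symmetric matrix X @ X.T embeds the
    # subspace in Euclidean space for regular output layers.
    return X @ X.T

D, d, q = 10, 6, 3                                # illustrative sizes
X = np.linalg.qr(rng.standard_normal((D, q)))[0]  # input subspace basis
W = rng.standard_normal((d, D))                   # full rank w.p. 1
Y = projmap(reorth(frmap(X, W)))
print(Y.shape)  # (6, 6)
```

The re-orthonormalization step is what keeps intermediate activations on the manifold; without it, `W @ X` would leave the Grassmannian.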

    EmoNets: Multimodal deep learning approaches for emotion recognition in video

    Full text link
    The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces, a deep belief net focusing on the representation of the audio stream, a K-Means based "bag-of-mouths" model, which extracts visual features around the mouth region, and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset.
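Mean fusion is among the simplest of the combination methods the abstract alludes to; a minimal sketch, in which the modality names and probability vectors are purely illustrative:

```python
import numpy as np

def fuse(modality_probs):
    """Average the per-modality class probabilities over the seven
    emotions and return the index of the winning class."""
    stacked = np.stack(list(modality_probs.values()))
    return int(np.argmax(stacked.mean(axis=0)))

# Hypothetical outputs from three modality-specific specialists.
preds = {
    "faces_cnn":     np.array([0.6, 0.1, 0.1, 0.1, 0.05, 0.03, 0.02]),
    "audio_dbn":     np.array([0.3, 0.4, 0.1, 0.1, 0.05, 0.03, 0.02]),
    "bag_of_mouths": np.array([0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.02]),
}
print(fuse(preds))  # 0
```

A learned weighting per modality, rather than a uniform mean, is the natural next refinement.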

    Comparing sets of data sets on the Grassmann and flag manifolds with applications to data analysis in high and low dimensions

    Get PDF
    Includes bibliographical references. Summer 2020. This dissertation develops numerical algorithms for comparing sets of data sets, utilizing the shape and orientation of data clouds. Two key components for "comparing" are the distance measure between data sets and, correspondingly, the geodesic path in between. Both components play a core role connecting the two parts of this dissertation, namely data analysis on the Grassmann manifold and on the flag manifold. In the first part, we build on the well-known geometric framework for analyzing and optimizing over data on the Grassmann manifold. To be specific, we extend the classical self-organizing mappings to the Grassmann manifold to visualize sets of high-dimensional data sets in 2D space. We also propose an optimization problem on the Grassmannian to recover missing data. In the second part, we extend the geometric framework to the flag manifold to encode the variability of nested subspaces. There we propose a numerical algorithm for computing a geodesic path and distance between nested subspaces. We also prove theorems showing how to reduce the dimension of the algorithm for practical computations. The approach is shown to have advantages for analyzing data when the number of data points is larger than the number of features.
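The subspace distance underpinning both parts of such a framework is conventionally computed from principal angles; a minimal sketch using the standard SVD-based construction (our illustration, not the dissertation's exact algorithm):

```python
import numpy as np

def grassmann_distance(A, B):
    """Geodesic distance between the subspaces spanned by the
    orthonormal columns of A and B, via principal angles."""
    # Singular values of A^T B are the cosines of the principal angles.
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return np.linalg.norm(theta)

rng = np.random.default_rng(1)
A = np.linalg.qr(rng.standard_normal((8, 2)))[0]
B = np.linalg.qr(rng.standard_normal((8, 2)))[0]
print(round(float(grassmann_distance(A, A)), 6))  # 0.0
```

The clip guards against cosines drifting marginally outside [-1, 1] from floating-point error before `arccos`.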

    Fast and accurate image and video analysis on Riemannian manifolds

    Get PDF

    Image based Static Facial Expression Recognition with Multiple Deep Network Learning

    Get PDF
    We report our image based static facial expression recognition method for the Emotion Recognition in the Wild Challenge (EmotiW) 2015. We focus on the sub-challenge of the SFEW 2.0 dataset, where one seeks to automatically classify a set of static images into 7 basic emotions. The proposed method contains a face detection module based on the ensemble of three state-of-the-art face detectors, followed by a classification module with the ensemble of multiple deep convolutional neural networks (CNN). Each CNN model is initialized randomly and pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013. The pre-trained models are then fine-tuned on the training set of SFEW 2.0. To combine multiple CNN models, we present two schemes for learning the ensemble weights of the network responses: by minimizing the log likelihood loss, and by minimizing the hinge loss. Our proposed method generates a state-of-the-art result on the FER dataset. It also achieves 55.96% and 61.29% respectively on the validation and test set of SFEW 2.0, surpassing the challenge baselines of 35.96% and 39.13% by significant margins.
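A sketch of the first ensemble scheme (minimizing the log-likelihood loss of the weighted combination). The exponentiated-gradient step on the simplex is our choice of optimizer, not necessarily the paper's, and the synthetic data are illustrative:

```python
import numpy as np

def learn_ensemble_weights(probs, labels, steps=200, lr=0.5):
    """Learn convex combination weights for K models by minimizing the
    log-likelihood loss of the weighted ensemble probabilities.
    probs: (K, N, C) per-model class probabilities; labels: (N,)."""
    K, N, _ = probs.shape
    idx = np.arange(N)
    w = np.full(K, 1.0 / K)
    for _ in range(steps):
        # Ensemble probability of each true class under current weights.
        p_true = np.einsum('k,kn->n', w, probs[:, idx, labels])
        grad = -np.mean(probs[:, idx, labels] / p_true, axis=1)
        w = w * np.exp(-lr * grad)   # exponentiated-gradient step keeps
        w /= w.sum()                 # the weights on the simplex
    return w

# Synthetic check: model 0 is confident and correct, model 1 is uniform.
rng = np.random.default_rng(0)
N, C = 200, 3
labels = rng.integers(0, C, size=N)
good = np.full((N, C), 0.1)
good[np.arange(N), labels] = 0.8
noisy = np.full((N, C), 1.0 / C)
w = learn_ensemble_weights(np.stack([good, noisy]), labels)
print(np.round(w, 2))  # weight concentrates on the better model
```

Because the log-likelihood is concave and linear in the weights, the optimum sits at a vertex here; with real, imperfectly correlated models the learned weights are typically interior.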

    Facial expression recognition in the wild : from individual to group

    Get PDF
    The progress in computing technology has increased the demand for smart systems capable of understanding human affect and emotional manifestations. One of the crucial factors in designing systems equipped with such intelligence is having accurate automatic Facial Expression Recognition (FER) methods. In computer vision, automatic facial expression analysis has been an active field of research for over two decades, yet many questions remain unanswered. The research presented in this thesis attempts to address some of the key issues of FER in challenging conditions, namely: 1) creating a facial expressions database representing real-world conditions; 2) devising Head Pose Normalisation (HPN) methods which are independent of facial part locations; 3) creating automatic methods for analysing the mood of a group of people. The central hypothesis of the thesis is that extracting close-to-real-world data from movies and performing facial expression analysis on them is a stepping stone towards analysing faces under real-world, unconstrained conditions. A temporal facial expressions database, Acted Facial Expressions in the Wild (AFEW), is proposed. The database is constructed and labelled using a semi-automatic process based on closed-caption subtitle keyword search. Currently, AFEW is the largest facial expressions database representing challenging conditions available to the research community. To provide a common platform for researchers to evaluate and extend their state-of-the-art FER methods, the first Emotion Recognition in the Wild (EmotiW) challenge, based on AFEW, is proposed. An image-only facial expressions database, Static Facial Expressions In The Wild (SFEW), extracted from AFEW, is also proposed. Furthermore, the thesis focuses on HPN for real-world images. Earlier methods were based on fiducial points. 
However, as fiducial point detection is an open problem for real-world images, such HPN can be error-prone. An HPN method based on response maps generated from part detectors is proposed. The proposed shape-constrained method requires neither fiducial points nor head pose information, which makes it suitable for real-world images. Data from movies and the internet, representing real-world conditions, poses another major challenge: the presence of multiple subjects. This defines another focus of this thesis, where a novel approach for modelling the perceived mood of a group of people in an image is presented. A new database is constructed from Flickr based on keywords related to social events. Three models are proposed: an averaging-based Group Expression Model (GEM), a Weighted Group Expression Model (GEM_w) and an Augmented Group Expression Model (GEM_LDA). GEM_w is based on social contextual attributes, which are used as weights on each person's contribution towards the overall group mood. GEM_LDA is based on a topic model and feature augmentation. The proposed framework is applied to group candid shot selection and event summarisation. The application of the Structural SIMilarity (SSIM) index metric is explored for finding similar facial expressions, and the resulting framework is applied to creating image albums based on facial expressions and to finding corresponding expressions for training facial performance transfer algorithms.
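The averaging and weighted group expression models can be sketched directly; the per-face scores and the use of face size as a contextual weight below are illustrative placeholders, not the thesis's actual attributes:

```python
import numpy as np

def gem_average(face_scores):
    """Averaging-based GEM: group mood is the mean of the per-face
    expression intensity scores."""
    return float(np.mean(face_scores))

def gem_weighted(face_scores, weights):
    """GEM_w sketch: social-context attributes act as weights on each
    person's contribution to the group mood. Here we use hypothetical
    face sizes as the weights."""
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * face_scores) / np.sum(weights))

scores = np.array([0.9, 0.4, 0.7])   # per-face happiness intensities
sizes = np.array([3.0, 1.0, 2.0])    # hypothetical face sizes
print(round(gem_average(scores), 3))         # 0.667
print(round(gem_weighted(scores, sizes), 3)) # 0.75
```

The weighted variant pulls the group estimate towards the larger (presumably more salient) faces, which is the intuition behind using contextual attributes as weights.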

    Face Recognition and Facial Attribute Analysis from Unconstrained Visual Data

    Get PDF
    Analyzing human faces from visual data has been one of the most active research areas in the computer vision community. However, it is a very challenging problem in unconstrained environments due to variations in pose, illumination, expression, occlusion and blur between training and testing images. The task becomes even more difficult when only a limited number of images per subject is available for modeling these variations. In this dissertation, different techniques for performing classification of human faces as well as other facial attributes such as expression, age, gender, and head pose in uncontrolled settings are investigated. In the first part of the dissertation, a method for reconstructing the virtual frontal view from a given non-frontal face image using Markov Random Fields (MRFs) and an efficient variant of the Belief Propagation (BP) algorithm is introduced. In the proposed approach, the input face image is divided into a grid of overlapping patches and a globally optimal set of local warps is estimated to synthesize the patches at the frontal view. A set of possible warps for each patch is obtained by aligning it with images from a training database of frontal faces. The alignments are performed efficiently in the Fourier domain using an extension of the Lucas-Kanade (LK) algorithm that can handle illumination variations. The problem of finding the optimal warps is then formulated as a discrete labeling problem using an MRF. The reconstructed frontal face image can then be used with any face recognition technique. The two main advantages of our method are that it requires neither manually selected facial landmarks nor head pose estimation. In the second part, the task of face recognition in unconstrained settings is formulated as a domain adaptation problem. 
The domain shift is accounted for by deriving a latent subspace or domain, which jointly characterizes the multifactor variations using appropriate image formation models for each factor. The latent domain is defined as a product of Grassmann manifolds based on the underlying geometry of the tensor space, and recognition is performed across domain shift using statistics consistent with the tensor geometry. More specifically, given a face image from the source or target domain, multiple images of that subject are first synthesized under different illuminations, blur conditions, and 2D perturbations to form a tensor representation of the face. The orthogonal matrices obtained from the decomposition of this tensor, where each matrix corresponds to a factor variation, are used to characterize the subject as a point on a product of Grassmann manifolds. For cases with only one image per subject in the source domain, the identity of target domain faces is estimated using the geodesic distance on product manifolds. When multiple images per subject are available, an extension of kernel discriminant analysis is developed using a novel kernel based on the projection metric on product spaces. Furthermore, a probabilistic approach to the problem of classifying image sets on product manifolds is introduced. Understanding attributes such as expression, age class, and gender from face images has many applications in multimedia processing including content personalization, human-computer interaction, and facial identification. To achieve good performance in these tasks, it is important to be able to extract pertinent visual structures from the input data. In the third part of the dissertation, a fully automatic approach for performing classification of facial attributes based on hierarchical feature learning using sparse coding is presented. The proposed approach is generative in the sense that it does not use label information in the process of feature learning. 
As a result, the same feature representation can be applied to different tasks such as expression, age, and gender classification. Final classification is performed by a linear SVM trained with the corresponding labels for each task. The last part of the dissertation presents an automatic algorithm for determining the head pose from a given face image. The face image is divided into a regular grid and represented by dense SIFT descriptors extracted from the grid points. Random Projection (RP) is then applied to reduce the dimension of the concatenated SIFT descriptor vector. Classification and regression using Support Vector Machine (SVM) are combined in order to obtain an accurate estimate of the head pose. The advantage of the proposed approach is that it does not require facial landmarks such as the eye and mouth corners or the nose tip to be extracted from the input face image, as in many other methods.
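The Random Projection step in the head-pose pipeline is straightforward to sketch; a minimal version assuming Gaussian projections (the dense-SIFT extraction and SVM stages are omitted, and all sizes are illustrative):

```python
import numpy as np

def random_projection(X, d_out, seed=0):
    """Gaussian random projection of concatenated descriptor vectors:
    (n_samples, d_in) -> (n_samples, d_out)."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(d_out) approximately preserves pairwise
    # distances (Johnson-Lindenstrauss).
    R = rng.standard_normal((X.shape[1], d_out)) / np.sqrt(d_out)
    return X @ R

# A 4 x 4 grid of 128-D SIFT descriptors concatenated per image.
X = np.random.default_rng(1).standard_normal((10, 4 * 4 * 128))
Z = random_projection(X, 64)
print(Z.shape)  # (10, 64)
```

The reduced vectors `Z` would then feed the combined SVM classification/regression stage described in the abstract.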

    Parametric face alignment : generative and discriminative approaches

    Get PDF
    Doctoral thesis in Electrical and Computer Engineering, presented to the Faculty of Sciences and Technology of the University of Coimbra. This thesis addresses the matching of deformable human face models into 2D images. Two different approaches are detailed: generative and discriminative methods. Generative or holistic methods model the appearance/texture of all image pixels describing the face by synthesizing the expected appearance (building synthetic versions of the target face). Discriminative or patch-based methods model the local correlations between pixel values; such an approach uses an ensemble of local feature detectors, all connected by a shape regularization model. Typically, generative approaches can achieve higher fitting accuracy, but discriminative methods perform much better on unseen images. The Active Appearance Models (AAMs) are probably the most widely used generative technique. AAMs match parametric models of shape and appearance to new images by solving a nonlinear optimization that minimizes the difference between a synthetic template and the real appearance. The first part of this thesis describes the 2.5D AAM, an extension of the original 2D AAM that deals with a full perspective projection model. The 2.5D AAM uses a 3D Point Distribution Model (PDM) and a 2D appearance model whose control points are defined by a perspective projection of the PDM. Two model fitting algorithms and their computationally efficient approximations are proposed: the Simultaneous Forwards Additive (SFA) and the Normalization Forwards Additive (NFA). Robust solutions for the SFA and NFA are also proposed in order to take into account self-occlusion and/or partial occlusion of the face. Extensive results, involving fitting convergence, fitting performance on unseen data, robustness to occlusion, tracking performance and pose estimation, are shown. 
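The AAM fitting objective mentioned above can be illustrated with a single Gauss-Newton step. This is our simplification with a linear synthesis model, so one step recovers the parameters exactly; the real appearance model is nonlinear and the solvers (SFA, NFA) iterate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimize ||image - M @ p||^2: the difference between the appearance
# synthesized from parameters p and the observed pixels.
M = rng.standard_normal((20, 3))      # toy linear appearance basis
p_true = np.array([0.5, -1.0, 2.0])
image = M @ p_true                    # "observed" pixels

p = np.zeros(3)
r = image - M @ p                     # residual: real minus synthetic
# Gauss-Newton update: solve the linearized least-squares problem.
delta, *_ = np.linalg.lstsq(M, r, rcond=None)
p = p + delta
print(np.round(p, 3))  # recovers [0.5, -1.0, 2.0]
```

In a real AAM the Jacobian of the warp and appearance with respect to `p` replaces `M` and is recomputed (or approximated) at each iteration.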
The second main part of this thesis concerns discriminative methods such as the Constrained Local Models (CLM) or the Active Shape Models (ASM), where an ensemble of local feature detectors is constrained to lie within the subspace spanned by a PDM. Fitting such a model to an image typically involves two steps: (1) a local search using a detector, obtaining response maps for each landmark, and (2) a global optimization that finds the shape parameters that jointly maximize all the detection responses. This work proposes Discriminative Bayesian Active Shape Models (DBASM), a new global optimization strategy using a Bayesian approach, where the posterior distribution of the shape parameters is inferred in a maximum a posteriori (MAP) sense by means of a Linear Dynamical System (LDS). The DBASM approach models the covariance of the latent variables, i.e. it uses 2nd-order statistics of the shape (and pose) parameters. Later, Bayesian Active Shape Models (BASM) is presented. BASM is an extension of the DBASM formulation where the prior distribution is explicitly modeled by means of recursive Bayesian estimation. Extensive results are presented, evaluating the DBASM and BASM global optimization strategies, local face part detectors, and tracking performance on several standard datasets. Qualitative results from the challenging Labeled Faces in the Wild (LFW) dataset are also shown. Finally, the last part of this thesis addresses identity and facial expression recognition. Face geometry is extracted from input images using the AAM, and low-dimensional manifolds are then derived using Laplacian EigenMaps (LE), resulting in two types of manifolds, one representing identity and the other person-specific facial expression. 
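The two-step CLM fitting procedure can be sketched in a simplified form. Here we pick each response-map peak independently and then regularize by least squares in the PDM basis, whereas the thesis's contribution is precisely a richer Bayesian global step; all names and sizes below are our illustrative assumptions:

```python
import numpy as np

def fit_clm(response_maps, mean_shape, basis):
    """Simplified CLM-style fit: (1) take the peak of each landmark's
    response map as a candidate location, (2) regularize by projecting
    the candidates onto the shape subspace spanned by the PDM basis."""
    n = len(response_maps)
    peaks = np.empty((n, 2))
    for i, rm in enumerate(response_maps):
        peaks[i] = np.unravel_index(np.argmax(rm), rm.shape)
    # Least-squares shape parameters, then reconstruct the shape.
    b, *_ = np.linalg.lstsq(basis, peaks.ravel() - mean_shape, rcond=None)
    return (mean_shape + basis @ b).reshape(n, 2)

# Toy setup: 3 landmarks, 7x7 response maps peaking at known points.
rng = np.random.default_rng(0)
true = np.array([[1.0, 2.0], [3.0, 3.0], [5.0, 1.0]])
maps = []
for r, c in true.astype(int):
    m = np.zeros((7, 7))
    m[r, c] = 1.0
    maps.append(m)
mean_shape = true.ravel()               # mean shape at the true points
basis = rng.standard_normal((6, 2))     # toy 2-mode PDM basis
shape = fit_clm(maps, mean_shape, basis)
print(shape.shape)  # (3, 2)
```

The PDM projection is what keeps spurious detector peaks from producing anatomically implausible shapes; DBASM/BASM replace the plain least-squares step with MAP inference that also uses the full response maps and parameter covariances.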
The identity and facial expression recognition system uses a two-stage approach: first, a Support Vector Machine (SVM) is used to establish identity across expression changes; the second stage then deals with person-specific expression recognition with a network of Hidden Markov Models (HMMs). Results from people exhibiting the six basic expressions (happiness, sadness, anger, fear, surprise and disgust) plus the neutral emotion are shown.