133 research outputs found

    EmoNets: Multimodal deep learning approaches for emotion recognition in video

    The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces; a deep belief net, focusing on the representation of the audio stream; a K-Means based "bag-of-mouths" model, which extracts visual features around the mouth region; and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for combining the cues from these modalities into one common classifier, which achieves a considerably greater accuracy than our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset.
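The abstract describes combining cues from several modality-specific classifiers into one prediction. As a rough illustration of decision-level fusion (not necessarily the paper's exact combination method), the sketch below takes a weighted average of per-modality class probabilities; the model names, weights, and numbers are made-up assumptions.

```python
import numpy as np

# Hypothetical per-modality class probabilities for one clip over the
# seven EmotiW emotion classes (rows: modalities, columns: classes).
# In the paper these would come from the face CNN, audio model and
# bag-of-mouths model; here they are invented numbers for illustration.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

modality_probs = np.array([
    [0.05, 0.05, 0.10, 0.55, 0.10, 0.05, 0.10],   # face CNN
    [0.10, 0.05, 0.15, 0.40, 0.15, 0.05, 0.10],   # audio model
    [0.10, 0.10, 0.10, 0.35, 0.15, 0.10, 0.10],   # bag-of-mouths
])

# Per-modality weights, e.g. tuned on a validation set (assumed values).
weights = np.array([0.5, 0.3, 0.2])

# Weighted average of the modality predictions, then renormalise.
fused = weights @ modality_probs
fused /= fused.sum()

print("fused distribution:", np.round(fused, 3))
print("predicted emotion:", EMOTIONS[int(np.argmax(fused))])
```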

    Batik Classification using Deep Convolutional Network Transfer Learning

    Batik fabric is one of the most profound pieces of cultural heritage in Indonesia; hence, continuous research on understanding it is necessary to preserve it. Despite being one of the most common research tasks, automatic classification of Batik patterns still requires improvement, especially with regard to the invariance dilemma. The convolutional neural network (ConvNet) is a deep learning architecture able to learn data representations by combining local receptive fields, weight sharing, and convolutions in order to address the invariance dilemma in image classification. Using a dataset of 2,092 Batik patches (5 classes), the experiments show that the proposed model, which uses the deep ConvNet VGG16 as a feature extractor (transfer learning), achieves a slightly better average accuracy of 89 ± 7% than the SIFT- and SURF-based models, which achieve 88 ± 10% and 88 ± 8% respectively. Despite this, SIFT reaches around 5% higher accuracy on the rotated and scaled dataset.
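The pipeline above uses a pretrained VGG16 as a fixed feature extractor with a separate classifier trained on top. A minimal sketch of that transfer-learning idea, assuming torchvision's pretrained VGG16 (torchvision >= 0.13 weights API), an SVM head, and random stand-in patches; the paper's actual classifier and preprocessing may differ.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Pretrained VGG16 used as a frozen feature extractor
# (downloads ImageNet weights on first use).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()

def extract_features(batch: torch.Tensor) -> np.ndarray:
    """Map a batch of normalised 3x224x224 patches to flat VGG16 features."""
    with torch.no_grad():
        x = vgg.features(batch)   # convolutional feature maps
        x = vgg.avgpool(x)        # N x 512 x 7 x 7
    return x.flatten(start_dim=1).numpy()

# Stand-in data: random tensors in place of real Batik patches, and
# cyclic labels for the 5 classes mentioned in the abstract.
patches = torch.rand(20, 3, 224, 224)
labels = np.arange(20) % 5

features = extract_features(patches)          # 20 x 25088 feature matrix
clf = SVC(kernel="linear").fit(features, labels)
print("train accuracy:", clf.score(features, labels))
```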

    Learning image features with convolutional networks under a constraint on supervised data (Aprendendo características de imagens por redes convolucionais sob restrição de dados supervisionados)

    Advisor: Alexandre Xavier Falcão. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Image analysis has been widely employed in many areas of the Sciences and Engineering to extract and interpret high-level information from images, with applications ranging from simple bar code analysis to the diagnosis of diseases. However, state-of-the-art solutions based on deep learning often require a training set with a high number of annotated (labeled) examples. This implies significant human effort in sample identification, isolation, and labeling from large image databases, and when image annotation also requires specialists in the application domain, as in Medicine and Agriculture, this becomes a crucial drawback. In this context, Convolutional Networks (ConvNets) are among the most successful approaches for image feature extraction, such that their combination with a Multi-Layer Perceptron (MLP) network or a Support Vector Machine (SVM) allows effective sample classification. Another problem with these techniques is the resulting high-dimensional feature space, which hampers the analysis of the sample distribution by commonly used distance-based data clustering and visualization methods. In this work, we address both problems by assessing the main strategies for ConvNet design, namely Architecture Learning (AL), Filter Learning (FL), and Transfer Learning (TL), according to their capability of learning from a limited number of labeled examples, and by evaluating the impact of feature space reduction techniques on distance-based data classification and visualization. To confirm the effectiveness of feature learning, we analyze the progress of the classifier as the number of supervised samples increases during active learning. Data augmentation has also been evaluated as a potential strategy to cope with the absence of labeled examples. Finally, we present the main results of the work for a real application, the diagnosis of intestinal parasites, in comparison to state-of-the-art image descriptors. In conclusion, TL proved to be the best strategy under a constraint on supervised data whenever a previously learned network that suits the problem is available; when this is not the case, AL comes as the second-best alternative. We also observed the effectiveness of Linear Discriminant Analysis (LDA) in considerably reducing the feature space created by ConvNets, allowing the expert to better understand the feature learning and active learning processes through multidimensional data visualization. This important result suggests, as future work, an interplay between feature learning, active learning, and the experts to improve both processes.
    Master's degree in Computer Science (Mestre em Ciência da Computação). Funding: CNPq, CAPES.
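One of the dissertation's findings is that Linear Discriminant Analysis can compress the high-dimensional ConvNet feature space enough for distance-based analysis and visualization. A small sketch of that reduction step with scikit-learn, using synthetic stand-in features instead of real ConvNet outputs; the dimensions and class counts are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in for high-dimensional ConvNet features of labelled samples;
# in the dissertation these would be features of, e.g., parasite images.
rng = np.random.default_rng(0)
n_classes, n_per_class, n_dims = 4, 50, 4096
features = np.vstack([
    rng.normal(loc=c, scale=1.0, size=(n_per_class, n_dims))
    for c in range(n_classes)
])
labels = np.repeat(np.arange(n_classes), n_per_class)

# LDA projects the 4096-D features onto at most (n_classes - 1) axes,
# here 2 of them, which can then be scatter-plotted for the expert.
lda = LinearDiscriminantAnalysis(n_components=2)
embedded = lda.fit_transform(features, labels)
print(embedded.shape)   # (200, 2)
```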

    Emotion Recognition with Deep Neural Networks

    Abstract: Automatic recognition of human emotion has been studied for decades. It is one of the key components in human-computer interaction, with applications in health care, education, entertainment, and advertisement. Emotion recognition is a challenging task, as it involves predicting abstract emotional states from multi-modal input data. These modalities include video, audio, and physiological signals. The visual modality is one of the most informative channels, especially facial expressions, which have been shown to be strong cues for the emotional state of a subject. A common automated emotion recognition system includes several processing steps, each of which has to be tuned and integrated into a pipeline. Such pipelines are often hand-engineered, which can introduce strong assumptions about the properties of the task and data. Limiting assumptions and learning the processing pipeline from data often yields more general solutions. In recent years, deep learning methods have been shown to be able to learn good representations for various modalities. For many computer vision benchmarks, the gap between state-of-the-art algorithms based on deep neural networks and human performance is shrinking rapidly. These networks learn hierarchies of features; with increasing depth, these hierarchies can describe increasingly abstract concepts. This development suggests exploring the applications of such learning methods to facial analysis and emotion recognition. This thesis is based on a preliminary study and three articles, which contribute to the field of emotion recognition. The preliminary study introduces a new variant of Local Binary Patterns (LBPs), which is used as a high-dimensional binary representation of facial images. It is common to create histograms of LBP features within regions of input images; in this work, however, they are used as high-dimensional binary vectors extracted at multiple scales around detected facial keypoints. We examine a pipeline consisting of unsupervised and supervised dimensionality reduction, using Principal Component Analysis (PCA) and Local Fisher Discriminant Analysis (LFDA), followed by a Support Vector Machine (SVM) classifier for prediction of facial expressions. The experiments show that the dimensionality reduction steps provide robustness in the presence of noisy keypoints. At the time, this approach achieved state-of-the-art performance in facial expression recognition on the Extended Cohn-Kanade (CK+) data set (Lucey et al, 2010) and in smile detection on the GENKI data set (GENKI-4K, 2008). For the smile detection task, a deep Convolutional Neural Network (CNN) was used as a strong baseline. Emotion recognition in close-to-real-world videos, such as the Hollywood film clips in the Emotion Recognition in the Wild (EmotiW) challenge (Dhall et al, 2013), is much harder than in controlled lab environments.
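The preliminary study's pipeline (multi-scale LBP features, unsupervised then supervised dimensionality reduction, and an SVM) can be sketched roughly as below. scikit-learn has no LFDA implementation, so plain LDA stands in for it, whole random images stand in for keypoint-centred crops, and all dimensions and parameter values are assumptions rather than the thesis's settings.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data: random grey-level face crops and expression labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(60, 64, 64)).astype(np.uint8)
labels = np.arange(60) % 3

def lbp_vector(img: np.ndarray) -> np.ndarray:
    """Uniform LBP codes of one image, flattened into a feature vector."""
    return local_binary_pattern(img, P=8, R=1, method="uniform").ravel()

features = np.stack([lbp_vector(img) for img in images])

# Unsupervised then supervised reduction, then an SVM, mirroring the
# PCA -> LFDA -> SVM pipeline (plain LDA stands in for LFDA here).
pipeline = make_pipeline(
    PCA(n_components=20),
    LinearDiscriminantAnalysis(n_components=2),
    SVC(kernel="linear"),
)
pipeline.fit(features, labels)
print("train accuracy:", pipeline.score(features, labels))
```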
    The first article is an in-depth analysis of the EmotiW 2013 challenge winning entry (Kahou et al, 2013), with additional experiments on the data set of the 2014 challenge. The pipeline consists of a combination of deep learning models, each specializing in one modality. The models include the following: a novel aggregation of per-frame features helps to transfer powerful CNN features, learned on a large pooled data set of facial expression images, to the video domain; a Deep Belief Network (DBN) learns audio features; an activity recognition pipeline captures spatio-temporal motion features; and a k-means based bag-of-mouths model extracts features around the mouth region. Several approaches for fusing the predictions of the modality-specific models are compared. The performance after re-training on the 2014 data set with a few adaptations is still competitive with the new state of the art. One drawback of the method described in the first article is the aggregation approach for the visual modality, which involves pooling per-frame features into a fixed-length vector. This ignores the temporal order inside the pooled segments. Recurrent Neural Networks (RNNs) are neural networks built for sequential processing of data, which can address this issue by summarizing frames in a real-valued state vector that is updated at each time step. In general, RNNs provide a way of learning an aggregation approach in a data-driven manner. The second article analyzes the application of an RNN on CNN features for emotion recognition in video. A comparison of the RNN with the pooling-based approach shows a significant improvement in classification performance. It also includes feature-level and decision-level fusion of models for different modalities. In addition to the RNN, the same activity pipeline as in previous work, an SVM-based audio model, and the old aggregation model are fused to boost performance on the EmotiW 2015 challenge data set. This approach was the second runner-up in the challenge, within a small margin of 1% in classification accuracy of the challenge winner. The last article focuses on a more general computer vision problem, namely visual tracking. An RNN is augmented with a neural attention mechanism that allows it to focus on task-related information, ignoring potential distractors in input frames. The approach is formulated in a modular neural framework consisting of three components: a recurrent attention module controlling where to look, a feature-extraction module providing a representation of what is seen, and an objective module which indicates why an attentional behaviour is learned. Each module is fully differentiable, allowing simple gradient-based optimization. Such a framework could be used to design an end-to-end solution for emotion recognition in vision, potentially not requiring initial steps of face detection or keypoint localization. The approach is tested on three tracking data sets, including one real-world data set. In summary, this thesis explores and develops a multitude of deep learning techniques, making significant steps towards the long-term goal of building an end-to-end trainable system for emotion recognition.
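The second article's key change is replacing fixed-length pooling of per-frame CNN features with an RNN that keeps temporal order. The following sketch contrasts the two aggregation strategies with a GRU in PyTorch; the feature dimensions and the random stand-in features are assumptions, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn

# Per-frame CNN features for a few clips: random stand-ins shaped
# (batch, frames, feature_dim); real features would come from a
# facial-expression CNN as in the thesis.
batch, frames, feat_dim, n_emotions = 4, 30, 512, 7
frame_features = torch.rand(batch, frames, feat_dim)

# Pooling baseline: average the frame features, then classify.
pool_head = nn.Linear(feat_dim, n_emotions)
pooled_logits = pool_head(frame_features.mean(dim=1))

# RNN aggregation: a GRU reads the frames in order and its final hidden
# state summarises the clip while preserving temporal order.
gru = nn.GRU(input_size=feat_dim, hidden_size=128, batch_first=True)
rnn_head = nn.Linear(128, n_emotions)
_, last_hidden = gru(frame_features)          # last_hidden: 1 x batch x 128
rnn_logits = rnn_head(last_hidden.squeeze(0))

print(pooled_logits.shape, rnn_logits.shape)  # both: batch x 7
```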

    Transfer schemes for deep learning in image classification (Esquemas de transferência para aprendizado profundo em classificação de imagens)

    Advisors: Eduardo Alves do Valle Junior, Sandra Eliza Fontes de Avila. Master's dissertation, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.
    Abstract: In Computer Vision, the task of classification is complex, as it aims to identify the presence of high-level categories in images, depending critically upon learning general models from a set of training samples. Deep Learning (DL) for visual tasks usually involves seamlessly learning every step of this process, from feature extraction to label assignment. This pervasive learning improves the generalization abilities of DL models, but brings its own challenges: a DL model has a huge number of parameters to estimate, thus requiring large amounts of annotated data and computational resources. In this context, transfer learning emerges as a promising solution, allowing one to recycle parameters learned among different models. Motivated by the growing amount of evidence for the potential of such techniques, we study transfer learning for deep architectures applied to image recognition. Our experiments are designed to explore the internal representations of DL architectures, testing their robustness, redundancy, and precision, with applications to the problems of automated melanoma screening, scene recognition (MIT Indoors), and object detection (Pascal VOC). We also take transfer learning to extremes, introducing Complete Transfer Learning, which preserves most of the original model, showing that aggressive transfer schemes can reach competitive results.
    Master's degree in Computer Engineering (Mestre em Engenharia Elétrica).
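The dissertation's Complete Transfer Learning preserves most of a previously trained model. As a generic illustration of such an aggressive transfer scheme (not necessarily the dissertation's exact procedure), the sketch below freezes a pretrained VGG16 and retrains only its final layer for a new task; the two-class target and the random stand-in data are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load a network pretrained on ImageNet and freeze all of its parameters,
# so the transferred weights are preserved untouched.
net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in net.parameters():
    param.requires_grad = False

# Swap only the last classifier layer for the new task (e.g. a hypothetical
# 2-class screening problem); this is the only part that will be trained.
num_new_classes = 2
net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_new_classes)

# Only the new layer's parameters are passed to the optimiser.
optimizer = torch.optim.SGD(net.classifier[6].parameters(), lr=1e-3)

# One illustrative training step on random stand-in data.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, num_new_classes, (8,))
loss = nn.CrossEntropyLoss()(net(images), labels)
loss.backward()
optimizer.step()
```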