20 research outputs found

    The Many Moods of Emotion

    Full text link
    This paper presents a novel approach to the facial expression generation problem. Building upon the assumption of the psychological community that emotion is intrinsically continuous, we first design our own continuous emotion representation with a 3-dimensional latent space derived from a neural network trained on discrete emotion classification. The resulting representation can be used to annotate large in-the-wild datasets, which are then used to train a Generative Adversarial Network. We first show that our model is able to map back to discrete emotion classes while producing images of objectively and subjectively better quality than usual discrete approaches, and also that we are able to span the larger space of possible facial expressions, generating the many moods of emotion. Moreover, two axes of this space generate expression changes similar to those of traditional continuous representations such as arousal-valence. Finally, we show from visual interpretation that the third remaining dimension is highly related to the well-known dominance dimension from psychology.
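
    As a rough illustration of the idea described in this abstract, the sketch below trains a discrete emotion classifier through a 3-dimensional bottleneck whose activations can later serve as continuous annotations. This is a minimal, hypothetical example: the class `EmotionBottleneck`, the 512-d input features and the 7 emotion classes are illustrative choices, not the authors' architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): a classifier whose 3-D bottleneck
# serves as a continuous emotion representation once training is done.
class EmotionBottleneck(nn.Module):
    def __init__(self, feature_dim=512, num_classes=7):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a face CNN
            nn.Linear(feature_dim, 128), nn.ReLU())
        self.to_latent = nn.Linear(128, 3)      # 3-D continuous emotion space
        self.classifier = nn.Linear(3, num_classes)

    def forward(self, x):
        z = self.to_latent(self.backbone(x))    # continuous emotion code
        return self.classifier(z), z

model = EmotionBottleneck()
logits, z = model(torch.randn(8, 512))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 7, (8,)))
# After training, `z` could annotate unlabeled images and condition a GAN.
```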

    Unnatural language processing: How do language models handle machine-generated prompts?

    Full text link
    Language model prompt optimization research has shown that semantically and grammatically well-formed manually crafted prompts are routinely outperformed by automatically generated token sequences with no apparent meaning or syntactic structure, including sequences of vectors from a model's embedding space. We use machine-generated prompts to probe how models respond to input that is not composed of natural language expressions. We study the behavior of models of different sizes in multiple semantic tasks in response to both continuous and discrete machine-generated prompts, and compare it to the behavior in response to human-generated natural-language prompts. Even when producing a similar output, machine-generated and human prompts trigger different response patterns through the network processing pathways, including different perplexities, different attention and output entropy distributions, and different unit activation profiles. We provide preliminary insight into the nature of the units activated by different prompt types, suggesting that only natural language prompts recruit a genuinely linguistic circuit. (Findings of EMNLP 2023, camera-ready.)
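
    A quick sketch of the kind of probing described above: compare prompt perplexity and mean attention entropy for a natural-language prompt versus an arbitrary machine-generated token sequence. The model name and prompts are placeholders, not those used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: probe perplexity and attention entropy of a prompt.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

def probe(prompt: str):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    ppl = torch.exp(out.loss).item()                        # prompt perplexity
    attn = torch.stack(out.attentions)                      # (layers, 1, heads, T, T)
    ent = -(attn * torch.log(attn + 1e-12)).sum(-1).mean()  # mean attention entropy
    return ppl, ent.item()

print(probe("The capital of France is"))   # natural prompt
print(probe("##xy quol parse 71 vek"))     # gibberish "unnatural" prompt
```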

    CAKE: Compact and Accurate K-dimensional representation of Emotion

    Get PDF
    Numerous models describing human emotional states have been built by the psychology community. In parallel, Deep Neural Networks (DNNs) are reaching excellent performance and are becoming interesting feature extraction tools in many computer vision tasks. Inspired by works from the psychology community, we first study the link between the compact two-dimensional representation of emotion known as arousal-valence and the discrete emotion classes (e.g. anger, happiness, sadness, etc.) used in the computer vision community. This enables us to assess the benefits -- in terms of discrete emotion inference -- of adding an extra dimension to arousal-valence (usually named dominance). Building on these observations, we propose CAKE, a 3-dimensional representation of emotion learned in a multi-domain fashion, achieving accurate emotion recognition on several public datasets. Moreover, we visualize how emotion boundaries are organized inside DNN representations and show that DNNs are implicitly learning arousal-valence-like descriptions of emotions. Finally, we use the CAKE representation to compare the quality of the annotations of different public datasets.
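
    The multi-domain aspect mentioned above can be pictured as a shared K-dimensional emotion code with one classification head per dataset. The sketch below is hypothetical: dimensions, dataset names and head sizes are made up for illustration and do not reproduce the paper's setup.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of multi-domain learning of a compact emotion code:
# a shared K-dimensional encoder and a separate classifier head per dataset.
class MultiDomainEmotion(nn.Module):
    def __init__(self, feat_dim=512, k=3, classes_per_dataset={"A": 7, "B": 8}):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, k))            # shared K-dim code
        self.heads = nn.ModuleDict(
            {name: nn.Linear(k, n) for name, n in classes_per_dataset.items()})

    def forward(self, x, dataset):
        return self.heads[dataset](self.encoder(x))

model = MultiDomainEmotion()
logits = model(torch.randn(4, 512), dataset="A")   # batch from dataset "A"
```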

    Estimating semantic structure for the VQA answer space

    Full text link
    Since its appearance, Visual Question Answering (VQA, i.e. answering a question posed over an image) has always been treated as a classification problem over a set of predefined answers. Despite its convenience, this classification approach poorly reflects the semantics of the problem, limiting the answer to a choice among independent proposals without taking into account the similarity between them (e.g. equally penalizing answering cat or German shepherd instead of dog). We address this issue by proposing (1) two measures of proximity between VQA classes, and (2) a corresponding loss which takes into account the estimated proximity. This significantly improves the generalization of VQA models by reducing their language bias. In particular, we show that our approach is completely model-agnostic, since it yields consistent improvements with three different VQA models. Finally, by combining our method with a language bias reduction approach, we report SOTA-level performance on the challenging VQAv2-CP dataset.
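
    One simple way to picture a proximity-aware loss of this kind is a soft cross-entropy whose target mass is spread over answers close to the ground truth. The sketch below is an illustration of the idea, not the paper's exact loss; the similarity matrix `sim` and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative proximity-aware loss: the target distribution is built from a
# precomputed answer-to-answer similarity matrix (e.g. from word embeddings
# of the answer strings), instead of a one-hot target.
def proximity_aware_loss(logits, target_idx, sim, temperature=0.1):
    soft_target = F.softmax(sim[target_idx] / temperature, dim=-1)  # (batch, num_answers)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()

num_answers = 3000
logits = torch.randn(16, num_answers)
target = torch.randint(0, num_answers, (16,))
sim = torch.randn(num_answers, num_answers)   # placeholder similarities
loss = proximity_aware_loss(logits, target, sim)
```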

    Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks

    Get PDF
    The wide adoption of self-attention (i.e. the Transformer model) and BERT-like training principles has recently resulted in a number of high-performing models on a large panoply of vision-and-language problems (such as Visual Question Answering (VQA), image retrieval, etc.). In this paper we claim that these State-Of-The-Art (SOTA) approaches perform reasonably well in structuring information inside a single modality but, despite their impressive performances, they tend to struggle to identify fine-grained inter-modality relationships. Indeed, such relations are frequently assumed to be implicitly learned during training from application-specific losses, mostly cross-entropy for classification. While most recent works provide inductive bias for inter-modality relationships via cross-attention modules, in this work we demonstrate (1) that the latter assumption does not hold, i.e. modality alignment does not necessarily emerge automatically, and (2) that adding weak supervision for alignment between visual objects and words improves the quality of the learned models on tasks requiring reasoning. In particular, we integrate an object-word alignment loss into SOTA vision-language reasoning models and evaluate it on two tasks: VQA and Language-driven Comparison of Images. We show that the proposed fine-grained inter-modality supervision significantly improves performance on both tasks. In particular, this new learning signal allows obtaining SOTA-level performance on the GQA dataset (VQA task) with pre-trained models without fine-tuning on the task, and a new SOTA on the NLVR2 dataset (Language-driven Comparison of Images). Finally, we also illustrate the impact of the contribution on the models' reasoning by visualizing attention distributions.
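
    A word-object alignment objective of the kind described above can be approximated by a softmax over word-to-object similarities, pulling each word towards its matched region. This is a rough, hypothetical sketch of the idea, not the paper's exact formulation; the temperature and embedding sizes are assumptions.

```python
import torch
import torch.nn.functional as F

# Rough sketch of weak word-object alignment supervision: each word embedding
# should be most similar to the object region it refers to (given match_idx).
def word_object_alignment_loss(word_emb, obj_emb, match_idx, temperature=0.07):
    # word_emb: (num_words, d), obj_emb: (num_objects, d)
    # match_idx: (num_words,) index of the object each word refers to
    word_emb = F.normalize(word_emb, dim=-1)
    obj_emb = F.normalize(obj_emb, dim=-1)
    sim = word_emb @ obj_emb.t() / temperature     # (num_words, num_objects)
    return F.cross_entropy(sim, match_idx)

loss = word_object_alignment_loss(
    torch.randn(5, 256), torch.randn(36, 256), torch.randint(0, 36, (5,)))
```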

    An Occam’s Razor View on Learning Audiovisual Emotion Recognition with Small Training Sets

    Get PDF
    This paper presents a lightweight and accurate deep neural model for audiovisual emotion recognition. To design this model, the authors followed a philosophy of simplicity, drastically limiting the number of parameters to learn from the target datasets and always choosing the simplest learning methods: i) transfer learning and low-dimensional space embedding allow us to reduce the dimensionality of the representations; ii) the visual temporal information is handled by a simple score-per-frame selection process, averaged across time; iii) a simple frame selection mechanism is also proposed to weight the images of a sequence; iv) the fusion of the different modalities is performed at prediction level (late fusion). We also highlight the inherent challenges of the AFEW dataset and the difficulty of model selection with as few as 383 validation sequences. The proposed real-time emotion classifier achieved a state-of-the-art accuracy of 60.64% on the test set of AFEW, and ranked 4th at the Emotion in the Wild 2018 challenge.
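
    Points ii)-iv) above can be pictured as weighted averaging of per-frame scores followed by late fusion of the modalities at prediction level. The sketch below is only an illustration of that recipe; the weighting scheme, the 7 classes and the fusion coefficient are made-up placeholders.

```python
import torch

# Minimal sketch of frame weighting plus late fusion at prediction level.
def fuse_predictions(visual_frame_scores, frame_weights, audio_scores, alpha=0.5):
    # visual_frame_scores: (num_frames, num_classes) per-frame class scores
    # frame_weights:       (num_frames,) importance of each frame
    w = torch.softmax(frame_weights, dim=0)
    visual = (w.unsqueeze(1) * visual_frame_scores).sum(dim=0)  # weighted temporal average
    return alpha * visual + (1 - alpha) * audio_scores          # late fusion

pred = fuse_predictions(torch.randn(30, 7), torch.randn(30), torch.randn(7))
emotion = pred.argmax().item()
```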

    How Transferable are Reasoning Patterns in VQA?

    Full text link
    Since its inception, Visual Question Answering (VQA) has been notorious as a task where models are prone to exploit biases in datasets to find shortcuts instead of performing high-level reasoning. Classical methods address this by removing biases from training data, or by adding branches to models to detect and remove biases. In this paper, we argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision-and-language problems. We train a visual oracle and, in a large-scale study, provide experimental evidence that it is much less prone to exploiting spurious dataset biases than standard models. We propose to study the attention mechanisms at work in the visual oracle and compare them with a SOTA Transformer-based model. We provide an in-depth analysis and visualizations of reasoning patterns obtained with an online visualization tool which we make publicly available (https://reasoningpatterns.github.io). We exploit these insights by transferring reasoning patterns from the oracle to a SOTA Transformer-based VQA model taking standard noisy visual inputs, via fine-tuning. In experiments we report higher overall accuracy, as well as higher accuracy on infrequent answers for each question type, which provides evidence for improved generalization and a decreased dependency on dataset biases.
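
    Conceptually, the transfer described above amounts to reusing the cross-modal reasoning blocks of a model trained with perfect ("oracle") visual input as initialisation for a model fed with noisy detector features, then fine-tuning. The sketch below is purely illustrative: `TinyVQA`, its dimensions and the toy transfer step are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Conceptual sketch of oracle-to-student transfer of "reasoning" blocks.
class TinyVQA(nn.Module):
    def __init__(self, d=256, num_answers=3000):
        super().__init__()
        self.visual_proj = nn.Linear(2048, d)    # detector region features -> d
        self.text_proj = nn.Linear(300, d)       # word embeddings -> d
        self.reasoning = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2)
        self.answer_head = nn.Linear(d, num_answers)

    def forward(self, vis, txt):
        tokens = torch.cat([self.visual_proj(vis), self.text_proj(txt)], dim=1)
        return self.answer_head(self.reasoning(tokens).mean(dim=1))

oracle = TinyVQA()    # imagine this one was trained with perfect visual input
student = TinyVQA()
student.reasoning.load_state_dict(oracle.reasoning.state_dict())  # transfer blocks
answer_logits = student(torch.randn(2, 36, 2048), torch.randn(2, 12, 300))
# ...then fine-tune `student` on noisy detector features as usual.
```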

    Biais et raisonnement dans les systèmes de questions réponses visuelles

    No full text
    This thesis addresses the Visual Question Answering (VQA) task through the prism of biases and reasoning. VQA is a visual reasoning task where a model is asked to automatically answer questions posed over images. Despite the impressive improvements made by deep learning approaches, VQA models are notorious for their tendency to rely on dataset biases, preventing them from learning to `reason'. Our first objective is to rethink the evaluation of VQA models. Questions and concepts being unequally distributed, the standard VQA evaluation metric, consisting in measuring the overall in-domain accuracy, tends to favour models which exploit subtle training set statistics. We introduce the GQA-OOD benchmark designed to overcome these concerns: we measure and compare accuracy over both rare and frequent question-answer pairs, and argue that the former is better suited to the evaluation of reasoning abilities. Evaluating models on benchmarks is important but not sufficient; it only gives an incomplete understanding of their capabilities. We conduct a deep analysis of a state-of-the-art Transformer VQA architecture by studying its internal attention mechanisms. Our experiments provide evidence of the existence of operating reasoning patterns, at work in the model's attention layers, when the training conditions are favourable enough. As part of this study, we design an interactive demonstration (available at https://visqa.liris.cnrs.fr/) exploring the question of reasoning vs. bias exploitation in VQA. Finally, drawing conclusions from our evaluations and analyses, we come up with a method for improving VQA model performance. We explore the transfer of reasoning patterns learned by a visual oracle, trained with perfect visual input, to a standard VQA model with imperfect visual representation. Furthermore, we propose to catalyse the transfer through reasoning supervision, either by adding an object-word alignment objective or by predicting the sequence of reasoning operations required to answer the question.

    What colour is the tennis court? How big is the dog? Is there a car to the right of the bicycle under the coconut tree? Answering such fundamental questions is the subject of the task called Visual Question Answering (VQA), in which an agent must answer questions posed over images. More precisely, VQA requires designing an agent able to master a wide variety of skills: recognising objects, recognising attributes (colour, size, material, etc.), identifying relations (e.g. spatial ones), deducing logical chains, etc. This is why VQA is sometimes referred to as a visual Turing test, whose goal is to evaluate an agent's ability to reason over images. This task has recently made significant progress thanks to neural networks and deep learning. After a detailed review of the state of the art in VQA, and a definition of our use of the term reasoning, we address the following question: do current VQA models really reason? The introduction of a new evaluation method (GQA-OOD) allows us to answer this question negatively. In particular, we highlight the tendency of models to learn shortcuts, also called biases, present in the training data, at the expense of generalization abilities. In a third part, we then propose an in-depth analysis of the attention mechanisms learned by artificial neural networks, studying which chains of operations lead to genuine reasoning and which, on the contrary, lead to a prediction biased by a spurious shortcut. The fourth and last part draws conclusions from our evaluations and analyses in order to develop new methods that improve the performance of VQA models. In summary, this thesis studies visual reasoning in artificial neural networks trained with deep learning, in the context of VQA. Above all, our primary interest is the evaluation and analysis of the influence that biases present in the training data have on the predictions of our models.
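
    The GQA-OOD idea mentioned in this abstract boils down to reporting accuracy separately on frequent (head) and rare (tail) answers rather than a single overall score. The sketch below is a simplified illustration of that principle, not the benchmark's actual protocol; the head/tail threshold and the toy data are assumptions.

```python
from collections import Counter

# Simplified sketch: split accuracy by answer frequency (head vs. tail).
def ood_accuracy(predictions, ground_truths, head_fraction=0.2):
    counts = Counter(ground_truths)
    ranked = [a for a, _ in counts.most_common()]
    head = set(ranked[: max(1, int(head_fraction * len(ranked)))])
    buckets = {"head": [], "tail": []}
    for pred, gt in zip(predictions, ground_truths):
        buckets["head" if gt in head else "tail"].append(pred == gt)
    return {k: sum(v) / len(v) if v else None for k, v in buckets.items()}

print(ood_accuracy(["dog", "cat", "red", "2"], ["dog", "dog", "blue", "2"]))
```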

    Biais et raisonnement dans les systèmes de questions réponses visuelles

    No full text
    This thesis addresses the VQA task through the prism of biases and reasoning. VQA is a visual reasoning task where a model is asked to automatically answer questions posed over images. Despite the impressive improvements made by deep learning approaches, VQA models are notorious for their tendency to rely on dataset biases. The large and unbalanced diversity of questions and concepts involved in the task, and the lack of well-annotated data, tend to prevent deep learning models from learning to `reason'. Instead, they learn to perform `shortcuts', relying on specific training set statistics, which does not help them generalize to real-world scenarios. Because the root of this generalization curse is first and foremost a task definition problem, our first objective is to rethink the evaluation of VQA models. Questions and concepts being unequally distributed, the standard VQA evaluation metric, consisting in measuring the overall in-domain accuracy, tends to favour models which exploit subtle training set statistics. If a model predicts the correct answer to a question, is it necessarily reasoning? Can we detect when the model prediction is right for the right reason? And, conversely, can we identify when the model is `cheating' by using statistical shortcuts? We overcome these concerns by introducing the GQA-OOD benchmark: we measure and compare accuracy over both rare and frequent question-answer pairs, and argue that the former is better suited to evaluating reasoning abilities. We experimentally demonstrate that VQA models, including bias reduction methods, dramatically fail in this setting. Evaluating models on benchmarks is important but not sufficient; it only gives an incomplete understanding of their capabilities. We conduct a deep analysis of a state-of-the-art Transformer VQA architecture by studying its internal attention mechanisms. Our experiments provide evidence of the existence of operating reasoning patterns, at work in the model's attention layers, when the training conditions are favourable enough. More precisely, they appear when the visual representation is perfect, suggesting that uncertainty in vision is a dominating factor preventing the learning of reasoning. In collaboration with data visualization experts, we participated in the design of VisQA, a visual analytics tool exploring the question of reasoning vs. shortcuts in VQA. Finally, drawing conclusions from our evaluations and analyses, we come up with methods for improving VQA model performance. First, we propose to directly supervise the reasoning through a proxy loss measuring the fine-grained word-object alignment. We demonstrate, both experimentally and theoretically, the benefit of such reasoning supervision. Second, we explore the transfer of reasoning patterns learned by a visual oracle, trained with perfect visual input, to a standard VQA model with imperfect visual representation. Experiments show that the transfer improves generalization and decreases the dependency on dataset biases. Furthermore, we demonstrate that the reasoning supervision can be used as a catalyst for transferring the reasoning patterns.

    What colour is the tennis court? How big is the dog? Is there a car to the right of the bicycle under the coconut tree? Answering such fundamental questions is the subject of the task called Visual Question Answering (VQA), in which an agent must answer questions posed over images. More precisely, VQA requires designing an agent able to master a wide variety of skills: recognising objects, recognising attributes (colour, size, material, etc.), identifying relations (for example, spatial ones), deducing logical chains, etc. This is why VQA is sometimes referred to as a visual Turing test, whose goal is to evaluate an agent's ability to reason over images. This task has recently made significant progress thanks to neural networks and deep learning. After a detailed review of the state of the art in VQA, and a definition of our use of the term reasoning, we address the following question: do current VQA models really reason? The introduction of a new evaluation method (GQA-OOD) allows us to answer this question negatively. In particular, we highlight the tendency of models to learn shortcuts, also called biases, present in the training data, at the expense of generalization abilities. In a third part, we then propose an in-depth analysis of the attention mechanisms learned by artificial neural networks, studying which chains of operations lead to genuine reasoning and which, on the contrary, lead to a prediction biased by a spurious shortcut. The fourth and last part draws conclusions from our evaluations and analyses in order to develop new methods that improve the performance of VQA models. In summary, this thesis studies visual reasoning in artificial neural networks trained with deep learning, in the context of VQA. Above all, our primary interest is the evaluation and analysis of the influence that biases present in the training data have on the predictions of our models.
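
    The "catalyst" idea mentioned above can be pictured as adding the word-object alignment proxy loss to the usual answer-classification loss during fine-tuning of the transferred model. The sketch below only illustrates that combination; `lambda_align`, the temperature and the tensor shapes are made-up assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: task loss combined with an alignment proxy loss used
# as reasoning supervision during fine-tuning.
def total_loss(answer_logits, answer_target, word_emb, obj_emb, match_idx,
               lambda_align=0.1):
    task = F.cross_entropy(answer_logits, answer_target)
    sim = F.normalize(word_emb, dim=-1) @ F.normalize(obj_emb, dim=-1).t()
    align = F.cross_entropy(sim / 0.07, match_idx)   # word-object alignment term
    return task + lambda_align * align

loss = total_loss(torch.randn(4, 3000), torch.randint(0, 3000, (4,)),
                  torch.randn(5, 256), torch.randn(36, 256),
                  torch.randint(0, 36, (5,)))
```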