Group Affect Prediction Using Multimodal Distributions
We describe our approach to building an efficient predictive model for detecting the emotions of a group of people in an image. We propose that training a Convolutional Neural Network (CNN) model on emotion heatmaps extracted from the image outperforms a CNN model trained entirely on the raw images. The models are compared on the recently published dataset of the Emotion Recognition in the Wild (EmotiW) 2017 challenge. The proposed method achieved a validation accuracy of 55.23%, which is 2.44% above the baseline accuracy provided by the EmotiW organizers.
Comment: This research paper has been accepted at the Workshop on Computer Vision for Active and Assisted Living, WACV 201
Looking Beyond a Clever Narrative: Visual Context and Attention are Primary Drivers of Affect in Video Advertisements
Emotion evoked by an advertisement plays a key role in influencing brand
recall and eventual consumer choices. Automatic ad affect recognition has
several useful applications. However, the use of content-based feature
representations does not give insights into how affect is modulated by aspects
such as the ad scene setting, salient object attributes and their interactions.
Neither do such approaches inform us on how humans prioritize visual
information for ad understanding. Our work addresses these lacunae by
decomposing video content into detected objects, coarse scene structure, object
statistics and actively attended objects identified via eye-gaze. We measure
the importance of each of these information channels by systematically
incorporating related information into ad affect prediction models. Contrary to
the popular notion that ad affect hinges on the narrative and the clever use of
linguistic and social cues, we find that actively attended objects and the
coarse scene structure better encode affective information as compared to
individual scene objects or conspicuous background elements.
Comment: Accepted for publication in the Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, US
Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions
The current study focuses on systematically analyzing the recent advances in
the field of Multimodal eXplainable Artificial Intelligence (MXAI). In
particular, the relevant primary prediction tasks and publicly available
datasets are initially described. Subsequently, a structured presentation of
the MXAI methods of the literature is provided, taking into account the
following criteria: a) the number of involved modalities, b) the stage at which explanations are produced, and c) the type of methodology adopted (i.e., its mathematical formalism). Then, the metrics used for MXAI evaluation are discussed. Finally, a comprehensive analysis of current challenges and future research directions is provided.
Comment: 26 pages, 11 figures
From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction
Visual multimedia have become an inseparable part of our digital social
lives, and they often capture moments tied with deep affections. Automated
visual sentiment analysis tools can provide a means of extracting the rich
feelings and latent dispositions embedded in these media. In this work, we
explore how Convolutional Neural Networks (CNNs), now a de facto machine learning tool, particularly in the area of Computer Vision, can be specifically applied to the task of visual sentiment prediction. We accomplish this through fine-tuning experiments using a state-of-the-art CNN and, via rigorous architecture analysis, present several modifications that lead to accuracy improvements over prior art on a dataset of images from a popular
accuracy improvements over prior art on a dataset of images from a popular
social media platform. We additionally present visualizations of local patterns
that the network learned to associate with image sentiment for insight into how
visual positivity (or negativity) is perceived by the model.
Comment: Accepted for publication in Image and Vision Computing. Models and source code available at https://github.com/imatge-upc/sentiment-201
Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning
Group-level emotion recognition (GER) is an inseparable part of human
behavior analysis, aiming to recognize an overall emotion in a multi-person
scene. However, existing methods are devoted to combining diverse emotion cues while ignoring the inherent uncertainties of unconstrained environments, such as the congestion and occlusion occurring within a group.
Additionally, since only group-level labels are available, inconsistent emotion
predictions among individuals in one group can confuse the network. In this
paper, we propose an uncertainty-aware learning (UAL) method to extract more
robust representations for GER. By explicitly modeling the uncertainty of each
individual, we utilize stochastic embedding drawn from a Gaussian distribution
instead of deterministic point embedding. This representation captures the
probabilities of different emotions and generates diverse predictions through
this stochasticity during the inference stage. Furthermore, uncertainty-sensitive scores are adaptively assigned as the fusion weights of individuals' faces within each group. Moreover, we develop an image-enhancement module to improve the model's robustness against severe noise. The overall three-branch model, encompassing face, object, and scene components, is guided by a proportional-weighted fusion strategy and integrates the proposed uncertainty-aware method to produce the final group-level output. Experimental results demonstrate the effectiveness and generalization ability of our method across three widely used databases.
Comment: 11 pages, 3 figures
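The stochastic-embedding and uncertainty-weighted fusion ideas can be sketched in plain NumPy (the embeddings, dimensions, and inverse-uncertainty weighting rule here are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_embeddings(mu, sigma, n_samples=10):
    """Draw stochastic embeddings from N(mu, diag(sigma^2)) per individual."""
    # mu, sigma: (n_faces, dim)
    eps = rng.standard_normal((n_samples, *mu.shape))
    return mu[None] + sigma[None] * eps          # (n_samples, n_faces, dim)

def uncertainty_weights(sigma):
    """Assign fusion weights inversely proportional to each face's uncertainty."""
    u = sigma.mean(axis=1)                       # scalar uncertainty per face
    w = 1.0 / (u + 1e-8)
    return w / w.sum()

# Three detected faces with 4-d embeddings; the third is heavily occluded
mu = np.array([[0.2, 0.1, 0.7, 0.0],
               [0.3, 0.2, 0.6, 0.1],
               [0.9, 0.9, 0.9, 0.9]])
sigma = np.array([[0.05]*4, [0.05]*4, [0.8]*4])  # high variance = low confidence

w = uncertainty_weights(sigma)
fused = (w[:, None] * mu).sum(axis=0)            # weighted group-level embedding
samples = stochastic_embeddings(mu, sigma)
```

Sampling from the Gaussian rather than using a point embedding is what lets the model produce diverse predictions at inference time, while the low weight on the occluded face keeps it from dominating the fused group representation.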
Dynamic Facial Expression Generation on Hilbert Hypersphere with Conditional Wasserstein Generative Adversarial Nets
In this work, we propose a novel approach for generating videos of the six
basic facial expressions given a neutral face image. We propose to exploit the
face geometry by modeling the facial landmarks motion as curves encoded as
points on a hypersphere. By proposing a conditional version of manifold-valued
Wasserstein generative adversarial network (GAN) for motion generation on the
hypersphere, we learn the distribution of facial expression dynamics of
different classes, from which we synthesize new facial expression motions. The
resulting motions can be transformed into sequences of landmarks and then into image sequences by editing the texture information using another conditional
Generative Adversarial Network. To the best of our knowledge, this is the first
work that explores manifold-valued representations with GAN to address the
problem of dynamic facial expression generation. We evaluate our proposed
approach both quantitatively and qualitatively on two public datasets:
Oulu-CASIA and MUG Facial Expression. Our experimental results demonstrate the
effectiveness of our approach in generating realistic videos with continuous
motion, realistic appearance, and identity preservation. We also show the efficiency of our framework for dynamic facial expression generation, dynamic facial expression transfer, and data augmentation for training improved emotion recognition models.
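One minimal way to place a landmark motion curve on a unit hypersphere, in the spirit described above, is to flatten and L2-normalize the trajectory; this is an illustrative sketch only, and the paper's actual manifold-valued representation may differ:

```python
import numpy as np

def encode_on_hypersphere(landmark_seq):
    """Encode a landmark motion curve as a point on the unit hypersphere
    by flattening and L2-normalizing the trajectory (scale is factored out)."""
    v = np.asarray(landmark_seq, dtype=float).ravel()
    return v / np.linalg.norm(v)

# Toy trajectory: 5 frames x 3 landmarks x (x, y) coordinates
seq = np.arange(1, 31).reshape(5, 3, 2)
p = encode_on_hypersphere(seq)
```

Because the encoding is scale-invariant, two motions that differ only in amplitude map to the same point, which is one reason spherical representations are attractive for comparing expression dynamics.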
Early Differential Diagnosis in Children with Neurodevelopmental Disorders: Psychophysiological Markers Based on Eye-Tracking Methodology
Early diagnosis and intervention have a positive impact on the prognosis and quality of life of children with neurodevelopmental disorders (NDDs) and their families. In this sense, the aim of this thesis project is to contribute to the early differential diagnosis of children with NDDs. To do so, we first investigated the cognitive profiles of five NDDs involving some kind of language disruption, using the Wechsler Intelligence Scale for Children (see Appendix 1). From this study, we identified two issues: (1) that cognitive profiles were not highly conclusive or helpful for the purpose of differential diagnosis, and (2) that we needed to bring forward the age of detection of the disorder to make real progress in early diagnosis. Thus, we carried out a comprehensive review of the issue and focused the project on the analysis of emotional competence (EC) in young children with Autism Spectrum Disorder (ASD) and Developmental Language Disorder (DLD). On the one hand, EC has proved to be more conclusive in yielding differences between disorders at early ages. On the other hand, ASD and DLD are two different NDDs that have shown many similarities in behavioral, cognitive, and linguistic profiles at early ages. This inevitably hampers early diagnosis and, consequently, early intervention, as revealed in our study on the prevalence and impact of the ‘Diagnostic Migration’ phenomenon between ASD and DLD on early intervention (see Appendix 2). Thus, the present doctoral dissertation was conceived to contribute to the differential diagnosis of these conditions at early ages with regard to their abilities in EC. With this purpose, we first conducted a thorough review of the state of the art and understood that the construct of EC was too broad to be fully covered in this project; however, we found some key abilities that had been revealed as promising in differentiating disorders at early ages.
These abilities were related to the visual processing of social-emotional images, whose analysis required eye-tracking methodology. We therefore designed three intertwined experimental eye-tracking studies to yield new evidence on the matter. The first part of this work comprises three chapters that lay out the current knowledge on the early diagnosis of NDDs, the evaluation of EC in children with ASD and those with DLD, and the potential of eye-tracking methodology to contribute to the definition of both conditions as well as to differential diagnosis. The second part of this work includes four chapters describing the experimental studies carried out. The first explains the rationale of the three studies as well as their goals, the participants involved, the main hypotheses, and the designs. The other three chapters present a detailed account of each study. As previously mentioned, we applied eye-tracking methodology along with a paired preference paradigm in all studies to describe the way in which young children with ASD and DLD observe and process social-emotional images. The eye-tracking methodology allowed us to describe children's eye movements during image visualization, while the paired preference paradigm (consisting of the presentation of pairs of images to analyze their saliency and their competing effect on each other) enabled us to identify which stimuli were more visually salient for children and which could capture their attention or prevent it from shifting to the competing image. Thus, by applying these methodological considerations in our three studies, we have unveiled some psychophysiological markers that may contribute to the early identification of children with ASD and DLD (e.g., late orientation to angry and child faces, emotional sensitivity (a visual preference for emotional faces with respect to neutral ones), and more superficial facial processing compared to typical controls).
Finally, the third part of this work consists of one chapter discussing the main results derived from the whole project, stating its limitations, providing some guidelines for future research, and drawing some final conclusions. Hence, this doctoral dissertation contributes scientifically to the current knowledge in the field in several ways: (1) providing an exhaustive review of the social-emotional competence of child populations with ASD and DLD; (2) developing and consolidating a specific methodology based on eye-tracking technology and the paired preference paradigm, which allows a thorough study of these clinical populations, boosting their comprehension and differentiation; (3) revealing psychophysiological markers in young children with ASD and potential descriptors of the visual scanning of faces in young children with DLD; and (4) indicating new pathways for addressing the issue in future studies.
Recognition, Analysis, and Assessments of Human Skills using Wearable Sensors
One of the biggest social issues in mature societies such as those of Europe and Japan is the aging population and declining birth rate. These societies face a serious problem with the retirement of expert workers such as doctors and engineers. Especially in sectors that require a long time to train experts, such as medicine and industry, the retirement and injury of experts is a serious problem. Technology to support the training and assessment of skilled workers (such as doctors and manufacturing workers) is therefore strongly required. Although there are some solutions to this problem, most of them are video-based, which violates the privacy of the subjects. Furthermore, they are not easy to deploy due to the need for large amounts of training data.
This thesis provides a novel framework to recognize, analyze, and assess human
skills with minimum customization cost. The presented framework tackles this problem
in two different domains, industrial setup and medical operations of catheter-based
cardiovascular interventions (CBCVI).
In particular, the contributions of this thesis are four-fold. First, it proposes an easy-to-deploy framework for human activity recognition based on a zero-shot learning approach that learns basic actions and objects. The model recognizes unseen activities as combinations of basic actions, learned in a preliminary stage, and the objects involved. Therefore, it is fully configurable by the user and can be used to detect entirely new activities.
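A compositional zero-shot scheme of this kind can be sketched as follows; the action/object vocabularies, embeddings, and similarity scoring are hypothetical stand-ins, not the thesis implementation:

```python
import numpy as np

# Toy embeddings for learned basic actions and objects (illustrative only)
actions = {"grasp": np.array([1.0, 0.0]), "rotate": np.array([0.0, 1.0])}
objects = {"screw": np.array([1.0, 0.0]), "wrench": np.array([0.0, 1.0])}

def cos(a, b):
    """Cosine similarity between two feature vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def recognize(action_feat, object_feat):
    """Label an unseen activity as the best-matching (action, object) pair,
    even if that exact combination was never seen during training."""
    best_a = max(actions, key=lambda k: cos(action_feat, actions[k]))
    best_o = max(objects, key=lambda k: cos(object_feat, objects[k]))
    return f"{best_a} {best_o}"

# An activity whose combination was never observed: rotating a screw
label = recognize(np.array([0.1, 0.9]), np.array([0.9, 0.2]))
```

The key property is that no labeled example of the composite activity is needed: only the basic action and object vocabularies must be learned, which is what makes the framework configurable by the end user.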
Second, a novel gaze-estimation model for the attention-driven object detection task is
presented. The key features of the model are: (i) usage of the deformable convolutional
layers to better incorporate spatial dependencies of different shapes of objects and
backgrounds, (ii) formulation of the gaze-estimation problem in two different ways, as a classification as well as a regression problem. We combine both formulations using a joint loss that incorporates both the cross-entropy and the mean-squared error to train our model. This improved the model's performance, reducing the error from 6.8 using only the cross-entropy loss to 6.4 with the joint loss.
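A joint classification-plus-regression loss of the kind described can be sketched as follows; the bin layout, angle values, and weighting factor alpha are illustrative assumptions, not the thesis's exact parameters:

```python
import numpy as np

def joint_gaze_loss(logits, pred_angle, target_bin, target_angle, alpha=0.5):
    """Joint loss: cross-entropy over discretized gaze bins plus
    mean-squared error on the continuous gaze angle."""
    # Numerically stable softmax cross-entropy for the classification head
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[target_bin]
    # Mean-squared error for the regression head
    mse = (pred_angle - target_angle) ** 2
    return alpha * ce + (1 - alpha) * mse

# Three gaze bins; the classifier favors bin 0, the regressor predicts 12°
loss = joint_gaze_loss(np.array([2.0, 0.5, -1.0]), pred_angle=12.0,
                       target_bin=0, target_angle=10.0)
```

Combining the two terms lets the coarse bin classification stabilize training while the regression term preserves continuous angular precision.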
The third contribution of this thesis targets the quantification of the quality of actions using wearable sensors. To address the variety of scenarios, we have targeted two possibilities: a) both expert and novice data are available, and b) only expert data is available, a quite common case in safety-critical scenarios.
Both of the methods developed for these scenarios are deep-learning based. In the first, we use autoencoders with a One-Class SVM, and in the second we use Siamese networks. These methods allow us to encode the expert's expertise and to learn the differences between novice and expert workers. This enables quantification of a novice's performance in comparison to an expert worker.
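The expert-only scenario can be sketched with scikit-learn's One-Class SVM; the sensor features and their distributions below are synthetic stand-ins, not the thesis's data or feature pipeline:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Hypothetical wearable-sensor features: expert executions cluster tightly,
# novice executions drift away from that cluster
expert_feats = rng.normal(0.0, 0.3, size=(200, 6))
novice_feats = rng.normal(1.5, 0.6, size=(50, 6))

# Fit only on expert executions -- the "only expert data available" case
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(expert_feats)

# decision_function: higher = more expert-like; usable as a skill score
expert_scores = model.decision_function(expert_feats)
novice_scores = model.decision_function(novice_feats)
```

Because the model only needs expert recordings, it fits safety-critical settings where collecting labeled novice failure data is impractical; the decision score then ranks how close a novice's execution is to expert behavior.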
The fourth contribution explicitly targets medical practitioners and provides a methodology for novel gaze-based temporal-spatial analysis of CBCVI data. The developed methodology allows continuous registration and analysis of gaze data for analyzing the visual X-ray image processing (XRIP) strategies of expert operators in live-case scenarios, and may assist in transferring experts' reading skills to novices.
CAA-Net: Conditional Atrous CNNs with attention for explainable device-robust acoustic scene classification
Acoustic Scene Classification (ASC) aims to classify the environment in which
the audio signals are recorded. Recently, Convolutional Neural Networks (CNNs)
have been successfully applied to ASC. However, the data distributions of the
audio signals recorded with multiple devices are different. There has been
little research on the training of robust neural networks on acoustic scene
datasets recorded with multiple devices, and on explaining the operation of the
internal layers of the neural networks. In this article, we focus on training
and explaining device-robust CNNs on multi-device acoustic scene data. We
propose conditional atrous CNNs with attention for multi-device ASC. Our
proposed system contains an ASC branch and a device classification branch, both
modelled by CNNs. We visualise and analyse the intermediate layers of the
atrous CNNs. A time-frequency attention mechanism is employed to analyse the
contribution of each time-frequency bin of the feature maps in the CNNs. On the
Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 ASC
dataset, recorded with three devices, our proposed model performs significantly
better than CNNs trained on single-device data.
Comment: IEEE Transactions on Multimedi
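A time-frequency attention pooling step of the kind described can be sketched in NumPy; the feature-map shapes and the pooling rule are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def tf_attention_pool(feature_map, attn_logits):
    """Pool a CNN feature map with a time-frequency attention mask:
    softmax over all (time, freq) bins, then a weighted sum per channel."""
    # feature_map: (channels, time, freq); attn_logits: (time, freq)
    a = np.exp(attn_logits - attn_logits.max())
    a /= a.sum()                                  # attention weights sum to 1
    pooled = (feature_map * a[None]).sum(axis=(1, 2))
    return pooled, a

# Toy feature map with 4 channels over an 8x8 time-frequency grid
fm = np.ones((4, 8, 8))
logits = np.zeros((8, 8))                         # uniform attention
pooled, attn = tf_attention_pool(fm, logits)
```

Inspecting the learned attention map `attn` is also what enables the explainability analysis: it shows which time-frequency bins contributed most to each scene decision.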