3,415 research outputs found
First impressions: A survey on vision-based apparent personality trait analysis
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Personality analysis has been widely studied in psychology, neuropsychology, and signal processing fields, among others. From the past few years, it also became an attractive research area in visual computing. From the computational point of view, by far speech and text have been the most considered cues of information for analyzing personality. However, recently there has been an increasing interest from the computer vision community in analyzing personality from visual data. Recent computer vision approaches are able to accurately analyze human faces, body postures and behaviors, and use these information to infer apparent personality traits. Because of the overwhelming research interest in this topic, and of the potential impact that this sort of methods could have in society, we present in this paper an up-to-date review of existing vision-based approaches for apparent personality trait recognition. We describe seminal and cutting edge works on the subject, discussing and comparing their distinctive features and limitations. Future venues of research in the field are identified and discussed. Furthermore, aspects on the subjectivity in data labeling/evaluation, as well as current datasets and challenges organized to push the research on the field are reviewed.Peer ReviewedPostprint (author's final draft
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Humor is a unique and creative communicative behavior displayed during social
interactions. It is produced in a multimodal manner, through the usage of words
(text), gestures (vision) and prosodic cues (acoustic). Understanding humor
from these three modalities falls within boundaries of multimodal language; a
recent research trend in natural language processing that models natural
language as it happens in face-to-face communication. Although humor detection
is an established research area in NLP, in a multimodal context it is an
understudied area. This paper presents a diverse multimodal dataset, called
UR-FUNNY, to open the door to understanding multimodal language used in
expressing humor. The dataset and accompanying studies, present a framework in
multimodal humor detection for the natural language processing community.
UR-FUNNY is publicly available for research
Empathy Detection Using Machine Learning on Text, Audiovisual, Audio or Physiological Signals
Empathy is a social skill that indicates an individual's ability to
understand others. Over the past few years, empathy has drawn attention from
various disciplines, including but not limited to Affective Computing,
Cognitive Science and Psychology. Empathy is a context-dependent term; thus,
detecting or recognising empathy has potential applications in society,
healthcare and education. Despite being a broad and overlapping topic, the
avenue of empathy detection studies leveraging Machine Learning remains
underexplored from a holistic literature perspective. To this end, we
systematically collect and screen 801 papers from 10 well-known databases and
analyse the selected 54 papers. We group the papers based on input modalities
of empathy detection systems, i.e., text, audiovisual, audio and physiological
signals. We examine modality-specific pre-processing and network architecture
design protocols, popular dataset descriptions and availability details, and
evaluation protocols. We further discuss the potential applications, deployment
challenges and research gaps in the Affective Computing-based empathy domain,
which can facilitate new avenues of exploration. We believe that our work is a
stepping stone to developing a privacy-preserving and unbiased empathic system
inclusive of culture, diversity and multilingualism that can be deployed in
practice to enhance the overall well-being of human life
Multimodal Sentiment Analysis: Perceived vs Induced Sentiments
Social media has created a global network where people can easily access and
exchange vast information. This information gives rise to a variety of
opinions, reflecting both positive and negative viewpoints. GIFs stand out as a
multimedia format offering a visually engaging way for users to communicate. In
this research, we propose a multimodal framework that integrates visual and
textual features to predict the GIF sentiment. It also incorporates attributes
including face emotion detection and OCR generated captions to capture the
semantic aspects of the GIF. The developed classifier achieves an accuracy of
82.7% on Twitter GIFs, which is an improvement over state-of-the-art models.
Moreover, we have based our research on the ReactionGIF dataset, analysing the
variance in sentiment perceived by the author and sentiment induced in the
reade
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep Neural Networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical level semantics in
order to model sentence level semantics. These findings demonstrate the
importance of visual information in semantics
Speech-based recognition of self-reported and observed emotion in a dimensional space
The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance
- …