17 research outputs found
Personality Trait Classification via Co-Occurrent Multiparty Multimodal Event Discovery
This paper proposes a novel feature extraction framework from multi-party multimodal conversation for inference of personality traits and emergent leadership. The proposed framework represents multimodal features as the combination of each participant’s nonverbal activity and group activity. This feature representation makes it possible to compare the nonverbal patterns extracted from the participants of different groups in a metric space. It captures how the target member produces nonverbal behavior observed in a group (e.g. the member speaks while all members move their body), and is applicable to any kind of multiparty conversation task. Frequent co-occurrent events are discovered using graph clustering from multimodal sequences. The proposed framework is applied to the ELEA corpus, which is an audio-visual dataset collected from group meetings. We evaluate the framework on a binary classification task for 10 personality traits. Experimental results show that the model trained with co-occurrence features obtained higher accuracy than previous related work in 8 out of 10 traits. In addition, the co-occurrence features improve the accuracy by 2% up to 17%.
Speaker Profiling in Multiparty Conversations
In conversational settings, individuals exhibit unique behaviors, rendering a
one-size-fits-all approach insufficient for generating responses by dialogue
agents. Although past studies have aimed to create personalized dialogue agents
using speaker persona information, they have relied on the assumption that the
speaker's persona is already provided. However, this assumption is not always
valid, especially when it comes to chatbots utilized in industries like
banking, hotel reservations, and airline bookings. This research paper aims to
fill this gap by exploring the task of Speaker Profiling in Conversations
(SPC). The primary objective of SPC is to produce a summary of persona
characteristics for each individual speaker present in a dialogue. To
accomplish this, we have divided the task into three subtasks: persona
discovery, persona-type identification, and persona-value extraction. Given a
dialogue, the first subtask aims to identify all utterances that contain
persona information. Subsequently, the second task evaluates these utterances
to identify the type of persona information they contain, while the third
subtask identifies the specific persona values for each identified type. To
address the task of SPC, we have curated a new dataset named SPICE, which comes
with specific labels. We have evaluated various baselines on this dataset and
benchmarked it with a new neural model, SPOT, which we introduce in this paper.
Furthermore, we present a comprehensive analysis of SPOT, examining the
limitations of individual modules both quantitatively and qualitatively. Comment: 10 pages, 3 figures, 12 tables
Transfer Learning for Personality Perception via Speech Emotion Recognition
Holistic perception of affective attributes is an important human perceptual
ability. However, this ability is far from being realized in current affective
computing, as not all of the attributes are well studied and their
interrelationships are poorly understood. In this work, we investigate the
relationship between two affective attributes: personality and emotion, from a
transfer learning perspective. Specifically, we transfer Transformer-based and
wav2vec2-based emotion recognition models to perceive personality from speech
across corpora. Compared with previous studies, our results show that
transferring emotion recognition is effective for personality perception.
Moreover, this allows for better use and exploration of small personality
corpora. We also provide novel findings on the relationship between personality
and emotion that will aid future research on holistic affect recognition. Comment: Accepted to INTERSPEECH 202
First impressions: A survey on vision-based apparent personality trait analysis
© 2019 IEEE. Personality analysis has been widely studied in psychology, neuropsychology, and signal processing fields, among others. In the past few years, it has also become an attractive research area in visual computing. From the computational point of view, speech and text have been by far the most considered cues of information for analyzing personality. However, recently there has been an increasing interest from the computer vision community in analyzing personality from visual data. Recent computer vision approaches are able to accurately analyze human faces, body postures and behaviors, and use this information to infer apparent personality traits. Because of the overwhelming research interest in this topic, and of the potential impact that this sort of method could have on society, we present in this paper an up-to-date review of existing vision-based approaches for apparent personality trait recognition. We describe seminal and cutting-edge works on the subject, discussing and comparing their distinctive features and limitations. Future avenues of research in the field are identified and discussed. Furthermore, aspects of subjectivity in data labeling/evaluation, as well as current datasets and challenges organized to push research in the field, are reviewed. Peer reviewed. Postprint (author's final draft).
Audio-visual deep learning regression of apparent personality
Bachelor's thesis in Computer Engineering (Treballs Finals de Grau d'Enginyeria Informàtica), Facultat de Matemàtiques, Universitat de Barcelona, Year: 2020, Supervisors: Sergio Escalera Guerrero, Cristina Palmero Cantariño and Julio Jacques Junior. [en] Personality perception is based on the relationship of a human being with the individuals in his or her surroundings. This kind of perception allows conclusions to be drawn from the analysis and interpretation of what is observable, mainly facial expressions, tone of voice and other nonverbal signals, allowing the construction of an apparent personality (or first impression) of people. Apparent personality (or first impressions) is subjective, and subjectivity is an inherent property of perception based exclusively on the point of view of each individual. In this project, we approximate such subjectivity using a multi-modal deep neural network with audiovisual signals as input and a late fusion strategy of handcrafted features, achieving accurate results. The aim of this work is to analyze the influence of the following characteristics on the automatic prediction of apparent personality (based on the Big-Five model): raw audio, visual information (sequence of face images) and high-level features, including Ekman's universal basic emotions, gender and age. To this end, we have defined different modalities, combining them and determining how much they contribute to the regression of apparent personality traits. The most remarkable results obtained through the experiments are as follows: in all modalities, females have a higher average accuracy than males, except in the audio-only modality; for the happy emotion, the best accuracy score is found in the Conscientiousness trait; the Extraversion and Conscientiousness traits get the highest accuracy scores in almost all emotions; visual information is the modality that most positively influences the results; the chosen combination of high-level features slightly improves the accuracy of the predictions.
Multimodal Human-Human-Robot Interactions (MHHRI) Dataset for Studying Personality and Engagement
In this paper we introduce a novel dataset, the Multimodal Human-Human-Robot-Interactions (MHHRI) dataset, with the aim of studying personality simultaneously in human-human interactions (HHI) and human-robot interactions (HRI) and its relationship with engagement. Multimodal data was collected during a controlled interaction study where dyadic interactions between two human participants and triadic interactions between two human participants and a robot took place with interactants asking a set of personal questions to each other. Interactions were recorded using two static and two dynamic cameras as well as two biosensors, and meta-data was collected by having participants fill in two types of questionnaires, for assessing their own personality traits and their perceived engagement with their partners (self labels) and for assessing personality traits of the other participants partaking in the study (acquaintance labels). As a proof of concept, we present baseline results for personality and engagement classification. Our results show that (i) trends in personality classification performance remain the same with respect to the self and the acquaintance labels across the HHI and HRI settings; (ii) for extroversion, the acquaintance labels yield better results as compared to the self labels; (iii) in general, multi-modality yields better performance for the classification of personality traits. This work was funded by the EPSRC under its IDEAS Factory Sandpits call on Digital Personhood (Grant Ref: EP/L00416X/1).
Towards Video Transformers for Automatic Human Analysis
[eng] With the aim of creating artificial systems capable of mirroring the nuanced understanding and interpretative powers inherent to human cognition, this thesis embarks on an exploration of the intersection between human analysis and Video Transformers. The objective is to harness the potential of Transformers, a promising architectural paradigm, to comprehend the intricacies of human interaction, thus paving the way for the development of empathetic and context-aware intelligent systems. In order to do so, we explore the whole Computer Vision pipeline, from data gathering, to deeply analyzing recent developments, through model design and experimentation.
Central to this study is the creation of UDIVA, an expansive multi-modal, multi-view dataset capturing dyadic face-to-face human interactions. Comprising 147 participants across 188 sessions, UDIVA integrates audio-visual recordings, heart-rate measurements, personality assessments, socio-demographic metadata, and conversational transcripts, establishing itself as the largest dataset for dyadic human interaction analysis up to this date. This dataset provides a rich context for probing the capabilities of Transformers within complex environments. In order to validate its utility, as well as to elucidate Transformers' ability to assimilate diverse contextual cues, we focus on addressing the challenge of personality regression within interaction scenarios. We first adapt an existing Video Transformer to handle multiple contextual sources and conduct rigorous experimentation. We empirically observe a progressive enhancement in model performance as more context is added, reinforcing the potential of Transformers to decode intricate human dynamics. Building upon these findings, the Dyadformer emerges as a novel architecture, adept at long-range modeling of dyadic interactions. By jointly modeling both participants in the interaction, as well as embedding multi-modal integration into the model itself, the Dyadformer surpasses the baseline and other concurrent approaches, underscoring Transformers' aptitude in deciphering multifaceted, noisy, and challenging tasks such as the analysis of human personality in interaction.
Nonetheless, these experiments unveil the ubiquitous challenges when training Transformers, particularly in managing overfitting due to their demand for extensive datasets. Consequently, we conclude this thesis with a comprehensive investigation into Video Transformers, analyzing topics ranging from architectural designs and training strategies, to input embedding and tokenization, traversing through multi-modality and specific applications. Across these, we highlight trends which optimally harness spatio-temporal representations that handle video redundancy and high dimensionality. A culminating performance comparison is conducted in the realm of video action classification, spotlighting strategies that exhibit superior efficacy, even compared to traditional CNN-based methods.
[cat, translated from Catalan] This thesis seeks to create artificial systems that reflect human abilities of understanding and interpretation through the use of Video Transformers. The goal is to use these architectures to better understand human interaction and to develop intelligent, context-aware systems. This involves exploring broad areas of Computer Vision, from data collection to analysis of the state of the art and experimental testing of these models.
An essential part of this study is the creation of UDIVA, a large multimodal, multi-view dataset that records face-to-face human interactions. With 147 participants and 188 sessions, UDIVA includes audiovisual content, heart rate, personality profiles, sociodemographic data, and conversation transcripts. It is the largest known dataset for the analysis of dyadic human interaction and provides a rich context for studying the capabilities of Transformers in complex environments. To validate its usefulness and the abilities of Transformers, we focus on personality regression. Initially, we adapt a Video Transformer to integrate several sources of context. Through exhaustive experiments, we observe progressive improvements in the results as more context is included, confirming the capability of Transformers. Motivated by these results, we develop the Dyadformer, an architecture for long-duration dyadic interactions. This new architecture considers both participants in the interaction simultaneously and incorporates multimodality into a single model. The Dyadformer outperforms our initial proposal and other similar works, highlighting the capacity of Transformers to tackle complex tasks.
However, these experiments reveal challenges in training Transformers, such as overfitting, owing to their need for large datasets. The thesis concludes with an in-depth analysis of Video Transformers, including architectural designs, training strategies, video preprocessing, tokenization, and multimodality. Trends for managing the redundancy and high dimensionality of videos are identified, and a performance comparison is carried out on video action classification, highlighting strategies that are more effective than traditional convolution-based methods.
Exploiting Group Structures to Infer Social Interactions From Videos
In this thesis, we consider the task of inferring the social interactions between humans by analyzing multi-modal data. Specifically, we attempt to solve some of the problems in interaction analysis, such as long-term deception detection, political deception detection, and impression prediction. In this work, we emphasize the importance of using knowledge about the group structure of the analyzed interactions. Previous works on the matter mostly neglected this aspect and analyzed a single subject at a time. Using the new Resistance dataset, collected by our collaborators, we approach the problem of long-term deception detection by designing a class of histogram-based features and a novel class of meta-features we call LiarRank. We develop a LiarOrNot model to identify spies in Resistance videos. We achieve AUCs of over 0.70, outperforming our baselines by 3% and human judges by 12%. For the problem of political deception, we first collect a dataset of videos and transcripts of 76 politicians from 18 countries making truthful and deceptive statements. We call it the Global Political Deception Dataset. We then show how to analyze the statements in a broader context by building a Video-Article-Topic graph. From this graph, we create a novel class of features called Deception Score that captures how controversial each topic is and how it affects the truthfulness of each statement. We show that our approach achieves 0.775 AUC, outperforming competing baselines. Finally, we use the Resistance data to solve the problem of dyadic impression prediction. Our proposed Dyadic Impression Prediction System (DIPS) contains four major innovations: a novel class of features called emotion ranks, sign imbalance features derived from signed graph theory, a novel method to align the facial expressions of subjects, and the concept of a multilayered stochastic network we call Temporal Delayed Network. Our DIPS architecture beats eight baselines from the literature, yielding statistically significant improvements of 19.9-30.8% in AUC.
Modelling person-specific and multi-scale facial dynamics for automatic personality and depression analysis
‘To know oneself is true progress’. While one's identity is difficult to describe fully, a key part of it is one’s personality. Accurately understanding personality can benefit various aspects of human life. There is convergent evidence suggesting that personality traits are marked by non-verbal facial expressions of emotions, which in theory means that automatic personality assessment is possible from facial behaviours. Thus, this thesis aims to develop video-based automatic personality analysis approaches. Specifically, two video-level dynamic facial behaviour representations are proposed for automatic personality trait estimation, namely a person-specific representation and a spectral representation, which focus on addressing three issues that have frequently occurred in existing automatic personality analysis approaches: 1. attempting to use very short video segments or even a single frame to infer personality traits; 2. lack of a proper way to retain multi-scale long-term temporal information; 3. lack of methods to encode person-specific facial dynamics that are relatively stable over time but differ across individuals.
This thesis starts by extending the dynamic image algorithm to model the preceding and succeeding short-term face dynamics of each frame in a video, which achieved good performance in estimating valence/arousal intensities, showing the good dynamic encoding ability of such a dynamic representation. This thesis then proposes a novel Rank Loss, aiming to train a network that produces a similar per-frame dynamic representation but only from a still image. This way, the network can learn generic facial dynamics from unlabelled face videos in a self-supervised manner. Based on such an approach, the person-specific representation encoding approach is proposed. It first freezes the well-trained generic network, and incorporates a set of intermediate filters, which are trained again but with only person-specific videos based on the same self-supervised learning approach. As a result, the learned filters' weights are person-specific, and can be concatenated as a 1-D video-level person-specific representation. Meanwhile, this thesis also proposes a spectral analysis approach to retain multi-scale video-level facial dynamics. This approach uses automatically detected human behaviour primitives as the low-dimensional descriptor for each frame, and converts long and variable-length time-series behaviour signals to small and length-independent spectral representations to represent video-level multi-scale temporal dynamics of expressive behaviours. Consequently, the combination of the two representations, which contains not only multi-scale video-level facial dynamics but also person-specific video-level facial dynamics, can be applied to automatic personality estimation.
This thesis conducts a series of experiments to validate the proposed approaches: 1. the arousal/valence intensity estimation is conducted on both a controlled face video dataset (SEMAINE) and a wild face video dataset (Affwild-2), to evaluate the dynamic encoding capability of the proposed Rank Loss; 2. the proposed automatic personality trait recognition systems (spectral representation and person-specific representation) are evaluated on face video datasets that are labelled with either 'Big-Five' apparent personality traits (ChaLearn) or self-reported personality traits (VHQ); 3. the depression studies are also evaluated on the VHQ dataset, which is labelled with PHQ-9 depression scores. The experimental results on the automatic personality trait and depression severity estimation tasks show the person-specific representation's good performance in the personality task and the spectral vector's superior performance in the depression task. In particular, the proposed person-specific approach achieved a similar performance to the state-of-the-art method in the apparent personality trait recognition task and achieved at least 15% PCC improvements over other approaches in the self-reported personality trait recognition task. Meanwhile, the proposed spectral representation shows better performance than the person-specific approach in the depression severity estimation task. In addition, this thesis also found that adding personality trait labels/predictions into behaviour descriptors improved depression severity estimation results.