15 research outputs found
Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results
Humour is a substantial element of human affect and cognition. Its automatic
understanding can facilitate a more naturalistic human-device interaction and
the humanisation of artificial intelligence. Current methods of humour
detection are solely based on staged data making them inadequate for
'real-world' applications. We address this deficiency by introducing the novel
Passau-Spontaneous Football Coach Humour (Passau-SFCH) dataset, comprising
about 11 hours of recordings. The Passau-SFCH dataset is annotated for the
presence of humour and its dimensions (sentiment and direction) as proposed in
Martin's Humor Styles Questionnaire. We conduct a series of experiments,
employing pretrained Transformers, convolutional neural networks, and
expert-designed features. The performance of each modality (text, audio, video)
for spontaneous humour recognition is analysed and their complementarity is
investigated. Our findings suggest that for the automatic analysis of humour
and its sentiment, facial expressions are most promising, while humour
direction can be best modelled via text-based features. The results reveal
considerable differences among various subjects, highlighting the individuality
of humour usage and style. Further, we observe that a decision-level fusion
yields the best recognition result. Finally, we make our code publicly
available at https://www.github.com/EIHW/passau-sfch. The Passau-SFCH dataset
is available upon request.
Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. (Major Revision)
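The decision-level fusion found to work best above can be sketched as a weighted average of per-modality humour probabilities. This is an illustrative sketch, not the paper's tuned configuration; the weights and scores below are invented.

```python
# Decision-level (late) fusion: each modality model yields a humour
# probability per window; the fused score is a weighted average.

def late_fusion(scores_per_modality, weights=None):
    """Fuse per-modality probabilities by weighted averaging.

    scores_per_modality: dict mapping modality name -> list of probabilities.
    weights: optional dict mapping modality name -> weight (default: equal).
    """
    modalities = list(scores_per_modality)
    if weights is None:
        weights = {m: 1.0 for m in modalities}
    total = sum(weights[m] for m in modalities)
    n = len(next(iter(scores_per_modality.values())))
    return [
        sum(weights[m] * scores_per_modality[m][i] for m in modalities) / total
        for i in range(n)
    ]

scores = {
    "text":  [0.2, 0.8, 0.6],
    "audio": [0.3, 0.7, 0.5],
    "video": [0.4, 0.9, 0.7],
}
fused = late_fusion(scores)
print([round(s, 2) for s in fused])  # equal weights: per-window means
```

In practice the weights could be tuned on a development partition, e.g. to favour the facial modality, which the abstract reports as most promising.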
Zero-shot personalization of speech foundation models for depressed mood monitoring
The monitoring of depressed mood plays an important role as a diagnostic tool in psychotherapy. An automated analysis of speech can provide a non-invasive measurement of a patient’s affective state. While speech has been shown to be a useful biomarker for depression, existing approaches mostly build population-level models that aim to predict each individual’s diagnosis as a (mostly) static property. Because of inter-individual differences in symptomatology and mood regulation behaviors, these approaches are ill-suited to detect smaller temporal variations in depressed mood. We address this issue by introducing a zero-shot personalization of large speech foundation models. Compared with other personalization strategies, our work does not require labeled speech samples for enrollment. Instead, the approach makes use of adapters conditioned on subject-specific metadata. On a longitudinal dataset, we show that the method improves performance compared with a set of suitable baselines. Finally, applying our personalization strategy improves individual-level fairness.
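A metadata-conditioned adapter of the kind described can be sketched as a FiLM-style scale-and-shift applied to a frozen foundation-model embedding. All names, dimensions, and weights below are illustrative assumptions; the paper's actual architecture may differ.

```python
# Sketch of zero-shot personalisation: a small adapter conditions a frozen
# speech embedding on subject metadata, so no labelled enrollment samples
# are needed. FiLM-style: embedding * (1 + scale(meta)) + shift(meta).

def adapter(embedding, metadata, w_scale, w_shift):
    """Condition a speech embedding on subject-specific metadata.

    embedding: list of floats from the frozen foundation model.
    metadata:  list of floats encoding subject attributes (hypothetical).
    w_scale, w_shift: per-dimension weight rows mapping metadata to a
        multiplicative scale and an additive shift.
    """
    def project(weights):
        return [sum(w * m for w, m in zip(row, metadata)) for row in weights]

    scale = project(w_scale)
    shift = project(w_shift)
    return [e * (1.0 + s) + b for e, s, b in zip(embedding, scale, shift)]

emb = [0.5, -1.0]
meta = [1.0, 0.0]                   # hypothetical one-hot metadata encoding
w_scale = [[0.1, 0.0], [0.0, 0.2]]  # illustrative "learned" weights
w_shift = [[0.0, 0.0], [0.3, 0.0]]
print(adapter(emb, meta, w_scale, w_shift))
```

Because the adapter is driven purely by metadata, a new subject is personalised "zero-shot": no labelled speech from that subject is required.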
Ecology & computer audition: applications of audio technology to monitor organisms and environment
Among the 17 Sustainable Development Goals (SDGs) proposed within the 2030 Agenda and adopted by all the United Nations member states, the 13th SDG is a call for action to combat climate change. Moreover, SDGs 14 and 15 claim the protection and conservation of life below water and life on land, respectively. In this work, we provide a literature-founded overview of application areas in which computer audition – a powerful technology combining audio signal processing and machine intelligence that has so far hardly been considered in this context – is employed to monitor our ecosystem, with the potential to identify ecologically critical processes or states. We distinguish between applications related to organisms, such as species richness analysis and plant health monitoring, and applications related to the environment, such as melting ice monitoring or wildfire detection. This work positions computer audition in relation to alternative approaches by discussing methodological strengths and limitations, as well as ethical aspects. We conclude with an urgent call to action to the research community for a greater involvement of audio intelligence methodology in future ecosystem monitoring approaches.
Audio self-supervised learning: a survey
Inspired by the human cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims to discover general representations
from large-scale data without requiring human annotations, which are
expensive and time-consuming to obtain. Its success in the fields of computer
vision and natural language processing has prompted its recent adoption into the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
suitable benchmarks to evaluate the power of SSL in the computer audition
domain. Finally, we discuss some open problems and point out future
directions for the development of audio SSL.
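One family of SSL objectives covered by such surveys is contrastive learning: two augmented "views" of the same audio clip should embed closer to each other than to other clips. A minimal InfoNCE-style sketch with toy embeddings (values invented for illustration):

```python
# Minimal contrastive SSL objective (InfoNCE-style), as commonly used in
# audio SSL. Lower loss when the anchor and its positive view align.
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Return the InfoNCE loss for one anchor embedding."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Similarity logits: positive pair first, then all negatives.
    logits = [cosine(anchor, positive) / temperature] + [
        cosine(anchor, n) / temperature for n in negatives
    ]
    log_denominator = math.log(sum(math.exp(l) for l in logits))
    return log_denominator - logits[0]  # -log softmax of the positive

# The positive is a near-duplicate view; the negative points elsewhere.
loss_aligned = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
print(loss_aligned < loss_misaligned)  # aligned views give a lower loss
```

In real audio SSL pipelines the embeddings come from a neural encoder and the views from augmentations such as cropping or masking; this sketch only shows the shape of the objective.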
The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and Stress
The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to
multimodal sentiment and emotion recognition. For this year's challenge, we
feature three datasets: (i) the Passau Spontaneous Football Coach Humor
(Passau-SFCH) dataset that contains audio-visual recordings of German football
coaches, labelled for the presence of humour; (ii) the Hume-Reaction dataset in
which reactions of individuals to emotional stimuli have been annotated with
respect to seven emotional expression intensities, and (iii) the Ulm-Trier
Social Stress Test (Ulm-TSST) dataset comprising audio-visual data labelled
with continuous emotion values (arousal and valence) of people in stressful
dispositions. Using the introduced datasets, MuSe 2022 addresses three
contemporary affective computing problems: in the Humor Detection Sub-Challenge
(MuSe-Humor), spontaneous humour has to be recognised; in the Emotional
Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained `in-the-wild'
emotions have to be predicted; and in the Emotional Stress Sub-Challenge
(MuSe-Stress), a continuous prediction of stressed emotion values is featured.
The challenge is designed to attract different research communities,
encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the
communities of audio-visual emotion recognition, health informatics, and
symbolic sentiment analysis. This baseline paper describes the datasets as well
as the feature sets extracted from them. A recurrent neural network with LSTM
cells is used to set competitive baseline results on the test partitions for
each sub-challenge. We report an Area Under the Curve (AUC) of .8480 for
MuSe-Humor; a mean (across 7 classes) Pearson's Correlation Coefficient of
.2801 for MuSe-Reaction; and Concordance Correlation Coefficient (CCC) values
of .4931 and .4761 for valence and arousal, respectively, in MuSe-Stress.
Comment: Preliminary baseline paper for the 3rd Multimodal Sentiment Analysis Challenge (MuSe) 2022, a full-day workshop at ACM Multimedia 2022
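The CCC used as the MuSe-Stress metric rewards not only correlation but also agreement in mean and variance between prediction and gold standard. A minimal implementation (the toy series are invented for illustration):

```python
# Concordance Correlation Coefficient:
# CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)

def ccc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

perfect = ccc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # correlated but offset
print(round(perfect, 3), round(shifted, 3))
```

Note how a constant offset lowers CCC even though Pearson's correlation would remain 1.0, which is why CCC is preferred for continuous arousal/valence prediction.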
Personalised depression forecasting using mobile sensor data and ecological momentary assessment
Introduction
Digital health interventions are an effective way to treat depression, but it is still largely unclear how patients’ individual symptoms evolve dynamically during such treatments. Data-driven forecasts of depressive symptoms would make it possible to greatly improve the personalisation of treatments. In current forecasting approaches, models are often trained on an entire population, resulting in a general model that works overall, but does not translate well to each individual in clinically heterogeneous, real-world populations. Model fairness across patient subgroups is also frequently overlooked. Personalised models tailored to the individual patient may therefore be promising.
Methods
We investigate different personalisation strategies using transfer learning, subgroup models, as well as subject-dependent standardisation on a newly-collected, longitudinal dataset of depression patients undergoing treatment with a digital intervention (N=65 patients recruited). Both passive mobile sensor data as well as ecological momentary assessments were available for modelling. We evaluated the models’ ability to predict symptoms of depression (Patient Health Questionnaire-2; PHQ-2) at the end of each day, and to forecast symptoms of the next day.
Results
In our experiments, we achieve a best mean absolute error (MAE) of 0.801 (a 25% improvement) for predicting PHQ-2 values at the end of the day with subject-dependent standardisation, compared to a non-personalised baseline (MAE=1.062). For one-day-ahead forecasting, we can improve the baseline of 1.539 by 12% to an MAE of 1.349 using a transfer learning approach with shared common layers. In addition, personalisation leads to fairer models at group level.
Discussion
Our results suggest that personalisation using subject-dependent standardisation and transfer learning can improve predictions and forecasts, respectively, of depressive symptoms in participants of a digital depression intervention. We discuss technical and clinical limitations of this approach, avenues for future investigations, and how personalised machine learning architectures may be implemented to improve existing digital interventions for depression.
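The subject-dependent standardisation behind the best end-of-day result can be sketched as per-subject z-scoring: each feature is normalised with that subject's own mean and standard deviation rather than population statistics. Variable names and values below are illustrative, not from the study.

```python
# Subject-dependent standardisation: z-score features per subject, so
# individuals with very different baselines become comparable.
import math

def standardise_per_subject(values_by_subject):
    """Map subject id -> feature list to per-subject z-scores."""
    out = {}
    for subject, values in values_by_subject.items():
        n = len(values)
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n) or 1.0
        out[subject] = [(v - mean) / std for v in values]
    return out

# Two hypothetical subjects on very different raw scales:
raw = {"s1": [4.0, 6.0], "s2": [40.0, 60.0]}
print(standardise_per_subject(raw))  # both subjects map to the same scale
```

After this transform a model sees within-subject deviations rather than absolute levels, which is the signal relevant for detecting temporal mood changes.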
Toward Detecting and Addressing Corner Cases in Deep Learning Based Medical Image Segmentation
Translating machine learning research into clinical practice poses several challenges. In this paper, we identify some critical issues in translating research to clinical practice in the context of medical image segmentation and propose strategies to systematically address these challenges. Specifically, we focus on cases where the model yields erroneous segmentation, which we define as corner cases. One of the standard metrics used for reporting the performance of medical image segmentation algorithms is the average Dice score across all patients. We have discovered that this aggregate reporting has the inherent drawback that the corner cases where the algorithm or model has erroneous performance or very low metrics go unnoticed. Due to this reporting, models that report superior performance could end up producing completely erroneous results, or even anatomically impossible results in a few challenging cases, albeit without being noticed. We have demonstrated how corner cases go unnoticed using the Magnetic Resonance (MR) cardiac image segmentation task of the Automated Cardiac Diagnosis Challenge (ACDC). To counter this drawback, we propose a framework that helps to identify and report corner cases. Further, we propose a novel balanced checkpointing scheme capable of finding a solution that has superior performance even on these corner cases. Our proposed scheme leads to an improvement of 44.6% for LV, 46.1% for RV and 38.1% for the Myocardium on our identified corner case in the ACDC segmentation challenge. Further, we establish the generalisability of our proposed framework by also demonstrating its applicability in the context of chest X-ray lung segmentation. This framework has broader applications across multiple deep learning tasks even beyond medical image segmentation.
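The reporting drawback described above is easy to reproduce: a healthy average Dice score can coexist with a complete failure on one patient. A toy sketch with invented binary masks, reporting the per-patient minimum alongside the mean:

```python
# Average Dice can hide corner cases; reporting the worst patient surfaces
# them. Masks are flat binary lists for illustration.

def dice(pred, truth):
    intersection = sum(p and t for p, t in zip(pred, truth))
    return 2 * intersection / (sum(pred) + sum(truth))

patients = {
    "p1": ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect segmentation
    "p2": ([1, 1, 1, 0], [1, 1, 0, 0]),  # slight over-segmentation
    "p3": ([0, 0, 0, 1], [1, 1, 0, 0]),  # corner case: no overlap at all
}
scores = {pid: dice(p, t) for pid, (p, t) in patients.items()}
mean_dice = sum(scores.values()) / len(scores)
worst = min(scores, key=scores.get)
print(round(mean_dice, 2), worst, scores[worst])
```

The mean of 0.6 looks unremarkable, while the per-patient view exposes p3 as a total failure; this is the gap the proposed corner-case reporting framework is meant to close.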
Computational charisma—A brick by brick blueprint for building charismatic artificial intelligence
Charisma is considered one's ability to attract and potentially influence others. Clearly, there is considerable interest, from an artificial intelligence (AI) perspective, in providing machines with this skill. Beyond that, a plethora of use cases opens up for computational measurement of human charisma, such as tutoring humans in the acquisition of charisma, mediating human-to-human conversation, or identifying charismatic individuals in big social data. While charisma is a subject of research in its own right, a number of models exist that base it on various “pillars,” that is, dimensions, often following the idea that charisma is given if someone could and would help others. Examples of such pillars therefore include influence (could help) and affability (would help) in scientific studies, or power (could help), presence, and warmth (both would help) as a popular concept. Modeling high levels in these dimensions, i.e., high influence and high affability, or high power, presence, and warmth for charismatic AI of the future, e.g., for humanoid robots or virtual agents, seems accomplishable. Beyond that, automatic measurement also appears quite feasible with the recent advances in the related fields of Affective Computing and Social Signal Processing. Here, we therefore present a brick-by-brick blueprint for building machines that can appear charismatic, but also analyse the charisma of others. We first approach the topic very broadly and discuss how the foundation of charisma is defined from a psychological perspective. Throughout the manuscript, the building blocks (bricks) then become more specific and provide concrete groundwork for capturing charisma through AI. Following the introduction of the concept of charisma, we switch to charisma in spoken language as an exemplary modality that is essential for human-human and human-computer conversations.
The computational perspective then deals with the recognition and generation of charismatic behavior by AI. This includes an overview of the state of play in the field and the aforementioned blueprint. We then list exemplary use cases of computational charismatic skills. The building blocks of application domains and ethics conclude the article.
The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions, Cross-Cultural Humour, and Personalisation
MuSe 2023 is a set of shared tasks addressing three different
contemporary multimodal affect and sentiment analysis problems: In the Mimicked
Emotions Sub-Challenge (MuSe-Mimic), participants predict three continuous
emotion targets. This sub-challenge utilises the Hume-Vidmimic dataset
comprising user-generated videos. For the Cross-Cultural Humour Detection
Sub-Challenge (MuSe-Humour), an extension of the Passau Spontaneous Football
Coach Humour (Passau-SFCH) dataset is provided. Participants predict the
presence of spontaneous humour in a cross-cultural setting. The Personalisation
Sub-Challenge (MuSe-Personalisation) is based on the Ulm-Trier Social Stress
Test (Ulm-TSST) dataset, featuring recordings of subjects in a stressed
situation. Here, arousal and valence signals are to be predicted, and parts
of the test labels are made available in order to facilitate personalisation.
MuSe 2023 seeks to bring together a broad audience from different research
communities such as audio-visual emotion recognition, natural language
processing, signal processing, and health informatics. In this baseline paper,
we introduce the datasets, sub-challenges, and provided feature sets. As a
competitive baseline system, a Gated Recurrent Unit (GRU)-Recurrent Neural
Network (RNN) is employed. On the respective sub-challenges' test datasets, it
achieves a mean (across three continuous intensity targets) Pearson's
Correlation Coefficient of .4727 for MuSe-Mimic, an Area Under the Curve (AUC)
value of .8310 for MuSe-Humour, and Concordance Correlation Coefficient (CCC)
values of .7482 for arousal and .7827 for valence in the MuSe-Personalisation
sub-challenge.
Comment: Baseline paper for the 4th Multimodal Sentiment Analysis Challenge (MuSe) 2023, a workshop at ACM Multimedia 2023
HEAR4Health: a blueprint for making computer audition a staple of modern healthcare
Recent years have seen a rapid increase in digital medicine research in an attempt to transform traditional healthcare systems into their modern, intelligent, and versatile equivalents that are adequately equipped to tackle contemporary challenges. This has led to a wave of applications that utilise AI technologies; first and foremost in the fields of medical imaging, but also in the use of wearables and other intelligent sensors. In comparison, computer audition can be seen to be lagging behind, at least in terms of commercial interest. Yet, audition has long been a staple assistant for medical practitioners, with the stethoscope being the quintessential sign of doctors around the world. Transforming this traditional technology with the use of AI entails a set of unique challenges. We categorise the advances needed in four key pillars: Hear, corresponding to the cornerstone technologies needed to analyse auditory signals in real-life conditions; Earlier, for the advances needed in computational and data efficiency; Attentively, for accounting for individual differences and handling the longitudinal nature of medical data; and, finally, Responsibly, for ensuring compliance with the ethical standards accorded to the field of medicine.