Emotion recognition based on the energy distribution of plosive syllables
We usually encounter two problems during speech emotion recognition (SER): expression and perception problems, which vary considerably between speakers, languages, and sentence pronunciations. Finding a system that characterizes emotions while overcoming all these differences is therefore a promising prospect. With this in mind, we considered two emotional databases: the Moroccan Arabic dialect emotional database (MADED) and the Ryerson audio-visual database of emotional speech and song (RAVDESS), which differ notably in type (natural vs. acted) and language (Arabic vs. English). We propose a detection process based on 27 acoustic features extracted from consonant-vowel (CV) syllabic units (/ba/, /du/, /ki/, /ta/) common to both databases. We tested two classification strategies: multiclass (all emotions combined: joy, sadness, neutral, anger) and binary (neutral vs. others; positive emotions (joy) vs. negative emotions (sadness, anger); sadness vs. anger). These strategies were tested three times: i) on MADED, ii) on RAVDESS, iii) on MADED and RAVDESS combined. The proposed method gave better recognition accuracy for binary classification: rates average 78% for multiclass classification, 100% for neutral vs. others, 100% for the negative emotions (anger vs. sadness), and 96% for positive vs. negative emotions.
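A minimal sketch of the binary "neutral vs. others" strategy, assuming librosa for feature extraction and an SVM classifier; the feature set below (energy and MFCC statistics) only approximates the paper's 27 features, and the syllable boundaries and dataset loader are hypothetical.

import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def syllable_features(wav_path, start, end, sr=16000):
    """Energy and spectral statistics for one CV syllable.

    start/end are hypothetical syllable boundaries in seconds; the paper
    first segments /ba/, /du/, /ki/, /ta/ units, then extracts features.
    """
    y, _ = librosa.load(wav_path, sr=sr, offset=start, duration=end - start)
    rms = librosa.feature.rms(y=y)[0]                 # frame-level energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([
        [rms.mean(), rms.std(), rms.max()],           # energy distribution stats
        mfcc.mean(axis=1),                            # spectral envelope summary
    ])

# X: one feature vector per syllable; y: 0 = neutral, 1 = any other emotion.
# X, y = build_dataset(...)  # hypothetical loader for MADED/RAVDESS
# clf = SVC(kernel="rbf")
# print(cross_val_score(clf, X, y, cv=5).mean())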
ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture
This paper introduces ArtELingo, a new benchmark and dataset, designed to
encourage work on diversity across languages and cultures. Following ArtEmis, a
collection of 80k artworks from WikiArt with 0.45M emotion labels and
English-only captions, ArtELingo adds another 0.79M annotations in Arabic and
Chinese, plus 4.8K in Spanish to evaluate "cultural-transfer" performance. More
than 51K artworks have 5 annotations or more in 3 languages. This diversity
makes it possible to study similarities and differences across languages and
cultures. Further, we investigate captioning tasks and find that diversity improves
the performance of baseline models. ArtELingo is publicly available at
https://www.artelingo.org/ with standard splits and baseline models. We hope
our work will help ease future research on multilinguality and culturally-aware
AI.
Comment: 9 pages; accepted at EMNLP 22. For more details, see https://www.artelingo.org.
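As a rough illustration of the cross-lingual analyses the dataset enables, the following sketch computes per-language emotion distributions from a flat annotation table; the file name and column names (language, emotion, art_id) are assumptions about the release format, not the actual schema at https://www.artelingo.org/.

import pandas as pd

ann = pd.read_csv("artelingo_annotations.csv")  # hypothetical export

# Per-language emotion distribution: do annotators working in different
# languages assign different emotions to the same collection?
sizes = ann.groupby(["language", "emotion"]).size()
dist = (sizes / sizes.groupby(level=0).transform("sum")).unstack(fill_value=0)
print(dist.round(3))

# Artworks densely annotated in all three languages (the 51K subset
# with 5 or more annotations in 3 languages mentioned above).
counts = ann.groupby(["art_id", "language"]).size().unstack(fill_value=0)
dense = counts[(counts >= 5).sum(axis=1) >= 3]
print(len(dense), "artworks with 5+ annotations in 3 languages")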
Convolutional Neural Network Architectures for Gender and Emotion Detection from Speech and Speaker Diarization
This paper introduces three system architectures for speaker identification that aim to overcome the limitations of diarization and voice-based biometric systems. Diarization systems use unsupervised algorithms to segment audio data along utterance boundaries, but they do not identify individual speakers. Voice-based biometric systems, on the other hand, can only identify individuals in recordings with a single speaker. Identifying speakers in recordings of natural conversations is challenging, especially when emotional shifts alter voice characteristics and make gender identification difficult. To address this, the proposed architectures combine gender detection, emotion detection, and diarization at either the segment or the group level. The architectures were evaluated on two speech databases, VoxCeleb and RAVDESS (Ryerson audio-visual database of emotional speech and song). The findings reveal that the proposed approach yields better recognition results than the alternative strategy, despite the latter's real-time processing advantage. The proposed architectures effectively address the challenge of identifying multiple speakers in a conversation while accounting for emotional changes that affect speech. The data indicate that gender and emotion classification of the diarized segments achieves an accuracy of over 98 percent. These results suggest that the proposed speech-based approach can achieve highly accurate speaker identification.
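A structural sketch of the segment-level variant described above, assuming hypothetical diarize, gender_model, and emotion_model components; it shows only the control flow, not the paper's actual models trained on VoxCeleb and RAVDESS.

from dataclasses import dataclass

@dataclass
class Segment:
    start: float    # seconds
    end: float
    speaker: str    # diarization label, e.g. "spk0"

def identify_speakers(wav_path, diarize, gender_model, emotion_model):
    """Label each diarized segment with gender and emotion so that
    emotion-induced voice changes do not split one speaker into several."""
    results = []
    for seg in diarize(wav_path):        # unsupervised utterance segmentation
        gender = gender_model.predict(wav_path, seg.start, seg.end)
        emotion = emotion_model.predict(wav_path, seg.start, seg.end)
        results.append((seg.speaker, gender, emotion, seg.start, seg.end))
    return results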
Explaining (Sarcastic) Utterances to Enhance Affect Understanding in Multimodal Dialogues
Conversations emerge as the primary medium for exchanging ideas and
conceptions. From the listener's perspective, identifying various affective
qualities, such as sarcasm, humour, and emotions, is paramount for
comprehending the true connotation of the emitted utterance. However, one of
the major hurdles faced in learning these affect dimensions is the presence of
figurative language, viz. irony, metaphor, or sarcasm. We hypothesize that any
detection system supplied with an exhaustive and explicit representation of the
emitted utterance would improve the overall comprehension of the dialogue. To
this end, we explore the task of Sarcasm Explanation in Dialogues (SED), which aims
to unfold the hidden irony behind sarcastic utterances. We propose MOSES, a
deep neural network, which takes a multimodal (sarcastic) dialogue instance as
an input and generates a natural language sentence as its explanation.
Subsequently, we leverage the generated explanation for various natural
language understanding tasks in a conversational dialogue setup, such as
sarcasm detection, humour identification, and emotion recognition. Our
evaluation shows that MOSES outperforms the state-of-the-art system for SED by
an average of ~2% on different evaluation metrics, such as ROUGE, BLEU, and
METEOR. Further, we observe that leveraging the generated explanation advances
three downstream tasks for affect classification - an average improvement of
~14% F1-score in the sarcasm detection task and ~2% in the humour
identification and emotion recognition tasks. We also perform extensive analyses
to assess the quality of the results.
Comment: Accepted at AAAI 2023. 11 pages; 14 tables; 3 figures.
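The surface-overlap metrics mentioned above (ROUGE, BLEU, METEOR) can be computed, for instance, with the Hugging Face evaluate package; the snippet below is one possible setup, not the authors' evaluation script, and the example strings are invented.

import evaluate  # pip install evaluate

predictions = ["the speaker mocks the idea of arriving on time"]   # model output
references = ["the speaker is being sarcastic about punctuality"]  # gold explanation

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")  # fetches NLTK wordnet data on first use

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))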
An ongoing review of speech emotion recognition
User emotional status recognition is becoming a key feature in advanced human-computer interfaces (HCI). A key source of emotional information is spoken expression, which may be part of the interaction between the human and the machine. Speech emotion recognition (SER) is a very active area of research that involves the application of current machine learning and neural network tools. This ongoing review covers recent and classical approaches to SER reported in the literature. This work has been carried out with the support of project PID2020-116346GB-I00, funded by the Spanish MICIN.
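As a concrete example of the neural tools such reviews survey, here is a minimal CNN baseline over log-mel spectrograms; the architecture and sizes are arbitrary choices for illustration, not drawn from any particular surveyed system.

import torch
import torch.nn as nn

class SERConvNet(nn.Module):
    """Tiny CNN over log-mel spectrograms; sizes are illustrative only."""
    def __init__(self, n_emotions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),       # pool over time and frequency
        )
        self.classifier = nn.Linear(32, n_emotions)

    def forward(self, x):                  # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

model = SERConvNet()
dummy = torch.randn(8, 1, 64, 200)         # batch of 8 log-mel spectrograms
print(model(dummy).shape)                  # torch.Size([8, 4])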