Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations
Emotion recognition in conversations is essential for ensuring advanced
human-machine interactions. However, creating robust and accurate emotion
recognition systems in real life is challenging, mainly due to the scarcity of
emotion datasets collected in the wild and the inability to take into account
the dialogue context. The CEMO dataset, composed of conversations between
agents and patients during emergency calls to a French call center, fills this
gap. The nature of these interactions highlights the role of the emotional flow
of the conversation in predicting patient emotions, as context can often make a
difference in understanding actual feelings. This paper presents a multi-scale
conversational context learning approach for speech emotion recognition, which
takes advantage of this hypothesis. We investigated this approach on both
speech transcriptions and acoustic segments. Experimentally, our method uses
information preceding or following the targeted segment. In the text domain,
we tested the context window using a wide range of tokens (from 10 to 100) and
at the speech turns level, considering inputs from both the same and opposing
speakers. According to our tests, the context derived from previous tokens has
a more significant influence on accurate prediction than the following tokens.
Furthermore, taking the last speech turn of the same speaker in the
conversation seems useful. In the acoustic domain, we conducted an in-depth
analysis of the impact of the surrounding emotions on the prediction. While
multi-scale conversational context learning using Transformers can enhance
performance in the textual modality for emergency call recordings,
incorporating acoustic context remains more challenging.
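The token-level context windows described above (10 to 100 previous tokens around a target segment) can be sketched as follows; the function name, the turn-as-token-list representation, and all parameters are illustrative assumptions, not the paper's implementation.

```python
# Sketch: gather a left-context window of up to `max_tokens` tokens
# immediately preceding the target speech turn, as in the text-domain
# context experiments. Purely illustrative.

def left_context_window(turns, target_idx, max_tokens):
    """turns: list of token lists, one per speech turn.
    Returns up to max_tokens tokens preceding turn `target_idx`,
    in their original order."""
    context = []
    # Walk backwards through earlier turns, collecting tokens
    # nearest the target first.
    for turn in reversed(turns[:target_idx]):
        for tok in reversed(turn):
            if len(context) == max_tokens:
                return list(reversed(context))
            context.append(tok)
    return list(reversed(context))
```

A symmetric right-context variant (following tokens) would walk forward from `target_idx + 1` instead.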
Removing Bias with Residual Mixture of Multi-View Attention for Speech Emotion Recognition
Speech emotion recognition is essential for obtaining emotional intelligence, which affects the understanding of context and meaning of speech. The fundamental challenge of speech emotion recognition from a machine learning standpoint is to extract patterns that carry maximum correlation with the emotion information encoded in the signal while being as insensitive as possible to other types of information carried by speech. In this paper, a novel recurrent residual temporal context modelling framework is proposed. The framework includes a mixture of multi-view attention smoothing and high-dimensional feature projection for context expansion and learning feature representations. The framework is designed to be robust to changes in speaker and other distortions, and it provides state-of-the-art results for speech emotion recognition. Performance of the proposed approach is compared with a wide range of current architectures in a standard 4-class classification task on the widely used IEMOCAP corpus. A significant improvement of 4% unweighted accuracy over state-of-the-art systems is observed. Additionally, the attention vectors have been aligned with the input segments and plotted at two different attention levels to demonstrate the attention mechanism's effectiveness.
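The core of any attention-over-time view like the one above is a softmax-weighted pooling of frame features; a minimal single-view sketch, with all names and the single learned query vector being assumptions rather than the paper's architecture:

```python
import numpy as np

def attention_pool(frames, w):
    """frames: (T, d) frame-level features; w: (d,) attention query.
    Returns (pooled (d,), weights (T,)): a softmax-attention summary
    over time. One 'view' of a multi-view attention mixture would
    use one such query per view."""
    scores = frames @ w                # (T,) relevance of each frame
    scores -= scores.max()             # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()           # softmax over time steps
    pooled = weights @ frames          # weighted average of frames
    return pooled, weights
```

Plotting `weights` against the input segments is what produces attention-alignment visualisations like those mentioned in the abstract.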
Learning temporal clusters using capsule routing for speech emotion recognition
Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. One of the key challenges from a machine learning standpoint is to extract patterns that bear maximum correlation with the emotion information encoded in the signal while being as insensitive as possible to other types of information carried by speech. In this paper, we propose a novel temporal modelling framework for robust emotion classification using a bidirectional long short-term memory network (BLSTM), a CNN, and Capsule networks. The BLSTM deals with the temporal dynamics of the speech signal by effectively representing forward/backward contextual information, while the CNN, along with the dynamic routing of the Capsule net, learns temporal clusters which altogether provide a state-of-the-art technique for classifying the extracted patterns. The proposed approach was compared with a wide range of architectures on the FAU-Aibo and RAVDESS corpora, and remarkable gains over state-of-the-art systems were obtained. For FAU-Aibo and RAVDESS, 77.6% and 56.2% accuracy was achieved, respectively, which is 3% and 14% (absolute) higher than the best-reported result for the respective tasks.
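The dynamic routing ("routing-by-agreement") step that the Capsule net uses to form clusters can be sketched in a few lines; this is a generic numpy rendering of Sabour et al.'s routing procedure, not the paper's own code, and all shapes are illustrative.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule non-linearity: scales vector norm into [0, 1)."""
    n2 = (v ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    """u_hat: (num_in, num_out, d) prediction vectors from lower capsules.
    Returns (num_out, d) output capsules after routing-by-agreement."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))              # routing logits
    for _ in range(iters):
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)        # softmax over output caps
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum -> (num_out, d)
        v = squash(s)
        b += (u_hat * v[None]).sum(axis=-1)      # raise logits where
                                                 # prediction agrees with output
    return v
```

The agreement update is what makes lower-level (here, temporal) features cluster toward the output capsules they predict consistently.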
The influence of pleasant and unpleasant odours on the acoustics of speech
Olfaction, i.e., the sense of smell, is referred to as the ‘emotional sense’, as it has been shown to elicit affective responses. Yet, its influence on speech production has not been investigated. In this paper, we introduce a novel speech-based smell recognition approach, drawing from the fields of speech emotion recognition and personalised machine learning. In particular, we collected a corpus of 40 female speakers reading 2 short stories while either no scent, an unpleasant odour (fish), or a pleasant odour (peach) is applied through a nose clip. Further, we present a machine learning pipeline for the extraction of data representations, model training, and personalisation of the trained models. In a leave-one-speaker-out cross-validation, our best models trained on state-of-the-art wav2vec features achieve a classification rate of 68% when distinguishing between speech produced under the influence of negative scent and no applied scent. In addition, we highlight the importance of personalisation approaches, showing that a speaker-based feature normalisation substantially improves performance across the evaluated experiments. In summary, the presented results indicate that odours have a weak but measurable effect on the acoustics of speech.
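Speaker-based feature normalisation of the kind credited with the improvement above is typically a per-speaker z-score; a minimal sketch, with function and variable names being assumptions rather than the paper's pipeline:

```python
import numpy as np

def speaker_normalise(features, speakers, eps=1e-8):
    """features: (N, d) feature matrix; speakers: length-N speaker labels.
    Z-normalises each feature dimension within each speaker, so that
    per-speaker offsets (voice, recording conditions) are removed."""
    out = np.empty_like(features, dtype=float)
    for spk in set(speakers):
        idx = [i for i, s in enumerate(speakers) if s == spk]
        block = features[idx]
        out[idx] = (block - block.mean(axis=0)) / (block.std(axis=0) + eps)
    return out
```

In a leave-one-speaker-out setting, the held-out speaker's statistics are computed only from that speaker's own (unlabelled) features, so no emotion labels leak across the split.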
Speech emotion recognition using spectrogram based neural structured learning
Human emotions are crucial in our daily life. Emotion analysis based solely on auditory data is difficult due to the absence of visual information from human faces. Thus, a unique emotion recognition system based on robust characteristics and machine learning from audio speech is reported in this paper. Audio data are used as input to the person-independent emotion recognition system, from which spectrogram values are extracted as features. The generated features are then used to train and understand the emotions via Neural Structured Learning (NSL), a fast and accurate deep learning approach. In studies on an emotion dataset of audio speech, the proposed approach of integrating spectrograms and NSL produced improved recognition rates compared to other known models. The system can be used in smart environments such as homes or clinics to provide effective healthcare, music recommendations, customer support, and marketing, among several other things. As a result, rather than processing data and making judgments from distant data sources, the decision-making can be made closer to where the data lives. The Toronto Emotional Speech Set (TESS) dataset, which contains 7 emotions, has been used for this research. The algorithm is successfully tested on the dataset with an accuracy of ~97%.
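The spectrogram features described above are standard short-time Fourier magnitudes; a minimal sketch (frame length, hop size, and windowing are generic choices, not the paper's exact settings):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> |rFFT|.
    signal: 1-D audio samples.
    Returns an array of shape (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)
```

Each row is one time frame; each column is one frequency bin, which is the 2-D "image" a downstream model (here, NSL) consumes.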
Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition
Speech Emotion Recognition (SER) plays a pivotal role in enhancing
human-computer interaction by enabling a deeper understanding of emotional
states across a wide range of applications, contributing to more empathetic and
effective communication. This study proposes an innovative approach that
integrates self-supervised feature extraction with supervised classification
for emotion recognition from small audio segments. In the preprocessing step,
to eliminate the need for hand-crafted audio features, we employed a
self-supervised feature extractor, based on the Wav2Vec model, to capture
acoustic features from audio data. Then, the output feature maps of the
preprocessing step are fed to a custom-designed Convolutional Neural Network
(CNN)-based model to perform
emotion classification. Utilizing the ShEMO dataset as our testing ground, the
proposed method surpasses two baseline methods, i.e., a support vector machine
classifier and transfer learning of a pretrained CNN. Comparing the proposed
method to state-of-the-art methods on the SER task indicates the superiority of
the proposed method. Our findings underscore the pivotal role of deep
unsupervised feature learning in elevating the landscape of SER, offering
enhanced emotional comprehension in the realm of human-computer interactions.
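The overall pipeline shape (frame-level self-supervised features, then a classifier over a segment) can be sketched abstractly; here the extractor output is stood in for by a feature matrix and the classifier by a single linear layer, so everything below is an assumption about the general pattern, not the paper's CNN:

```python
import numpy as np

def classify_segment(frame_features, W, b):
    """frame_features: (T, d) frame-level embeddings, e.g. the output
    feature maps of a self-supervised extractor such as Wav2Vec.
    W: (num_classes, d) classifier weights; b: (num_classes,) biases.
    Mean-pools over time, then scores each emotion class."""
    pooled = frame_features.mean(axis=0)   # collapse time -> (d,)
    scores = W @ pooled + b                # one score per emotion
    return int(np.argmax(scores)), scores
```

In the paper's setting the pooling-plus-linear step is replaced by a custom CNN over the feature maps, but the interface (frame features in, class scores out) is the same.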
Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks
Speech emotion recognition (SER) is an important part of the affective computing and signal processing research areas. A number of approaches, especially deep learning techniques, have achieved promising results on SER. However, there are still challenges in translating temporal and dynamic changes in emotions through speech. Spiking Neural Networks (SNN) have been demonstrated to be a promising approach in machine learning and pattern recognition tasks such as handwriting and facial expression recognition. In this paper, we investigate the use of SNNs for SER tasks and, more importantly, we propose a new cross-modal enhancement approach. This method is inspired by auditory information processing in the brain, where auditory information is preceded, enhanced, and predicted by visual processing in multisensory audio-visual processing. We have conducted experiments on two datasets to compare our approach with the state-of-the-art SER techniques in both uni-modal and multi-modal aspects. The results have demonstrated that SNNs can be an ideal candidate for modeling temporal relationships in speech features and our cross-modal approach can significantly improve the accuracy of SER.
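The temporal modelling capacity of SNNs comes from neurons like the leaky integrate-and-fire unit; a minimal single-neuron sketch (parameters and names are generic textbook choices, not the paper's network):

```python
import numpy as np

def lif_spikes(inputs, tau=10.0, threshold=1.0):
    """Leaky integrate-and-fire neuron. The membrane potential decays
    by exp(-1/tau) each time step, accumulates the input current, and
    emits a spike (then resets) when it crosses `threshold`.
    Returns a 0/1 spike train the same length as `inputs`."""
    decay = np.exp(-1.0 / tau)
    v = 0.0
    spikes = []
    for x in inputs:
        v = v * decay + x          # leak, then integrate input
        if v >= threshold:
            spikes.append(1)       # fire
            v = 0.0                # reset membrane potential
        else:
            spikes.append(0)
    return spikes
```

Because the membrane potential carries a decaying memory of past inputs, spike timing naturally encodes the kind of temporal dynamics in speech features that the abstract highlights.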