Continuous Estimation of Emotions in Speech by Dynamic Cooperative Speaker Models
Automatic emotion recognition from speech has recently focused on the prediction of time-continuous dimensions (e.g., arousal and valence) of spontaneous and realistic expressions of emotion, as found in real-life interactions. However, the automatic prediction of such emotions poses several challenges, such as the subjectivity involved in defining a gold standard from a pool of raters and the scarcity of data for training models. In this work, we introduce a novel emotion recognition system based on an ensemble of single-speaker regression models (SSRMs). The emotion estimate is obtained by combining a subset of the initial pool of SSRMs, selecting those that are most concordant with one another. The proposed approach allows speakers to be added to or removed from the ensemble without rebuilding the entire machine learning system. The simplicity of this aggregation strategy, coupled with the flexibility of the modular architecture and the promising results obtained on the RECOLA database, highlights the potential of the proposed method in real-life scenarios, particularly in web-based applications.
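The selection step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how concordance is measured or how the selected models are combined, so everything here is an assumption made for illustration: Lin's concordance correlation coefficient (CCC) as the agreement measure, a fixed subset size `k`, plain averaging of the retained predictions, and the function names `ccc` and `select_and_combine`.

```python
from itertools import combinations


def ccc(x, y):
    """Lin's concordance correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = vx + vy + (mx - my) ** 2
    # denom is zero only when both sequences are the same constant value
    return 2 * cov / denom if denom else 1.0


def select_and_combine(predictions, k):
    """Keep the k models that agree most with the rest and average them.

    predictions: dict mapping a (hypothetical) model name to its list of
    per-frame predictions; all lists must have the same length.
    """
    names = list(predictions)
    score = {name: 0.0 for name in names}
    # Accumulate each model's summed pairwise CCC with every other model
    for a, b in combinations(names, 2):
        c = ccc(predictions[a], predictions[b])
        score[a] += c
        score[b] += c
    kept = sorted(names, key=lambda name: score[name], reverse=True)[:k]
    n_frames = len(predictions[kept[0]])
    fused = [sum(predictions[name][t] for name in kept) / len(kept)
             for t in range(n_frames)]
    return kept, fused


# Toy example: s3 disagrees with the other two models and is dropped
preds = {
    "s1": [0.1, 0.2, 0.3, 0.4],
    "s2": [0.1, 0.25, 0.3, 0.45],
    "s3": [0.4, 0.3, 0.2, 0.1],
}
kept, fused = select_and_combine(preds, k=2)
```

Because each model is scored only from its own predictions, adding or removing a speaker model changes the entries in `predictions` without retraining the others, which mirrors the modularity the abstract emphasises.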
Speech Emotion Diarization: Which Emotion Appears When?
Speech Emotion Recognition (SER) typically relies on utterance-level
solutions. However, emotions conveyed through speech should be considered as
discrete speech events with definite temporal boundaries, rather than
attributes of the entire utterance. To reflect the fine-grained nature of
speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just
as Speaker Diarization answers the question of "Who speaks when?", Speech
Emotion Diarization answers the question of "Which emotion appears when?". To
facilitate the evaluation of the performance and establish a common benchmark
for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly
accessible speech emotion dataset that includes non-acted emotions recorded in
real-life conditions, along with manually-annotated boundaries of emotion
segments within the utterance. We provide competitive baselines and open-source
the code and the pre-trained models.
Automatic Speech-Based Emotion Recognition
The main objective of affective computing is the study and creation of computer systems that can detect human affects. For speech-based emotion recognition, universal features offering the best performance for all languages have not yet been found. In this thesis, a speech-based emotion recognition system using a novel set of features is created. Support vector machines are used as classifiers in the offline system on the Surrey Audio-Visual Expressed Emotion database, the Berlin Database of Emotional Speech, the Polish Emotional Speech database and the Serbian emotional speech database. Average emotion recognition rates of 80.21%, 88.6%, 75.42% and 93.41% are achieved, respectively, using a total of 87 features. The online system, which uses Random Forests as its classifier, consists of two models trained on reduced versions of the first and second databases, with the first model trained on only male samples and the second trained on both male and female samples. The main purpose of the online system was to test the features' usability in real-life scenarios and to explore the effects of gender in speech-based emotion recognition. To test the online system, two female and two male non-native English speakers recorded emotionally spoken sentences, which were used as inputs to the trained models. Averaging over all emotions and speakers per model, the features offer better performance than random guessing, achieving 28% emotion recognition in both models. The average recognition rate for female speakers was 19% in the first model and 29% in the second. For male speakers, the rates were 36% and 28%, respectively. These results show how having more training samples for a particular gender affects emotion recognition rates in a trained model.
Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations
Emotion recognition in conversations is essential for ensuring advanced
human-machine interactions. However, creating robust and accurate emotion
recognition systems in real life is challenging, mainly due to the scarcity of
emotion datasets collected in the wild and the inability to take into account
the dialogue context. The CEMO dataset, composed of conversations between
agents and patients during emergency calls to a French call center, fills this
gap. The nature of these interactions highlights the role of the emotional flow
of the conversation in predicting patient emotions, as context can often make a
difference in understanding actual feelings. This paper presents a multi-scale
conversational context learning approach for speech emotion recognition, which
takes advantage of this hypothesis. We investigated this approach on both
speech transcriptions and acoustic segments. Experimentally, our method uses
the previous or next information of the targeted segment. In the text domain,
we tested the context window using a wide range of tokens (from 10 to 100) and
at the speech turns level, considering inputs from both the same and opposing
speakers. According to our tests, the context derived from previous tokens has
a more significant influence on accurate prediction than the following tokens.
Furthermore, taking the last speech turn of the same speaker in the
conversation seems useful. In the acoustic domain, we conducted an in-depth
analysis of the impact of the surrounding emotions on the prediction. While
multi-scale conversational context learning using Transformers can enhance
performance in the textual modality for emergency call recordings,
incorporating acoustic context is more challenging.
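The two kinds of textual context mentioned in this abstract (a window of previous or next tokens, and the last speech turn of the same speaker) can be sketched as follows. This is only an illustration under stated assumptions: the conversation is represented as a list of `(speaker, text)` pairs, tokens are obtained by whitespace splitting (the abstract does not specify the tokenizer), and the function names `context_tokens` and `last_same_speaker_turn` are hypothetical. The example dialogue is invented and is not taken from the CEMO dataset.

```python
def context_tokens(turns, target, n_tokens, side="previous"):
    """Return up to n_tokens tokens before or after the target turn.

    turns: list of (speaker, text) pairs in conversation order.
    target: index of the turn whose emotion is to be predicted.
    """
    if n_tokens <= 0:
        return []
    if side == "previous":
        tokens = [tok for _, text in turns[:target] for tok in text.split()]
        return tokens[-n_tokens:]  # the tokens closest to the target turn
    if side == "next":
        tokens = [tok for _, text in turns[target + 1:] for tok in text.split()]
        return tokens[:n_tokens]
    raise ValueError("side must be 'previous' or 'next'")


def last_same_speaker_turn(turns, target):
    """Return the text of the most recent earlier turn by the same speaker."""
    speaker = turns[target][0]
    for i in range(target - 1, -1, -1):
        if turns[i][0] == speaker:
            return turns[i][1]
    return None  # the speaker has no earlier turn


# Invented example dialogue
dialogue = [
    ("agent", "hello how can I help"),
    ("caller", "my father fell down"),
    ("agent", "is he breathing"),
    ("caller", "yes but he is in pain"),
]
prev_ctx = context_tokens(dialogue, target=3, n_tokens=5, side="previous")
next_ctx = context_tokens(dialogue, target=1, n_tokens=2, side="next")
same_spk = last_same_speaker_turn(dialogue, target=3)
```

In a setup like the one described, the returned context would be concatenated with the target segment before being passed to the text model; that step is omitted here because the abstract gives no detail on it.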
Computational Intelligence and Human–Computer Interaction: Modern Methods and Applications
The present book contains all of the articles that were accepted and published in the Special Issue of MDPI’s journal Mathematics titled "Computational Intelligence and Human–Computer Interaction: Modern Methods and Applications". This Special Issue covered a wide range of topics connected to the theory and application of different computational intelligence techniques to the domain of human–computer interaction, such as automatic speech recognition, speech processing and analysis, virtual reality, emotion-aware applications, digital storytelling, natural language processing, smart cars and devices, and online learning. We hope that this book will be interesting and useful for those working in various areas of artificial intelligence, human–computer interaction, and software engineering as well as for those who are interested in how these domains are connected in real-life situations
The Perception of Emotion from Acoustic Cues in Natural Speech
Knowledge of human perception of emotional speech is imperative for the development of emotion in speech recognition systems and emotional speech synthesis. Owing to the fact that there is a growing trend towards research on spontaneous, real-life data, the aim of the present thesis is to examine human perception of emotion in naturalistic speech. Although there are many available emotional speech corpora, most contain simulated expressions. Therefore, there remains a compelling need to obtain naturalistic speech corpora that are appropriate and freely available for research. In that regard, our initial aim was to acquire suitable naturalistic material and examine its emotional content based on listener perceptions. A web-based listening tool was developed to accumulate ratings based on large-scale listening groups. The emotional content present in the speech material was demonstrated by performing perception tests on conveyed levels of Activation and Evaluation. As a result, labels were determined that signified the emotional content, and thus contribute to the construction of a naturalistic emotional speech corpus. In line with the literature, the ratings obtained from the perception tests suggested that Evaluation (or hedonic valence) is not identified as reliably as Activation is. Emotional valence can be conveyed through both semantic and prosodic information, for which the meaning of one may serve to facilitate, modify, or conflict with the meaning of the other—particularly with naturalistic speech. The subsequent experiments aimed to investigate this concept by comparing ratings from perception tests of non-verbal speech with verbal speech. The method used to render non-verbal speech was low-pass filtering, and for this, suitable filtering conditions were determined by carrying out preliminary perception tests. The results suggested that nonverbal naturalistic speech provides sufficiently discernible levels of Activation and Evaluation. 
It appears that the perception of Activation and Evaluation is affected by low-pass filtering, but that the effect is relatively small. Moreover, the results suggest that there is a similar trend in agreement levels between verbal and non-verbal speech. To date, it remains difficult to determine unique acoustical patterns for the hedonic valence of emotion, which may be due to inadequate labels or the incorrect selection of acoustic parameters. This study has implications for the labelling of emotional speech data and the determination of salient acoustic correlates of emotion.
Analysis and automatic identification of spontaneous emotions in speech from human-human and human-machine communication
383 p. This research mainly focuses on improving our understanding of human-human and human-machine interactions by analysing participants' emotional status. For this purpose, we have developed and enhanced Speech Emotion Recognition (SER) systems for both interactions in real-life scenarios, explicitly emphasising the Spanish language. In this framework, we have conducted an in-depth analysis of how humans express emotions using speech when communicating with other persons or machines in actual situations. Thus, we have analysed and studied the way in which emotional information is expressed in a variety of true-to-life environments, which is a crucial aspect for the development of SER systems. This study aimed to comprehensively understand the challenge we wanted to address: identifying emotional information in speech using machine learning technologies. Neural networks have been demonstrated to be adequate tools for identifying events in speech and language. Most of them aimed to make local comparisons between some specific aspects; thus, the experimental conditions were tailored to each particular analysis. The experiments across different articles (from P1 to P19) are hardly comparable due to our continuous learning of dealing with the difficult task of identifying emotions in speech. In order to make a fair comparison, additional unpublished results are presented in the Appendix. These experiments were carried out under identical and rigorous conditions. This general comparison offers an overview of the advantages and disadvantages of the different methodologies for the automatic recognition of emotions in speech.