Search CORE

5,220 research outputs found

Speech Synthesis Based on Hidden Markov Models

Author: Nankaku Y.
Oura K.
Toda T.
Tokuda K.
Yamagishi J.
Zen H.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2013
Field of study

Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives

Author: Cummins Nicholas
Han Jing
Schuller Björn
Zhang Zixing
Publication venue
Publication date: 21/09/2018
Field of study

Over the past few years, adversarial training has become an extremely active research topic and has been successfully applied to various Artificial Intelligence (AI) domains. As a potentially crucial technique for the development of the next generation of emotional AI systems, we herein provide a comprehensive overview of the application of adversarial training to affective computing and sentiment analysis. Various representative adversarial training algorithms are explained and discussed accordingly, aimed at tackling diverse challenges associated with emotional AI systems. Further, we highlight a range of potential future research directions. We expect that this overview will help facilitate the development of adversarial training for affective computing and sentiment analysis in both the academic and industrial communities

arXiv.org e-Print Archive

OPUS Augsburg

A dataset of continuous affect annotations and physiological signals for emotion analysis

Author: Albu-Schaeffer Alin
Broek Egon L. van den
Castellini Claudio
Schwenker Friedhelm
Sharma Karan
Publication venue
Publication date: 06/12/2018
Field of study

From a computational viewpoint, emotions continue to be intriguingly hard to understand. In research, direct, real-time inspection in realistic settings is not possible. Discrete, indirect, post-hoc recordings are therefore the norm. As a result, proper emotion assessment remains a problematic issue. The Continuously Annotated Signals of Emotion (CASE) dataset provides a solution as it focusses on real-time continuous annotation of emotions, as experienced by the participants, while watching various videos. For this purpose, a novel, intuitive joystick-based annotation interface was developed, that allowed for simultaneous reporting of valence and arousal, that are instead often annotated independently. In parallel, eight high quality, synchronized physiological recordings (1000 Hz, 16-bit ADC) were made of ECG, BVP, EMG (3x), GSR (or EDA), respiration and skin temperature. The dataset consists of the physiological and annotation data from 30 participants, 15 male and 15 female, who watched several validated video-stimuli. The validity of the emotion induction, as exemplified by the annotation and physiological data, is also presented.Comment: Dataset available at: https://rmc.dlr.de/download/CASE_dataset/CASE_dataset.zi

arXiv.org e-Print Archive

Institute of Transport Research:Publications

Utrecht University Repository

Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Author: Li Taihao
Pekarek-Rosin Theresa
Qu Leyuan
Ren Fuji
Weber Cornelius
Wermter Stefan
Publication venue
Publication date: 25/09/2023
Field of study

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks respectively. However, it is still an open challenging research question to extract prosodic information because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial for the performance of widely used speech pretraining models and surpass the state-of-the-art methods when combining Prosody2Vec with HuBERT representations.Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processin

arXiv.org e-Print Archive

A Dual-Modality Emotion Recognition System of EEG and Facial Images and its Application in Educational Scene

Author: Li Ruijin
Publication venue
Publication date: 22/12/2022
Field of study

With the development of computer science, people's interactions with computers or through computers have become more frequent. Some human-computer interactions or human-to-human interactions that are often seen in daily life: online chat, online banking services, facial recognition functions, etc. Only through text messaging, however, can the effect of information transfer be reduced to around 30% of the original. Communication becomes truly efficient when we can see one other's reactions and feel each other's emotions. This issue is especially noticeable in the educational field. Offline teaching is a classic teaching style in which teachers may determine a student's present emotional state based on their expressions and alter teaching methods accordingly. With the advancement of computers and the impact of Covid-19, an increasing number of schools and educational institutions are exploring employing online or video-based instruction. In such circumstances, it is difficult for teachers to get feedback from students. Therefore, an emotion recognition method is proposed in this thesis that can be used for educational scenarios, which can help teachers quantify the emotional state of students in class and be used to guide teachers in exploring or adjusting teaching methods. Text, physiological signals, gestures, facial photographs, and other data types are commonly used for emotion recognition. Data collection for facial images emotion recognition is particularly convenient and fast among them, although there is a problem that people may subjectively conceal true emotions, resulting in inaccurate recognition results. Emotion recognition based on EEG waves can compensate for this drawback. Taking into account the aforementioned issues, this thesis first employs the SVM-PCA to classify emotions in EEG data, then employs the deep-CNN to classify the emotions of the subject's facial images. Finally, the D-S evidence theory is used for fusing and analyzing the two classification results and obtains the final emotion recognition accuracy of 92%. The specific research content of this thesis is as follows: 1) The background of emotion recognition systems used in teaching scenarios is discussed, as well as the use of various single modality systems for emotion recognition. 2) Detailed analysis of EEG emotion recognition based on SVM. The theory of EEG signal generation, frequency band characteristics, and emotional dimensions is introduced. The EEG signal is first filtered and processed with artifact removal. The processed EEG signal is then used for feature extraction using wavelet transforms. It is finally fed into the proposed SVM-PCA for emotion recognition and the accuracy is 64%. 3) Using the proposed deep-CNN to recognize emotions in facial images. Firstly, the Adaboost algorithm is used to detect and intercept the face area in the image, and the gray level balance is performed on the captured image. Then the preprocessed images are trained and tested using the deep-CNN, and the average accuracy is 88%. 4) Fusion method based on decision-making layer. The data fusion at the decision level is carried out with the results of EEG emotion recognition and facial expression emotion recognition. The final dual-modality emotion recognition results and system accuracy of 92% are obtained using D-S evidence theory. 5) The dual-modality emotion recognition system's data collection approach is designed. Based on the process, the actual data in the educational scene is collected and analyzed. The final accuracy of the dual-modality system is 82%. Teachers can use the emotion recognition results as a guide and reference to improve their teaching efficacy

UTUPub