40 research outputs found

    Improving Noise Robustness In Speaker Identification Using A Two-Stage Attention Model

    While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments. To improve the robustness of speaker recognition systems in noise, a novel two-stage attention mechanism is proposed that can be used in existing architectures such as Time Delay Neural Networks (TDNNs) and Convolutional Neural Networks (CNNs). Noise often masks important information in both the time and frequency domains; the proposed mechanism allows the models to concentrate on reliable time/frequency components of the signal. The approach is evaluated on the VoxCeleb1 dataset, which is designed for assessing speaker recognition in real-world conditions. In addition, three types of noise at different signal-to-noise ratios (SNRs) were added for this work. The proposed mechanism is compared with three strong baselines: X-vectors, Attentive X-vector, and ResNet-34. Results on both identification and verification tasks show that the two-stage attention mechanism consistently improves upon these baselines under all noise conditions. Comment: Submitted to Interspeech202
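
    To make the two-stage idea concrete, below is a minimal PyTorch sketch of attention applied first over frequency bins and then over time frames of a feature map; the 1x1-convolution scoring functions and all dimensions are illustrative assumptions, not the architecture from the paper.

        # Minimal sketch of a two-stage time/frequency attention block
        # (an illustration of the general idea, not the paper's exact design).
        import torch
        import torch.nn as nn

        class TimeFreqAttention(nn.Module):
            def __init__(self, channels):
                super().__init__()
                # 1x1 convolutions that score each frequency bin / time frame
                self.freq_score = nn.Conv2d(channels, 1, kernel_size=1)
                self.time_score = nn.Conv2d(channels, 1, kernel_size=1)

            def forward(self, x):                  # x: (batch, channels, freq, time)
                # Stage 1: emphasize reliable frequency bins (softmax over the freq axis)
                f_w = torch.softmax(self.freq_score(x).mean(dim=3, keepdim=True), dim=2)
                x = x * f_w
                # Stage 2: emphasize reliable time frames (softmax over the time axis)
                t_w = torch.softmax(self.time_score(x).mean(dim=2, keepdim=True), dim=3)
                return x * t_w

        feats = torch.randn(8, 64, 40, 200)        # e.g. 64 channels, 40 mel bins, 200 frames
        out = TimeFreqAttention(64)(feats)         # same shape, reweighted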

    Addressing Ambiguity of Emotion Labels Through Meta-Learning

    Emotion labels in emotion recognition corpora are highly noisy and ambiguous, due to the annotators' subjective perception of emotions. Such ambiguity may introduce errors in automatic classification and affect overall performance. We therefore propose a dynamic label correction and sample contribution weight estimation model. Our model is based on a standard BLSTM model with attention, plus two extra parameters. The first learns a corrected label distribution and aims to fix inaccurate labels in the dataset. The second estimates the contribution of each sample to the training process and aims to ignore ambiguous and noisy samples while giving higher weight to clear ones. We train our model with an alternating optimization method, where in one epoch we update the neural network parameters and in the next we keep them fixed while updating the label correction and sample importance parameters. When training and evaluating our model on the IEMOCAP dataset, we obtained a weighted accuracy (WA) of 65.9% and an unweighted accuracy (UA) of 61.4%, absolute improvements of 2.5% and 2.7%, respectively, over a BLSTM-with-attention baseline trained on the corpus gold labels. Comment: Submitted to ICASSP 202
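
    The alternating scheme can be illustrated with a short PyTorch sketch in which per-sample label-correction logits and importance weights are updated in epochs that alternate with the network updates; the loss form, the mixing of corrected and original labels, and the tensor names are assumptions made for illustration only.

        # Minimal sketch of alternating optimization between the network and
        # per-sample label-correction / importance parameters (hypothetical formulation).
        import torch
        import torch.nn as nn

        n_samples, n_classes = 1000, 4
        X = torch.randn(n_samples, 40)                                        # stand-in acoustic features
        y = torch.eye(n_classes)[torch.randint(0, n_classes, (n_samples,))]   # noisy one-hot labels
        loader = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(torch.arange(n_samples), X, y), batch_size=32)

        model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, n_classes))
        label_logits = nn.Parameter(torch.zeros(n_samples, n_classes))   # learned corrected label distribution
        sample_logits = nn.Parameter(torch.zeros(n_samples))             # learned per-sample contribution weight

        opt_net = torch.optim.Adam(model.parameters(), lr=1e-3)
        opt_meta = torch.optim.Adam([label_logits, sample_logits], lr=1e-2)

        for epoch in range(10):
            net_turn = epoch % 2 == 0                  # alternate which parameter group is updated
            for idx, xb, yb in loader:
                target = 0.5 * yb + 0.5 * torch.softmax(label_logits[idx], dim=-1)   # corrected labels
                weight = torch.sigmoid(sample_logits[idx])                           # sample importance
                logp = torch.log_softmax(model(xb), dim=-1)
                loss = (weight * -(target * logp).sum(dim=-1)).mean()
                opt_net.zero_grad(); opt_meta.zero_grad()
                loss.backward()
                (opt_net if net_turn else opt_meta).step()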

    EigenEmo: Spectral Utterance Representation Using Dynamic Mode Decomposition for Speech Emotion Classification

    Human emotional speech is, by its very nature, a variant signal, which introduces dynamics intrinsic to automatic emotion classification from speech. In this work, we explore a spectral decomposition method stemming from fluid dynamics, known as Dynamic Mode Decomposition (DMD), to computationally represent and analyze the global utterance-level dynamics of emotional speech. Specifically, segment-level emotion-specific representations are first learned through an Emotion Distillation process. This forms a multi-dimensional signal of emotion flow for each utterance, called Emotion Profiles (EPs). The DMD algorithm is then applied to the resulting EPs to capture the eigenfrequencies, and hence the fundamental transition dynamics, of the emotion flow. Evaluation experiments using the proposed approach, which we call EigenEmo, show promising results. Moreover, owing to their complementary properties, concatenating the utterance representations generated by EigenEmo with simple EP averaging yields noticeable gains.
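
    As a reference point for the decomposition step, here is a compact NumPy sketch of exact DMD applied to an emotion-profile matrix; it is a generic textbook DMD, not the full EigenEmo pipeline, and the rank and signal sizes are made up.

        # Minimal sketch of exact Dynamic Mode Decomposition (DMD) on an
        # emotion-profile matrix (dimensions x segments).
        import numpy as np

        def dmd_eigs(ep, rank=4):
            """ep: (n_dims, n_segments) emotion-flow signal; returns DMD eigenvalues."""
            X, Y = ep[:, :-1], ep[:, 1:]                     # snapshot pairs x_t -> x_{t+1}
            U, s, Vh = np.linalg.svd(X, full_matrices=False)
            r = min(rank, len(s))
            U, s, V = U[:, :r], s[:r], Vh[:r].conj().T
            A_tilde = U.conj().T @ Y @ V @ np.diag(1.0 / s)  # reduced linear operator
            return np.linalg.eigvals(A_tilde)

        ep = np.abs(np.random.randn(4, 50))                  # e.g. 4 emotion dimensions, 50 segments
        eigvals = dmd_eigs(ep)
        eigenfreqs = np.angle(eigvals)                       # phase of each mode ~ its oscillation frequency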

    Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning

    We propose a novel transfer learning method for speech emotion recognition that yields promising results when only little training data is available. With as few as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on eight times more data. Our method leverages knowledge contained in pre-trained speech representations extracted from models trained on a more general self-supervised task that does not require human annotations, such as the wav2vec model. We provide detailed insights into the benefits of our approach by varying the training data size, which can help labeling teams work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a well-benchmarked corpus in the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning: we align acoustic pre-trained representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when combining both modalities and scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state of the art of 73.9% unweighted accuracy (UA).
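
    A minimal PyTorch sketch of the fusion idea, assuming pre-extracted wav2vec-style acoustic frames and BERT-style token embeddings are already available as tensors; the dimensions, the GRU encoder, and the attention layout are illustrative choices rather than the paper's exact model.

        # Minimal sketch: fuse pre-extracted acoustic and textual sequence
        # features with an attention-based recurrent model.
        import torch
        import torch.nn as nn

        class AcousticTextFusion(nn.Module):
            def __init__(self, a_dim=512, t_dim=768, hid=128, n_emotions=4):
                super().__init__()
                self.a_rnn = nn.GRU(a_dim, hid, batch_first=True, bidirectional=True)
                self.t_proj = nn.Linear(t_dim, 2 * hid)
                self.attn = nn.MultiheadAttention(2 * hid, num_heads=4, batch_first=True)
                self.head = nn.Linear(4 * hid, n_emotions)

            def forward(self, a_feats, t_feats):     # (B, Ta, a_dim), (B, Tt, t_dim)
                a, _ = self.a_rnn(a_feats)           # (B, Ta, 2*hid) acoustic encoding
                t = self.t_proj(t_feats)             # (B, Tt, 2*hid) projected text tokens
                aligned, _ = self.attn(t, a, a)      # text queries attend over audio frames
                pooled = torch.cat([a.mean(1), aligned.mean(1)], dim=-1)
                return self.head(pooled)             # emotion logits

        logits = AcousticTextFusion()(torch.randn(2, 300, 512), torch.randn(2, 20, 768))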

    Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification

    We propose a method that quantifies the importance, namely the relevance, of audio segments for classification in weakly-labelled problems. It works by drawing information from a set of class-wise one-vs-all classifiers. By selecting which classifiers are used in each specific classification problem, the relevance measure adapts to different user-defined viewpoints without requiring additional neural network training. This allows the relevance measure to quickly highlight the audio segments that matter under user-defined criteria, a functionality useful for computer-assisted audio analysis. We also propose a neural network architecture, named RELNET, that leverages the relevance measure for weakly-labelled audio classification problems. RELNET was evaluated on the DCASE2018 dataset and achieved competitive classification results compared to previous attention-based proposals. Comment: Submitted to IEEE Signal Processing Letters
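
    A tiny NumPy sketch of the viewpoint-dependent relevance idea: given segment-level scores from a pool of one-vs-all classifiers, relevance for a user-selected subset of classes is computed without any retraining. The max-over-classes combination rule is an assumption made for illustration, not the paper's definition.

        # Minimal sketch of per-segment relevance from class-wise one-vs-all scores.
        import numpy as np

        def segment_relevance(seg_scores, selected_classes):
            """seg_scores: (n_segments, n_classes) one-vs-all probabilities.
            Returns a relevance value in [0, 1] per segment for the chosen viewpoint."""
            return seg_scores[:, selected_classes].max(axis=1)

        scores = np.random.rand(100, 10)                            # 100 segments, 10 one-vs-all classifiers
        rel = segment_relevance(scores, selected_classes=[2, 5])    # user-defined viewpoint
        top = np.argsort(rel)[::-1][:5]                             # most relevant segments to inspect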

    Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

    Speech emotion recognition is a challenging task because emotion expression is complex, multimodal, and fine-grained. In this paper, we propose a novel multimodal deep learning approach to perform fine-grained emotion recognition on real-life speech. We design a temporal alignment mean-max pooling mechanism to capture the subtle and fine-grained emotions implied in every utterance. In addition, we propose a cross modality excitement module that conducts sample-specific adjustment on cross-modality embeddings and adaptively recalibrates each modality's values using its aligned latent features from the other modality. Our proposed model is evaluated on two well-known real-world speech emotion recognition datasets. The results demonstrate that our approach is superior on prediction tasks for multimodal speech utterances and outperforms a wide range of baselines in terms of prediction accuracy. Furthermore, detailed ablation studies show that both the temporal alignment mean-max pooling mechanism and the cross modality excitement module contribute significantly to the promising results. To encourage research reproducibility, we make the code publicly available at https://github.com/tal-ai/FG_CME.git. Comment: The Interspeech Conference, 2021 (INTERSPEECH 2021)
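
    The gating idea behind a cross-modality excitement module can be sketched as follows in PyTorch, in the spirit of squeeze-and-excitation applied across modalities; the gate form and dimensions are assumptions, not the authors' exact module. Mean-max pooling over time could then be applied as torch.cat([x.mean(1), x.max(1).values], dim=-1).

        # Minimal sketch of cross-modality "excitement" (gating): each modality's
        # embedding is recalibrated by a gate computed from the aligned features
        # of the other modality.
        import torch
        import torch.nn as nn

        class CrossModalExcite(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.gate_a = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
                self.gate_t = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

            def forward(self, audio, text):             # (B, T, dim) aligned frame/token pairs
                audio_out = audio * self.gate_a(text)   # text decides which audio dims to excite
                text_out = text * self.gate_t(audio)    # and vice versa
                return audio_out, text_out

        a, t = torch.randn(2, 30, 256), torch.randn(2, 30, 256)
        a2, t2 = CrossModalExcite(256)(a, t)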

    Multimodal Continuous Emotion Recognition using Deep Multi-Task Learning with Correlation Loss

    In this study, we focus on continuous emotion recognition using body motion and speech signals to estimate Activation, Valence, and Dominance (AVD) attributes. A semi-end-to-end network architecture is proposed, in which both extracted features and raw signals are fed to the network, and the network is trained using multi-task learning (MTL) rather than the state-of-the-art single-task learning (STL). Furthermore, correlation losses based on the Concordance Correlation Coefficient (CCC) and the Pearson Correlation Coefficient (PCC) are used as optimization objectives during training. Experiments are conducted on the CreativeIT and RECOLA databases, and evaluations are performed using the CCC metric. To highlight the effects of MTL, the correlation losses, and multi-modality, we compare the performance of MTL against STL, the CCC loss against mean square error (MSE) and PCC losses, and multi-modality against single modality. We observe significant performance improvements with MTL training over STL, especially for the estimation of valence. Furthermore, the CCC loss achieves more than 7% CCC improvement on CreativeIT and 13% improvement on RECOLA relative to the MSE loss. Comment: 6 pages, letter
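
    For reference, a short PyTorch sketch of a CCC-based loss averaged over the three AVD attributes is shown below; it follows the standard CCC definition and is not taken from the paper's code.

        # Minimal sketch of a Concordance Correlation Coefficient (CCC) loss
        # for continuous emotion regression; 1 - CCC is minimized.
        import torch

        def ccc_loss(pred, target, eps=1e-8):
            """pred, target: (batch,) tensors for one attribute; returns 1 - CCC."""
            p_m, t_m = pred.mean(), target.mean()
            p_v, t_v = pred.var(unbiased=False), target.var(unbiased=False)
            cov = ((pred - p_m) * (target - t_m)).mean()
            ccc = 2 * cov / (p_v + t_v + (p_m - t_m) ** 2 + eps)
            return 1 - ccc

        def mtl_ccc_loss(preds, targets):           # (batch, 3): activation, valence, dominance
            return torch.stack(
                [ccc_loss(preds[:, i], targets[:, i]) for i in range(preds.shape[1])]).mean()

        loss = mtl_ccc_loss(torch.randn(16, 3), torch.randn(16, 3))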

    Multiscale Fractal Analysis on EEG Signals for Music-Induced Emotion Recognition

    Emotion recognition from EEG signals has long been researched, as it can assist numerous medical and rehabilitative applications. However, the complex and noisy structure of EEG signals has proven to be a serious barrier for traditional modeling methods. In this paper, we employ multifractal analysis to examine the behavior of EEG signals, in terms of the presence of fluctuations and the degree of fragmentation along their major frequency bands, for the task of emotion recognition. To extract emotion-related features, we utilize two novel algorithms for EEG analysis, based on Multiscale Fractal Dimension and Multifractal Detrended Fluctuation Analysis. The proposed feature extraction methods perform efficiently, surpassing some widely used baseline features on the competitive DEAP dataset, indicating that multifractal analysis could serve as a basis for the development of robust models for affective state recognition. Comment: 5 pages, 3 figures, 3 tables, European Signal Processing Conference (EUSIPCO) 2021, Dublin, Ireland
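
    A compact NumPy sketch of multifractal detrended fluctuation analysis (MFDFA) on a single 1-D signal is given below to make the fluctuation-analysis step concrete; the scale and q-order choices are arbitrary, and this textbook version is not the paper's exact feature extractor.

        # Minimal sketch of MFDFA: generalized Hurst exponents h(q) from the
        # scaling of the q-th order fluctuation function.
        import numpy as np

        def mfdfa(x, scales=(16, 32, 64, 128), qs=(-2, 2)):
            y = np.cumsum(x - np.mean(x))                      # integrated profile
            h = {}
            for q in qs:
                logF = []
                for s in scales:
                    n = len(y) // s
                    segs = y[:n * s].reshape(n, s)
                    t = np.arange(s)
                    # local linear detrending of each segment, then segment variance
                    var = [np.mean((seg - np.polyval(np.polyfit(t, seg, 1), t)) ** 2) for seg in segs]
                    logF.append(np.log(np.mean(np.power(var, q / 2.0)) ** (1.0 / q)))
                h[q] = np.polyfit(np.log(scales), logF, 1)[0]  # slope = h(q)
            return h

        eeg = np.random.randn(4096)                            # stand-in for a band-passed EEG channel
        print(mfdfa(eeg))                                      # spread of h(q) indicates multifractality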

    Deep neural networks for emotion recognition combining audio and transcripts

    In this paper, we propose to improve emotion recognition by combining acoustic information and conversation transcripts. On the one hand, an LSTM network was used to detect emotion from acoustic features such as f0, shimmer, jitter, and MFCCs. On the other hand, a multi-resolution CNN was used to detect emotion from word sequences. This CNN consists of several parallel convolutions with different kernel sizes, exploiting contextual information at different levels. A temporal pooling layer aggregates the hidden representations of different words into a single sequence-level embedding, from which we computed the emotion posteriors. We optimized a weighted sum of classification and verification losses; the verification loss pulls embeddings of the same emotion closer together while separating embeddings of different emotions. We also compared our CNN with state-of-the-art text-based hand-crafted features (e-vector). We evaluated our approach on the USC-IEMOCAP dataset as well as on a dataset consisting of US English telephone speech. For the former we used human-annotated transcripts, while for the latter we used ASR transcripts. The results showed that fusing audio and transcript information improved unweighted accuracy by a relative 24% for IEMOCAP and a relative 3.4% for the telephone data, compared to a single acoustic system.
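
    A minimal PyTorch sketch of the multi-resolution text branch: parallel 1-D convolutions with different kernel sizes over word embeddings, followed by temporal max pooling into a single utterance embedding. Vocabulary size, filter counts, and kernel sizes are illustrative assumptions.

        # Minimal sketch of a multi-resolution text CNN for emotion posteriors.
        import torch
        import torch.nn as nn

        class MultiResTextCNN(nn.Module):
            def __init__(self, vocab=5000, emb=100, n_filters=64, kernels=(2, 3, 4), n_emotions=4):
                super().__init__()
                self.emb = nn.Embedding(vocab, emb)
                self.convs = nn.ModuleList(
                    [nn.Conv1d(emb, n_filters, k, padding=k // 2) for k in kernels])
                self.head = nn.Linear(n_filters * len(kernels), n_emotions)

            def forward(self, tokens):                      # (B, T) word indices
                x = self.emb(tokens).transpose(1, 2)        # (B, emb, T)
                # parallel convolutions at different resolutions, then temporal max pooling
                feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
                return self.head(torch.cat(feats, dim=-1))  # emotion logits

        logits = MultiResTextCNN()(torch.randint(0, 5000, (8, 25)))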

    Learning Alignment for Multimodal Emotion Recognition from Speech

    Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition from human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, although emotion recognition can benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this approach ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on this dataset. Comment: InterSpeech 201
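
    A minimal PyTorch sketch of learning a soft word-to-frame alignment with dot-product attention and feeding the aligned features to a sequential classifier; the projection sizes, the LSTM, and the last-state pooling are assumptions for illustration, not the paper's exact model.

        # Minimal sketch: soft alignment between speech frames and text words,
        # followed by a sequential model over the aligned multimodal features.
        import torch
        import torch.nn as nn

        class AlignedFusion(nn.Module):
            def __init__(self, a_dim=80, w_dim=300, hid=128, n_emotions=4):
                super().__init__()
                self.a_proj = nn.Linear(a_dim, hid)
                self.w_proj = nn.Linear(w_dim, hid)
                self.rnn = nn.LSTM(hid + a_dim, hid, batch_first=True)
                self.head = nn.Linear(hid, n_emotions)

            def forward(self, frames, words):                    # (B, Tf, a_dim), (B, Tw, w_dim)
                w = self.w_proj(words)                           # (B, Tw, hid)
                scores = w @ self.a_proj(frames).transpose(1, 2) # (B, Tw, Tf) alignment scores
                align = torch.softmax(scores, dim=-1)            # each word attends over all frames
                ctx = align @ frames                             # (B, Tw, a_dim) aligned acoustic context
                h, _ = self.rnn(torch.cat([w, ctx], dim=-1))     # sequential model over aligned features
                return self.head(h[:, -1])                       # utterance-level emotion logits

        out = AlignedFusion()(torch.randn(2, 400, 80), torch.randn(2, 20, 300))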