Improving Noise Robustness In Speaker Identification Using A Two-Stage Attention Model
While the use of deep neural networks has significantly boosted speaker
recognition performance, it is still challenging to separate speakers in poor
acoustic environments. To improve the robustness of speaker recognition systems
in noise, a novel two-stage attention mechanism is proposed that can be used in
existing architectures such as Time Delay Neural Networks (TDNNs) and
Convolutional Neural Networks (CNNs). Noise is known to often mask important
information in both the time and frequency domains. The proposed mechanism
allows the models to concentrate on reliable time/frequency components of the
signal. The proposed approach is evaluated on the VoxCeleb1 dataset, which is
designed for assessing speaker recognition in real-world conditions. In
addition, three types of noise at different signal-to-noise ratios (SNRs) were
added for this work. The proposed mechanism is compared with three strong
baselines: X-vector, Attentive X-vector, and ResNet-34. Results on both
identification and verification tasks show that the two-stage attention
mechanism consistently improves upon these baselines under all noise conditions.
Comment: Submitted to Interspeech202
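The abstract does not give the attention formulation, so the following is only a minimal sketch of what a two-stage time/frequency attention could look like on a CNN/TDNN feature map; the module name, the squeeze-excitation-style gating, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoStageTimeFreqAttention(nn.Module):
    """Illustrative two-stage attention: first re-weight time frames,
    then frequency bins, of a feature map of shape (B, C, T, F)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Stage 1: attention over the time axis.
        self.time_att = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1), nn.Sigmoid())
        # Stage 2: attention over the frequency axis.
        self.freq_att = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, T, F)
        t_ctx = x.mean(dim=3, keepdim=True)    # pool over freq -> (B, C, T, 1)
        x = x * self.time_att(t_ctx)           # emphasise reliable frames
        f_ctx = x.mean(dim=2, keepdim=True)    # pool over time -> (B, C, 1, F)
        x = x * self.freq_att(f_ctx)           # emphasise reliable freq bins
        return x

feats = torch.randn(8, 64, 200, 40)            # e.g. 200 frames x 40 mel bins
out = TwoStageTimeFreqAttention(64)(feats)     # same shape, re-weighted
```

The intuition being illustrated is that components masked by noise in either axis can be attenuated before pooling into the speaker embedding.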
Addressing Ambiguity of Emotion Labels Through Meta-Learning
Emotion labels in emotion recognition corpora are highly noisy and ambiguous,
due to the annotators' subjective perception of emotions. Such ambiguity may
introduce errors in automatic classification and affect the overall
performance. We therefore propose a dynamic label correction and sample
contribution weight estimation model. Our model is based on a standard BLSTM
model with attention, extended with two sets of extra parameters. The first
learns a new corrected label distribution and aims to fix inaccurate labels in
the dataset. The second estimates the contribution of each sample to the
training process and aims to down-weight ambiguous and noisy samples while
giving higher weight to clear ones. We train our model through an alternating
optimization method: in one epoch we update the neural network parameters, and
in the next we keep them fixed and update the label correction and sample
importance parameters. When training and evaluating our model on the IEMOCAP
dataset, we obtained a weighted accuracy (WA) of 65.9% and an unweighted
accuracy (UA) of 61.4%, an absolute improvement of 2.5% and 2.7%, respectively,
over a BLSTM-with-attention baseline trained on the corpus gold labels.
Comment: Submitted to ICASSP 202
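A minimal sketch of the alternating optimization described above is given below, assuming per-sample learnable label logits and contribution weights attached to a placeholder classifier; the parameter names, the regularizer that keeps weights from collapsing, and the toy data are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

N, C = 1000, 4                               # samples, emotion classes
model = torch.nn.Linear(128, C)              # stand-in for the BLSTM+attention
hard_labels = torch.randint(0, C, (N,))

# Extra per-sample parameters: corrected label logits and contribution weights.
label_logits = torch.nn.Parameter(F.one_hot(hard_labels, C).float() * 5.0)
weight_logits = torch.nn.Parameter(torch.zeros(N))

opt_net = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_aux = torch.optim.Adam([label_logits, weight_logits], lr=1e-2)

def loss_fn(idx, x):
    logits = model(x)
    soft_targets = F.softmax(label_logits[idx], dim=-1)   # corrected labels
    ce = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1)
    w = torch.sigmoid(weight_logits[idx])                 # sample contribution
    return (w * ce).mean() - 0.1 * w.mean()               # crude term keeping w away from 0

for epoch in range(10):
    idx = torch.randint(0, N, (32,))
    x = torch.randn(32, 128)                              # placeholder features
    if epoch % 2 == 0:                                    # update network parameters
        opt_net.zero_grad(); loss_fn(idx, x).backward(); opt_net.step()
    else:                                                 # update labels and weights only
        opt_aux.zero_grad(); loss_fn(idx, x).backward(); opt_aux.step()
```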
EigenEmo: Spectral Utterance Representation Using Dynamic Mode Decomposition for Speech Emotion Classification
Human emotional speech is, by its very nature, a highly variable signal, which
introduces dynamics intrinsic to speech-based automatic emotion classification.
In this work, we explore a spectral decomposition method stemming from fluid
dynamics, known as Dynamic Mode Decomposition (DMD), to computationally
represent and analyze the global utterance-level dynamics of emotional speech.
Specifically, segment-level emotion-specific representations are first learned
through an Emotion Distillation process. This forms a multi-dimensional signal
of emotion flow for each utterance, called Emotion Profiles (EPs). The DMD
algorithm is then applied to the resultant EPs to capture the eigenfrequencies,
and hence the fundamental transition dynamics of the emotion flow. Evaluation
experiments using the proposed approach, which we call EigenEmo, show promising
results. Moreover, owing to their complementary properties, concatenating the
utterance representations generated by EigenEmo with simple EP averaging yields
noticeable gains.
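For readers unfamiliar with DMD, the sketch below shows the standard exact-DMD eigenvalue computation applied to a toy Emotion Profile matrix; the synthetic EP data, the function name, and the optional rank truncation are illustrative assumptions, not the EigenEmo implementation.

```python
import numpy as np

def dmd_eigenvalues(ep, rank=None):
    """Exact DMD on an Emotion Profile matrix `ep` of shape (n_emotions, n_segments).
    The returned eigenvalues encode the dominant transition dynamics
    (eigenfrequencies) of the emotion flow."""
    X, Y = ep[:, :-1], ep[:, 1:]                # snapshot pairs x_k -> x_{k+1}
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    if rank is not None:                        # optional truncation
        U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Low-rank approximation of the linear operator A with Y ~ A X.
    A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals, _ = np.linalg.eig(A_tilde)
    return eigvals

# Toy emotion-profile sequence: 4 emotion dimensions over 50 segments.
ep = np.abs(np.random.randn(4, 50)).cumsum(axis=1)
print(dmd_eigenvalues(ep))
```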
Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning
We propose a novel transfer learning method for speech emotion recognition
allowing us to obtain promising results when only little training data is
available. With as few as 125 examples per emotion class, we were able to reach
a higher accuracy than a strong baseline trained on 8 times more data. Our
method leverages knowledge contained in pre-trained speech representations
extracted from models trained on a more general self-supervised task that does
not require human annotations, such as the wav2vec model. We provide
detailed insights on the benefits of our approach by varying the training data
size, which can help labeling teams to work more efficiently. We compare
performance with other popular methods on the IEMOCAP dataset, a dataset that
is widely used for benchmarking in the Speech Emotion Recognition (SER) research
community. Furthermore, we demonstrate that results can be greatly improved by
combining acoustic and linguistic knowledge from transfer learning. We align
acoustic pre-trained representations with semantic representations from the
BERT model through an attention-based recurrent neural network. Performance
improves significantly when combining both modalities and scales with the
amount of data. When trained on the full IEMOCAP dataset, we reach a new
state-of-the-art of 73.9% unweighted accuracy (UA).
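As a rough illustration of the attention-based fusion described above, the sketch below aligns placeholder acoustic frame features (standing in for wav2vec representations) with placeholder token embeddings (standing in for BERT outputs) and feeds the aligned sequence to a recurrent classifier; the module names, dimensions, and the multi-head attention choice are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AcousticTextFusion(nn.Module):
    """Illustrative fusion: each text token attends over acoustic frames;
    the aligned sequence is fed to a GRU and classified."""

    def __init__(self, acoustic_dim=512, text_dim=768, hidden=256, n_emotions=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=text_dim, kdim=acoustic_dim,
                                          vdim=acoustic_dim, num_heads=4,
                                          batch_first=True)
        self.rnn = nn.GRU(text_dim * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_emotions)

    def forward(self, acoustic, text):          # (B, Ta, Da), (B, Tt, Dt)
        aligned, _ = self.attn(text, acoustic, acoustic)   # acoustic info per token
        fused = torch.cat([text, aligned], dim=-1)
        _, h = self.rnn(fused)
        return self.out(h[-1])

# Placeholder tensors standing in for wav2vec frames and BERT token embeddings.
logits = AcousticTextFusion()(torch.randn(2, 300, 512), torch.randn(2, 20, 768))
```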
Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification
We propose a method that quantifies the importance, namely relevance, of
audio segments for classification in weakly-labelled problems. It works by
drawing information from a set of class-wise one-vs-all classifiers. By
selecting the classifiers used in each specific classification problem, the
relevance measure adapts to different user-defined viewpoints without requiring
additional neural network training. This characteristic allows the relevance
measure to quickly highlight audio segments according to user-defined criteria.
Such functionality can be used for computer-assisted audio analysis.
Also, we propose a neural network architecture, namely RELNET, that leverages
the relevance measure for weakly-labelled audio classification problems. RELNET
was evaluated on the DCASE2018 dataset and achieved competitive classification
results when compared to previous attention-based proposals.
Comment: Submitted to IEEE Signal Processing Letters
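The abstract does not specify how relevance is computed from the one-vs-all classifiers; the snippet below is one plausible reading under stated assumptions, where the relevance of a segment under a user-defined viewpoint is the strongest response among a selected subset of class-wise classifier scores, with no retraining.

```python
import numpy as np

def segment_relevance(segment_scores, selected_classes):
    """segment_scores: (n_segments, n_classes) one-vs-all probabilities per segment.
    Relevance w.r.t. a user-defined viewpoint = the strongest response among the
    selected class-wise classifiers (no additional training required)."""
    return segment_scores[:, selected_classes].max(axis=1)

scores = np.random.rand(12, 10)            # e.g. 12 audio segments, 10 classes
print(segment_relevance(scores, selected_classes=[2, 5]))  # viewpoint: classes 2 and 5
```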
Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition
Speech emotion recognition is a challenging task because emotion expression is
complex, multimodal, and fine-grained. In this paper, we propose a novel
multimodal deep learning approach to perform fine-grained emotion recognition
from real-life speech. We design a temporal alignment mean-max
pooling mechanism to capture the subtle and fine-grained emotions implied in
every utterance. In addition, we propose a cross modality excitement module to
conduct sample-specific adjustment on cross modality embeddings and adaptively
recalibrate the corresponding values based on the aligned latent features from the
other modality. Our proposed model is evaluated on two well-known real-world
speech emotion recognition datasets. The results demonstrate that our approach
is superior on the prediction tasks for multimodal speech utterances, and it
outperforms a wide range of baselines in terms of prediction accuracy.
Furthermore, we conduct detailed ablation studies to show that our temporal
alignment mean-max pooling mechanism and cross modality excitement contribute
significantly to the promising results. To encourage reproducibility, we make
the code publicly available at \url{https://github.com/tal-ai/FG_CME.git}.
Comment: The Interspeech Conference, 2021 (INTERSPEECH 2021)
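The sketch below illustrates the two ingredients highlighted in the ablations, under assumed shapes and module names: a gated, squeeze-excitation-style cross-modality recalibration and temporal mean-max pooling. It is only an illustrative reading of the abstract, not the released FG_CME code.

```python
import torch
import torch.nn as nn

class CrossModalityExcitement(nn.Module):
    """Illustrative excitement block: features of one modality are re-scaled by a
    sigmoid gate computed from the temporally aligned features of the other."""

    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Linear(dim_b, dim_a), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Linear(dim_a, dim_b), nn.Sigmoid())

    def forward(self, a, b):                 # a: (B, T, Da), b: (B, T, Db), aligned in T
        return a * self.gate_a(b), b * self.gate_b(a)

def mean_max_pool(x):                        # temporal mean-max pooling: (B, T, D) -> (B, 2D)
    return torch.cat([x.mean(dim=1), x.max(dim=1).values], dim=-1)

audio, text = torch.randn(2, 30, 128), torch.randn(2, 30, 256)
a, b = CrossModalityExcitement(128, 256)(audio, text)
utt = torch.cat([mean_max_pool(a), mean_max_pool(b)], dim=-1)   # utterance embedding
```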
Multimodal Continuous Emotion Recognition using Deep Multi-Task Learning with Correlation Loss
In this study, we focus on continuous emotion recognition using body motion
and speech signals to estimate Activation, Valence, and Dominance (AVD)
attributes. A semi-end-to-end network architecture is proposed, in which both
extracted features and raw signals are fed to the network, and the network is
trained using multi-task learning (MTL) rather than the single-task learning
(STL) used in state-of-the-art systems. Furthermore, correlation losses, the
Concordance Correlation Coefficient
(CCC) and Pearson Correlation Coefficient (PCC), are used as an optimization
objective during training. Experiments are conducted on the CreativeIT and
RECOLA databases, and evaluations are performed using the CCC metric. To
highlight the effect of MTL, correlation losses and multi-modality, we
compare the performance of MTL against STL, the CCC loss against the mean
square error (MSE) loss and the PCC loss, and multi-modality against single
modality. We observe significant performance improvements with MTL training
over STL, especially for valence estimation. Furthermore, the CCC loss achieves
more than 7% CCC improvement on CreativeIT and 13% improvement on RECOLA
against the MSE loss.
Comment: 6 pages, letter
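The CCC objective itself is well defined and can be written compactly; the snippet below shows 1 - CCC for a single attribute and an assumed equal-weight multi-task combination over the three AVD attributes (the weighting scheme and toy tensors are assumptions, not the paper's settings).

```python
import torch

def ccc_loss(pred, target, eps=1e-8):
    """1 - Concordance Correlation Coefficient over a batch of frame-level
    predictions for one attribute (Activation, Valence, or Dominance)."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1 - ccc

# Multi-task objective: average the CCC losses of the three AVD attributes.
pred, target = torch.rand(2, 100, 3), torch.rand(2, 100, 3)
loss = sum(ccc_loss(pred[..., i], target[..., i]) for i in range(3)) / 3
```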
Multiscale Fractal Analysis on EEG Signals for Music-Induced Emotion Recognition
Emotion Recognition from EEG signals has long been researched as it can
assist numerous medical and rehabilitative applications. However, their complex
and noisy structure has proven to be a serious barrier for traditional modeling
methods. In this paper, we employ multifractal analysis to examine the behavior
of EEG signals in terms of the presence of fluctuations and the degree of
fragmentation across their major frequency bands, for the task of emotion
recognition. To extract emotion-related features, we utilize two novel
algorithms for EEG analysis, based on Multiscale Fractal Dimension and
Multifractal Detrended Fluctuation Analysis. The proposed feature extraction
methods perform efficiently, surpassing some widely used baseline features on
the competitive DEAP dataset, indicating that multifractal analysis could serve
as a basis for the development of robust models for affective state recognition.
Comment: 5 pages, 3 figures, 3 tables, European Signal Processing Conference
(EUSIPCO) 2021, Dublin, Ireland
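As a reference for the second of the two analysis techniques, the sketch below implements textbook Multifractal Detrended Fluctuation Analysis on a synthetic signal; the scales, q values, and detrending order are illustrative choices, not the feature settings used in the paper.

```python
import numpy as np

def mfdfa(signal, scales, q_values, order=1):
    """Multifractal Detrended Fluctuation Analysis (sketch).
    Returns F_q(s): the q-th order fluctuation function for every scale s."""
    profile = np.cumsum(signal - np.mean(signal))      # integrated, demeaned signal
    Fq = np.zeros((len(q_values), len(scales)))
    for j, s in enumerate(scales):
        n_seg = len(profile) // s
        variances = []
        for v in range(n_seg):                         # detrend each segment
            seg = profile[v * s:(v + 1) * s]
            t = np.arange(s)
            trend = np.polyval(np.polyfit(t, seg, order), t)
            variances.append(np.mean((seg - trend) ** 2))
        variances = np.asarray(variances)
        for i, q in enumerate(q_values):
            if q == 0:                                  # limit case for q = 0
                Fq[i, j] = np.exp(0.5 * np.mean(np.log(variances)))
            else:
                Fq[i, j] = np.mean(variances ** (q / 2)) ** (1 / q)
    return Fq

# Generalised Hurst exponents h(q): slope of log F_q(s) vs log s.
x = np.random.randn(4096)                               # stand-in for a band-passed EEG channel
scales = np.array([16, 32, 64, 128, 256, 512])
q_values = np.array([-4, -2, 0, 2, 4], dtype=float)
Fq = mfdfa(x, scales, q_values)
h_q = [np.polyfit(np.log(scales), np.log(Fq[i]), 1)[0] for i in range(len(q_values))]
```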
Deep neural networks for emotion recognition combining audio and transcripts
In this paper, we propose to improve emotion recognition by combining
acoustic information and conversation transcripts. On the one hand, an LSTM
network was used to detect emotion from acoustic features such as F0, shimmer,
jitter, and MFCCs. On the other hand, a multi-resolution CNN was used to detect
emotion from word sequences. This CNN consists of several parallel convolutions
with different kernel sizes to exploit contextual information at different
levels. A temporal pooling layer aggregates the hidden representations of
different words into a single sequence-level embedding, from which we computed
the emotion posteriors. We optimized a weighted sum of classification and
verification losses. The verification loss tries to bring embeddings from the
same emotions closer while separating embeddings from different emotions. We
also compared our CNN with state-of-the-art text-based hand-crafted features
(e-vector). We evaluated our approach on the USC-IEMOCAP dataset as well as a
dataset consisting of US English telephone speech. In the former, we used
human-annotated transcripts, while in the latter, we used ASR transcripts. The
results showed that fusing audio and transcript information improved unweighted
accuracy by a relative 24% for IEMOCAP and a relative 3.4% for the telephone
data compared to a single acoustic system.
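The multi-resolution text branch can be sketched as parallel 1-D convolutions with different kernel sizes over word embeddings, followed by temporal pooling, as below; the embedding size, filter counts, and kernel sizes are assumed values, and the verification loss is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiResolutionTextCNN(nn.Module):
    """Illustrative multi-resolution CNN: parallel 1-D convolutions with different
    kernel sizes over word embeddings, followed by temporal mean pooling."""

    def __init__(self, emb_dim=300, n_filters=100, kernel_sizes=(1, 3, 5), n_emotions=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes)
        self.out = nn.Linear(n_filters * len(kernel_sizes), n_emotions)

    def forward(self, emb):                   # emb: (B, T, emb_dim) word embeddings
        x = emb.transpose(1, 2)               # -> (B, emb_dim, T)
        h = torch.cat([torch.relu(c(x)) for c in self.convs], dim=1)
        utt = h.mean(dim=2)                   # temporal pooling -> sequence-level embedding
        return self.out(utt)                  # emotion posteriors (logits)

logits = MultiResolutionTextCNN()(torch.randn(2, 25, 300))   # 25-word utterances
```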
Learning Alignment for Multimodal Emotion Recognition from Speech
Speech emotion recognition is a challenging problem because humans convey
emotions in subtle and complex ways. For emotion recognition on human speech,
one can either extract emotion related features from audio signals or employ
speech recognition techniques to generate text from speech and then apply
natural language processing to analyze the sentiment. Further, although emotion
recognition can benefit from audio-textual multimodal information, it is not
trivial to build a system that learns from multiple modalities. One can build
models for the two input sources separately and combine them at the decision
level, but this method ignores the interaction between speech and text in the
temporal domain. In this paper, we propose to use an attention mechanism to learn the
alignment between speech frames and text words, aiming to produce more accurate
multimodal feature representations. The aligned multimodal features are fed
into a sequential model for emotion recognition. We evaluate the approach on
the IEMOCAP dataset, and the experimental results show that the proposed approach
achieves state-of-the-art performance on the dataset.
Comment: InterSpeech 201
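A minimal sketch of the word-to-frame attention alignment described above is given below; the dot-product attention, feature dimensions, and LSTM classifier are assumptions chosen for brevity rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionAlignment(nn.Module):
    """Illustrative alignment: each text word attends over all speech frames with
    dot-product attention; the attended acoustic vector is concatenated to the
    word feature and the aligned sequence is fed to an LSTM classifier."""

    def __init__(self, speech_dim=40, text_dim=100, hidden=128, n_emotions=4):
        super().__init__()
        self.proj = nn.Linear(text_dim, speech_dim)       # map words into the speech space
        self.lstm = nn.LSTM(speech_dim + text_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_emotions)

    def forward(self, speech, text):          # speech: (B, Ts, Ds), text: (B, Tw, Dt)
        scores = self.proj(text) @ speech.transpose(1, 2)   # (B, Tw, Ts) alignment scores
        attended = torch.softmax(scores, dim=-1) @ speech   # (B, Tw, Ds) per-word acoustics
        aligned = torch.cat([text, attended], dim=-1)
        _, (h, _) = self.lstm(aligned)
        return self.out(h[-1])

logits = AttentionAlignment()(torch.randn(2, 200, 40), torch.randn(2, 15, 100))
```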