Learning spectral-temporal features with 3D CNNs for speech emotion recognition
In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning compared to other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct clusters of emotions.
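The core operation the abstract describes — a single kernel sliding jointly over the temporal and spectral axes — can be sketched as a naive 3D convolution in numpy. All shapes below (the toy input patch, the 2×5×3 kernel) are illustrative choices, not the paper's actual configuration; a real model would stack such operations using an optimised library.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a single-channel volume with
    one kernel; real systems use optimised library ops instead."""
    d, h, w = kernel.shape
    D, H, W = volume.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+d, j:j+h, k:k+w] * kernel)
    return out

# Toy input: 8 time frames, 16 spectral bins, 4 context channels (illustrative).
x = np.random.randn(8, 16, 4)
# A "shallow temporal, moderately deep spectral" kernel: 2 frames x 5 bins x 3.
k3 = np.random.randn(2, 5, 3)
y = conv3d_valid(x, k3)
print(y.shape)  # (7, 12, 2)
```

Because one small kernel spans both axes, the layer captures short-term spectral detail and temporal change with far fewer parameters than separate 2D-CNN and recurrent stages.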
Learning Speech Emotion Representations in the Quaternion Domain
The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data, are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel with a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier allows each latent axis of the embeddings to be optimised for the classification of a specific emotion-related characteristic: valence, arousal, dominance and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimensions to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) with their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in test accuracy for all datasets, while drastically reducing the resource demands of the models. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo.
Comment: Paper submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
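The efficiency of quaternion-valued networks such as those fed by RH-emo comes from treating groups of four channels as a single quaternion and combining them with the Hamilton product, which shares weights across components, instead of using independent real-valued weights. A minimal sketch of that product (an illustration of the algebra, not RH-emo's actual code):

```python
import numpy as np

def hamilton_product(q, p):
    """Hamilton product of quaternions given as (w, x, y, z) arrays."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

# Basis quaternions: i*j = k, while j*i = -k (the product is non-commutative).
i = np.array([0, 1, 0, 0])
j = np.array([0, 0, 1, 0])
print(hamilton_product(i, j))  # [0 0 0 1], i.e. k
```

A quaternion linear layer built on this product needs 4 real weight matrices where a real-valued layer of the same width needs 16, which is the source of the parameter reduction the abstract reports.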
Multitask Learning from Augmented Auxiliary Data for Improving Speech Emotion Recognition
Despite the recent progress in speech emotion recognition (SER), state-of-the-art systems lack generalisation across different conditions. A key underlying reason for poor generalisation is the scarcity of emotion datasets, which is a significant roadblock to designing robust machine learning (ML) models. Recent works in SER focus on utilising multitask learning (MTL) methods to improve generalisation by learning shared representations. However, most of these studies propose MTL solutions that require meta labels for auxiliary tasks, which limits the training of SER systems. This paper proposes an MTL framework (MTL-AUG) that learns generalised representations from augmented data. We utilise augmentation-type classification and unsupervised reconstruction as auxiliary tasks, which allow training SER systems on augmented data without requiring any meta labels for the auxiliary tasks. The semi-supervised nature of MTL-AUG allows for the exploitation of abundant unlabelled data to further boost the performance of SER. We comprehensively evaluate the proposed framework in the following settings: (1) within corpus, (2) cross-corpus and cross-language, (3) noisy speech, and (4) adversarial attacks. Our evaluations using the widely used IEMOCAP, MSP-IMPROV, and EMODB datasets show improved results compared to existing state-of-the-art methods.
Comment: Under review, IEEE Transactions on Affective Computing
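The auxiliary-task setup described above can be sketched as a weighted sum of the primary emotion loss and the two auxiliary losses. The helper names and the weights `alpha` and `beta` below are hypothetical choices for illustration, not values taken from the MTL-AUG paper:

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -np.log(probs[label] + 1e-12)

def mtl_aug_loss(emo_probs, emo_label, aug_probs, aug_label,
                 recon, target, alpha=0.1, beta=0.1):
    """Primary emotion loss plus weighted auxiliary losses:
    augmentation-type classification and unsupervised reconstruction.
    alpha and beta are hypothetical task weights."""
    l_emo = cross_entropy(emo_probs, emo_label)
    l_aug = cross_entropy(aug_probs, aug_label)
    l_rec = np.mean((recon - target) ** 2)
    return l_emo + alpha * l_aug + beta * l_rec

# Toy example: confident emotion prediction, uncertain augmentation guess,
# imperfect reconstruction of a 4-sample target.
loss = mtl_aug_loss(np.array([0.9, 0.1]), 0,
                    np.array([0.5, 0.5]), 1,
                    np.zeros(4), np.ones(4) * 0.1)
print(loss)
```

Because the augmentation type is known by construction and reconstruction needs no labels at all, both auxiliary terms can also be computed on unlabelled audio, which is what makes the framework semi-supervised.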
Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review
Artificial Neural Networks (ANNs), inspired by the neural networks of the human brain, have been widely applied in speech processing. The application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, there has been a growing number of papers proposing ANNs supported by deep learning algorithms in conjunction with some mechanism to achieve symmetry with the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention mechanisms integrated into deep learning algorithms or of their relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000, in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) most relevant features, (ii) ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed and their strengths and weaknesses were determined.
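Many of the attention mechanisms such a review would categorise build on scaled dot-product attention. A minimal numpy version (a generic illustration, not code from any surveyed system) looks like this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

# 3 query frames attending over 5 key/value frames of dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (3, 8) (3, 5)
```

Each row of `w` is a probability distribution over the input frames, which is the loose analogue of selectively allocating auditory attention across a signal.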
SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker Embedding and Vision Transformers
In recent years, Speech Emotion Recognition (SER) has been investigated mainly by transforming the speech signal into spectrograms, which are then classified using Convolutional Neural Networks pretrained on generic images and fine-tuned with spectrograms. In this paper, we start from this general idea and develop a new learning solution for SER, based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding. With CCTs, the learning power of Vision Transformers (ViT) is combined with a diminished need for large volumes of data, made possible by the convolution. This is important in SER, where large corpora of data are usually not available. The speaker embedding allows the network to extract an identity representation of the speaker, which is then integrated, by means of a self-attention mechanism, with the features that the CCT extracts from the spectrogram. Overall, the solution is capable of operating in real time and shows promising results in a cross-corpus scenario, where training and test datasets are kept separate. Experiments have been performed on several benchmarks in a cross-corpus setting, as rarely done in the literature, with results that are comparable or superior to those obtained with state-of-the-art network architectures. Our code is available at https://github.com/JabuMlDev/Speaker-VGG-CCT.
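The fusion step described above — integrating a speaker-identity representation with the CCT's spectrogram features via self-attention — can be sketched as follows. All shapes, and the idea of simply prepending the speaker embedding as one extra token, are illustrative assumptions, not the authors' exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: 16 spectrogram tokens from the CCT plus 1
# speaker-embedding token, all of dimension 32.
d = 32
cct_tokens = rng.standard_normal((16, d))
speaker_token = rng.standard_normal((1, d))

# Prepend the speaker token, then let one self-attention step mix
# speaker identity into every spectrogram token.
tokens = np.concatenate([speaker_token, cct_tokens])   # (17, d)
weights = softmax(tokens @ tokens.T / np.sqrt(d))      # (17, 17)
mixed = weights @ tokens
print(mixed.shape)  # (17, 32)
```

After this mixing, every spectrogram feature carries some speaker-dependent context, which is one plausible way an identity representation can condition the emotion classifier.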
Survey of deep representation learning for speech emotion recognition
Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges a gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in the general setting without focusing on SER.
- …