Enhancing Speech Emotion Recognition Through Differentiable Architecture Search
Speech Emotion Recognition (SER) is a critical enabler of emotion-aware
communication in human-computer interactions. Recent advancements in Deep
Learning (DL) have substantially enhanced the performance of SER models through
increased model complexity. However, designing optimal DL architectures
requires prior experience and experimental evaluations. Encouragingly, Neural
Architecture Search (NAS) offers a promising avenue to determine an optimal DL
model automatically. In particular, Differentiable Architecture Search (DARTS)
is an efficient method of using NAS to search for optimised models. This paper
proposes a DARTS-optimised joint CNN and LSTM architecture to improve SER
performance; the CNN-LSTM coupling was selected because the literature reports
that it offers improved performance. While DARTS has previously been applied
to CNN and LSTM combinations, our approach introduces a novel mechanism,
particularly in selecting CNN operations using DARTS. In contrast to previous
studies, we refrain from imposing constraints on the order of the layers for
the CNN within the DARTS cell; instead, we allow DARTS to determine the optimal
layer order autonomously. Experimenting with the IEMOCAP and MSP-IMPROV
datasets, we demonstrate that our proposed methodology achieves significantly
higher SER accuracy than hand-engineering the CNN-LSTM configuration. It also
outperforms the best-reported SER results achieved using DARTS on CNN-LSTM.
Comment: 5 pages, 4 figures
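The continuous relaxation at the heart of DARTS can be sketched as follows: each edge of the search cell computes a softmax-weighted mixture of all candidate operations, and after search the operation with the largest architecture weight is kept. The candidate operations and numbers below are toy stand-ins for illustration, not the paper's actual search space.

```python
import math

# Toy candidate operations a DARTS cell might choose between
# (illustrative stand-ins for convolution/pooling layers).
OPS = {
    "identity": lambda x: x,
    "double":   lambda x: [2.0 * v for v in x],
    "smooth":   lambda x: [(x[max(i - 1, 0)] + v) / 2.0 for i, v in enumerate(x)],
}

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, alphas):
    """DARTS relaxation: the output is the softmax-weighted sum of every
    candidate operation applied to the same input."""
    weights = softmax(alphas)
    outputs = [op(x) for op in OPS.values()]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]

x = [1.0, 2.0, 3.0]
alphas = [0.1, 2.0, -1.0]   # architecture parameters, learned jointly with weights
y = mixed_op(x, alphas)

# After search, the operation with the largest alpha is retained.
best = max(zip(OPS, alphas), key=lambda kv: kv[1])[0]
```

Because the mixture is differentiable in the alphas, architecture search reduces to gradient descent; discretising afterwards keeps only the strongest operation per edge.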
AAEC: An Adversarial Autoencoder-based Classifier for Audio Emotion Recognition
Changzeng Fu, Jiaqi Shi, Chaoran Liu, Carlos Toshinori Ishi, and Hiroshi Ishiguro. 2020. AAEC: An Adversarial Autoencoder-based Classifier for Audio Emotion Recognition. In Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop (MuSe'20). Association for Computing Machinery, New York, NY, USA, 45–51. DOI: https://doi.org/10.1145/3423327.3423669. Presented at MM '20: The 28th ACM International Conference on Multimedia, October 16, 2020.
Unsupervised Adversarial Domain Adaptation for Cross-Lingual Speech Emotion Recognition
Cross-lingual speech emotion recognition (SER) is a crucial task for many
real-world applications. The performance of SER systems is often degraded by
the differences in the distributions of training and test data. These
differences become more pronounced when training and test data belong to
different languages, causing a significant performance gap between the
validation and test scores. It is imperative to build more robust models that
are suitable for practical applications of SER systems. Therefore, in this paper, we
propose a Generative Adversarial Network (GAN)-based model for multilingual
SER. Our choice of a GAN is motivated by the great success of GANs in learning
the underlying data distribution. The proposed model is designed so that it can
learn language-invariant representations without requiring
target-language data labels. We evaluate our proposed model on four different
language emotional datasets, including an Urdu-language dataset, thereby also
incorporating an alternative language for which labelled data is difficult to
find and which has not been studied much by the mainstream community. Our results
show that our proposed model can significantly improve the baseline
cross-lingual SER performance for all the considered datasets, including the
non-mainstream Urdu-language data, without requiring any labels.
Comment: Accepted at Affective Computing & Intelligent Interaction (ACII 2019).
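The abstract does not detail the adversarial setup; one standard way to learn domain-invariant (here, language-invariant) features without target-language labels is a gradient-reversal layer feeding a language discriminator, sketched below for a 1-D toy feature. This DANN-style mechanism and all parameter values are assumptions for illustration, not necessarily the paper's exact GAN formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy language discriminator: predicts the source language (label y in {0, 1})
# of a 1-D feature z using weight v. Binary cross-entropy gradient wrt z:
def disc_grad_wrt_feature(z, y, v):
    return (sigmoid(v * z) - y) * v

def reversed_grad(z, y, v, lam=1.0):
    """Gradient reversal: the feature extractor receives the *negated*
    discriminator gradient, so its update produces features the language
    discriminator cannot exploit, i.e. language-invariant features."""
    return -lam * disc_grad_wrt_feature(z, y, v)

g_disc = disc_grad_wrt_feature(1.0, 1.0, 2.0)  # trains the discriminator
g_feat = reversed_grad(1.0, 1.0, 2.0)          # trains the feature extractor
```

The sign flip is the whole trick: both players share one loss, and only the direction of the feature extractor's update is inverted.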
Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
Deep learning techniques have considerably improved speech processing in
recent years. Speech representations extracted by deep learning models are
being used in a wide range of tasks such as speech recognition, speaker
recognition, and speech emotion recognition. Attention models play an important
role in improving deep learning models. However, current attention mechanisms
are unable to attend to fine-grained information items. In this paper, we
propose the novel Fine-grained Early Frequency Attention (FEFA) for speech
signals. This model is capable of focusing on information items as small as
frequency bins. We evaluate the proposed model on two popular tasks of speaker
recognition and speech emotion recognition. Two widely used public datasets,
VoxCeleb and IEMOCAP, are used for our experiments. The model is implemented on
top of several prominent deep models as backbone networks to evaluate its
impact on performance compared to the original networks and other related work.
Our experiments show that by adding FEFA to different CNN architectures,
performance is consistently improved by substantial margins, even setting a new
state-of-the-art for the speaker recognition task. We also tested our model
against different levels of added noise, showing improved robustness and lower
sensitivity compared to the backbone networks.
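The mechanism can be sketched as a per-bin re-weighting of each spectrogram frame, which is what lets attention act at the granularity of a single frequency bin. The softmax scoring below is an assumption for illustration; the paper's actual attention network is not reproduced here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def frequency_attention(frame, logits):
    """Re-weight every frequency bin of one spectrogram frame by an
    attention weight, so downstream layers can focus on information
    items as small as a single bin."""
    weights = softmax(logits)
    return [w * v for w, v in zip(weights, frame)]

frame = [0.2, 1.5, 0.1, 0.9]    # energies of 4 frequency bins (toy values)
logits = [0.0, 3.0, -2.0, 1.0]  # hypothetical learned per-bin scores
attended = frequency_attention(frame, logits)
```

Because the weighting happens before the backbone network sees the input, the same module can sit in front of different CNN architectures, matching how the paper evaluates FEFA on several backbones.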
High-Fidelity Audio Generation and Representation Learning with Guided Adversarial Autoencoder
Unsupervised disentangled representation learning from unlabelled audio
data and high-fidelity audio generation have become two linchpins of
machine learning research. However, a representation learned in an
unsupervised setting is not guaranteed to be usable for the downstream task
at hand, which can waste resources if the training was conducted for that
particular posterior job. Conversely, if the model is highly biased towards
the downstream task during representation learning, it loses the
generalisation capability that would let it scale to other related tasks,
even though the bias directly benefits the downstream job. Therefore, to
fill this gap, we propose a new autoencoder-based model named "Guided
Adversarial Autoencoder (GAAE)", which can learn both task-specific
representations and general representations capturing the factors of
variation in the training data by leveraging a small percentage of labelled
samples, thus making it suitable for future related tasks. Furthermore, our
proposed model can generate audio of superior quality that is
indistinguishable from real audio samples. Hence, with extensive
experimental results, we demonstrate that by harnessing the power of
high-fidelity audio generation, the proposed GAAE model can learn powerful
representations from an unlabelled dataset while leveraging only a small
percentage of labelled data as supervision/guidance.
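The adversarial half of an adversarial autoencoder can be summarised by its two losses: a discriminator learns to separate samples drawn from the prior from encoder outputs, while the encoder is trained to fool it, which shapes the latent space to match the prior. The linear encoder, logistic discriminator, and all numbers below are illustrative stand-ins, not GAAE's architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy linear encoder and logistic latent discriminator (illustrative
# fixed parameters; in practice both are trained networks).
def encode(x, w=0.5):
    return w * x

def disc(z, v=1.0):
    """Probability that z was drawn from the prior rather than produced
    by the encoder."""
    return sigmoid(v * z)

def aae_losses(batch, prior_samples):
    """Adversarial-autoencoder objective: the discriminator separates
    prior samples (label 1) from encoder outputs (label 0); the encoder's
    adversarial loss rewards latents the discriminator mistakes for prior
    samples."""
    latents = [encode(x) for x in batch]
    d_loss = (-sum(math.log(disc(p)) for p in prior_samples) / len(prior_samples)
              - sum(math.log(1.0 - disc(z)) for z in latents) / len(latents))
    g_loss = -sum(math.log(disc(z)) for z in latents) / len(latents)
    return d_loss, g_loss

d_loss, g_loss = aae_losses([1.0, -2.0], [0.3, -0.1])
```

Alternating updates on these two losses (plus the usual reconstruction loss, omitted here) is what matches the aggregated posterior to the prior; the "guided" part of GAAE additionally uses the small labelled subset to steer the representation.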
Survey of deep representation learning for speech emotion recognition
Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges the gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in the general setting without focusing on SER.
Using a Deep Learning-Based Framework for Child Speech Emotion Recognition
Biological languages of the body through which human emotion can be detected abound, including heart rate, facial expressions, movement of the eyelids and dilation of the eyes, body posture, skin conductance, and even the speech we make. Speech emotion recognition research started some three decades ago, and the popular Interspeech Emotion Challenge has helped to propagate this research area. However, most speech recognition research is focused on adults, and there is very little research on child speech. This dissertation describes the development and evaluation of a child speech emotion recognition framework. The higher-level components of the framework are designed to sort and separate speech based on the speaker's age, ensuring that focus is only on speech produced by children. The framework uses Baddeley's Theory of Working Memory to model a Working Memory Recurrent Network that can process and recognize emotions from speech. Baddeley's Theory of Working Memory offers one of the best explanations of how the human brain holds and manipulates temporary information, which is crucial for the development of neural networks that learn effectively. Experiments were designed and performed to answer the research questions, evaluate the proposed framework, and benchmark its performance against other methods. Satisfactory results were obtained from the experiments, and in many cases our framework was able to outperform other popular approaches. This study has implications for various applications of child speech emotion recognition, such as child abuse detection and child learning robots.