Survey of deep representation learning for speech emotion recognition
Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features produced through feature engineering. However, designing handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are learned automatically in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important areas for future research. Our survey bridges a gap in the literature, since existing surveys focus either on SER with hand-engineered features or on representation learning in the general setting without a focus on SER.
Feature Learning from Spectrograms for Assessment of Personality Traits
Several methods have recently been proposed to analyze speech and
automatically infer the personality of the speaker. These methods often rely on
prosodic and other hand crafted speech processing features extracted with
off-the-shelf toolboxes. To achieve high accuracy, numerous features are
typically extracted using complex and highly parameterized algorithms. In this
paper, a new method based on feature learning and spectrogram analysis is
proposed to simplify the feature extraction process while maintaining a high
level of accuracy. The proposed method learns a dictionary of discriminant
features from patches extracted in the spectrogram representations of training
speech segments. Each speech segment is then encoded using the dictionary, and
the resulting feature set is used to perform classification of personality
traits. Experiments indicate that the proposed method achieves state-of-the-art
results with a significant reduction in complexity when compared to the most
recent reference methods. The number of features, and the difficulties linked to
the feature extraction process, are greatly reduced, as only one type of
descriptor is used, for which the 6 parameters can be tuned automatically. In
contrast, the simplest reference method uses 4 types of descriptors to which 6
functionals are applied, resulting in over 20 parameters to be tuned.
Comment: 12 pages, 3 figures
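The patch-based pipeline the abstract describes (learn a dictionary from spectrogram patches, then encode each segment against it) can be sketched in a few lines. The sketch below is a loose illustration, not the authors' method: it stands in a simple k-means dictionary for their discriminant dictionary learning and a bag-of-atoms histogram for their encoding, and all parameter values (patch size, 16 atoms) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(spectrogram, patch=8, step=4):
    """Slice a (freq x time) spectrogram into flattened square patches."""
    f, t = spectrogram.shape
    out = []
    for i in range(0, f - patch + 1, step):
        for j in range(0, t - patch + 1, step):
            out.append(spectrogram[i:i + patch, j:j + patch].ravel())
    return np.array(out)

def learn_dictionary(patches, n_atoms=16, iters=20):
    """Toy k-means dictionary learning over patch vectors (a stand-in for
    the discriminant dictionary learning used in the paper)."""
    atoms = patches[rng.choice(len(patches), n_atoms, replace=False)]
    for _ in range(iters):
        # assign each patch to its nearest atom, then recompute atom means
        d = ((patches[:, None, :] - atoms[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_atoms):
            sel = patches[labels == k]
            if len(sel):
                atoms[k] = sel.mean(0)
    return atoms

def encode(spectrogram, atoms, patch=8, step=4):
    """Bag-of-atoms histogram: one fixed-length feature vector per segment."""
    p = extract_patches(spectrogram, patch, step)
    d = ((p[:, None, :] - atoms[None]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(atoms)).astype(float)
    return hist / hist.sum()

# demo: encode a random stand-in "spectrogram" segment
spec = rng.standard_normal((64, 128))
atoms = learn_dictionary(extract_patches(spec), n_atoms=16)
feat = encode(spec, atoms)
```

The resulting `feat` is a single fixed-length descriptor per segment, which is the property the abstract emphasises: one descriptor type, a handful of parameters, and a feature vector ready for any standard classifier.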
Unsupervised Adversarial Domain Adaptation for Cross-Lingual Speech Emotion Recognition
Cross-lingual speech emotion recognition (SER) is a crucial task for many
real-world applications. The performance of SER systems is often degraded by
the differences in the distributions of training and test data. These
differences become more apparent when the training and test data belong to
different languages, causing a significant performance gap between
validation and test scores. It is imperative to build more robust models that
are suitable for practical SER applications. Therefore, in this paper, we
propose a Generative Adversarial Network (GAN)-based model for multilingual
SER. Our choice of a GAN is motivated by the success of GANs in learning
underlying data distributions. The proposed model is designed so that it
can learn language-invariant representations without requiring
target-language data labels. We evaluate our proposed model on four different
language emotional datasets, including an Urdu-language dataset to also
incorporate alternative languages for which labelled data is difficult to find
and which have not been studied much by the mainstream community. Our results
show that our proposed model can significantly improve the baseline
cross-lingual SER performance for all the considered datasets including the
non-mainstream Urdu-language data, without requiring any labels.
Comment: Accepted at Affective Computing & Intelligent Interaction (ACII 2019).
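The abstract does not spell out the architecture, but the general adversarial idea it relies on (a shared encoder trained against a language discriminator so that the learned features carry no language information) can be sketched with a gradient-reversal-style update. Everything below is an illustrative assumption, not the authors' model: dimensions, learning rate, and the plain-numpy logistic heads are all toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, lam, lr = 10, 4, 1.0, 0.1

W_e = rng.standard_normal((h, d)) * 0.1   # shared encoder
w_emo = rng.standard_normal(h) * 0.1      # emotion classifier head
w_lang = rng.standard_normal(h) * 0.1     # language discriminator head

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, y_emo, y_lang):
    """One adversarial step: both heads minimise their own losses, while the
    encoder minimises the emotion loss and *maximises* the language loss
    (gradient reversal), pushing toward language-invariant features."""
    global W_e, w_emo, w_lang
    z = W_e @ x
    p_e, p_l = sigmoid(w_emo @ z), sigmoid(w_lang @ z)
    # gradients of the two binary cross-entropy losses w.r.t. z
    g_emo = (p_e - y_emo) * w_emo
    g_lang = (p_l - y_lang) * w_lang
    # heads: ordinary gradient descent on their own losses
    w_emo -= lr * (p_e - y_emo) * z
    w_lang -= lr * (p_l - y_lang) * z
    # encoder: emotion gradient kept, language gradient reversed
    W_e -= lr * np.outer(g_emo - lam * g_lang, x)
    return p_e, p_l

# toy run on a single synthetic utterance embedding
x = rng.standard_normal(d)
for _ in range(200):
    p_e, p_l = train_step(x, 1.0, 1.0)
```

The key line is the encoder update: the language discriminator's gradient enters with a flipped sign, so improving the discriminator never helps it at the encoder's expense. This is the mechanism that lets target-language data contribute without labels.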
Comprehensive Study of Automatic Speech Emotion Recognition Systems
Speech emotion recognition (SER) is the technology that recognizes psychological characteristics and feelings from speech signals. SER is challenging because of the considerable variation in arousal and valence levels across different languages. Various technical developments in artificial intelligence and signal processing have encouraged and made it possible to interpret emotions from speech. SER plays a vital role in remote communication. This paper offers a recent survey of SER using machine learning (ML)- and deep learning (DL)-based techniques. It focuses on the various feature representation and classification techniques used for SER, and further describes the databases and evaluation metrics used for speech emotion recognition.
Robust subspace learning for static and dynamic affect and behaviour modelling
Machine analysis of human affect and behavior in naturalistic contexts has attracted growing attention over the last decade from various disciplines, ranging from the social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model and predict, as well as simulate and synthesize, manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally- and socially-competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as ‘outliers’. We present machine learning methods that are robust to such gross, sparse noise. First, we deal with static analysis of face images, viewing the latter as a superposition of mutually incoherent, low-complexity components corresponding to facial attributes such as facial identity, expressions and the activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior.
Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates, annotated frame-by-frame in terms of real-valued conflict intensity, and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structure of affective and behavioral displays. We present a novel dynamic behavior analysis framework which models temporal dynamics in an explicit way, based on the natural assumption that continuous-time annotations of smoothly-varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on the prediction of dimensional conflict and affect, as well as on multi-object tracking from detections, validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted from small training sets of person-specific observations.
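The linear-dynamical-system assumption above can be made concrete with a toy simulation: per-frame behavioural cues u_t drive a low-order hidden state x_t, and the continuous annotation y_t is a linear read-out of that state. The matrices and dimensions below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical low-order system: 2 hidden states, 3 behavioural-cue inputs
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])            # slow, stable dynamics
B = rng.standard_normal((2, 3)) * 0.1  # how cues drive the hidden state
C = np.array([[1.0, 0.5]])             # read-out to the 1-D annotation

def simulate(U):
    """Roll the LDS forward: x_{t+1} = A x_t + B u_t, y_t = C x_t."""
    x = np.zeros(2)
    ys = []
    for u in U:
        ys.append(float(C @ x))
        x = A @ x + B @ u
    return np.array(ys)

U = rng.standard_normal((100, 3))  # noisy per-frame behavioural cues
y = simulate(U)                    # smoothly-varying continuous annotation
```

Even with white-noise inputs, the stable dynamics low-pass the signal, which is exactly why a low-complexity LDS is a natural model for smoothly-varying continuous annotations; the thesis's contribution is estimating (A, B, C) robustly when both cues and annotations are grossly corrupted.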
Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification
Forensic audio analysis for speaker verification offers unique challenges due
to location/scenario uncertainty and diversity mismatch between reference and
naturalistic field recordings. The lack of real naturalistic forensic audio
corpora with ground-truth speaker identity represents a major challenge in this
field. It is also difficult to directly employ small-scale domain-specific data
to train complex neural network architectures due to domain mismatch and loss
in performance. Alternatively, cross-domain speaker verification for multiple
acoustic environments is a challenging task which could advance research in
audio forensics. In this study, we introduce a CRSS-Forensics audio dataset
collected in multiple acoustic environments. We pre-train a CNN-based network
using the VoxCeleb data, followed by an approach which fine-tunes part of the
high-level network layers with clean speech from CRSS-Forensics. Based on this
fine-tuned model, we align domain-specific distributions in the embedding space
with the discrepancy loss and maximum mean discrepancy (MMD). This maintains
effective performance on the clean set while simultaneously generalizing the
model to other acoustic domains. Our results demonstrate that diverse
acoustic environments affect speaker verification performance, and that our
proposed cross-domain adaptation approach can significantly improve
results in this scenario.
Comment: To appear in INTERSPEECH 202
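The MMD criterion used above to align embedding distributions can be illustrated with a small self-contained computation. The RBF bandwidth, embedding dimension, and synthetic "domains" below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy under an RBF kernel:
    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
clean = rng.standard_normal((200, 2))          # stand-in for clean-set embeddings
same = rng.standard_normal((200, 2))           # same distribution: small MMD
shifted = rng.standard_normal((200, 2)) + 2.0  # mismatched acoustic domain: large MMD

print(rbf_mmd2(clean, same))     # near zero
print(rbf_mmd2(clean, shifted))  # clearly larger
```

Minimising this quantity between clean-set embeddings and embeddings from another acoustic domain pulls the two distributions together in the shared embedding space, which is the adaptation mechanism the abstract describes.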