Convolutional RNN: an Enhanced Model for Extracting Features from Sequential Data
Traditional convolutional layers extract features from patches of data by
applying a non-linearity to an affine function of the input. We propose a model
that enhances this feature extraction process for the case of sequential data,
by feeding patches of the data into a recurrent neural network and using the
outputs or hidden states of the recurrent units to compute the extracted
features. By doing so, we exploit the fact that a window containing a few
frames of the sequential data is itself a sequence, and this additional
structure might encapsulate valuable information. In addition, we allow for
more steps of computation in the feature extraction process, which is
potentially beneficial as an affine function followed by a non-linearity can
result in features that are too simple. Using our convolutional recurrent layers, we obtain improved performance on two audio classification tasks compared to traditional convolutional layers. TensorFlow code for the convolutional recurrent layers is publicly available at https://github.com/cruvadom/Convolutional-RNN
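The layer is easy to sketch. Below is a minimal, illustrative PyTorch version of a convolutional recurrent layer (the authors' released implementation is in TensorFlow, at the repository above); the class name, the choice of a GRU, and the use of its final hidden state as the window feature are assumptions of this sketch, not details confirmed by the abstract:

```python
import torch
import torch.nn as nn

class ConvRNN1d(nn.Module):
    """Convolutional-recurrent layer sketch (hypothetical): each
    length-`kernel_size` window of the input is run through a small GRU,
    and the GRU's final hidden state serves as that window's feature,
    replacing the affine-map-plus-non-linearity of a standard convolution."""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        self.rnn = nn.GRU(in_channels, out_channels, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, time), the same layout nn.Conv1d expects
        windows = x.unfold(2, self.kernel_size, self.stride)
        # windows: (batch, channels, n_windows, kernel_size)
        b, c, n, k = windows.shape
        # Treat each window as a short sequence of k frames with c features.
        seqs = windows.permute(0, 2, 3, 1).reshape(b * n, k, c)
        _, h = self.rnn(seqs)                  # h: (1, b*n, out_channels)
        feats = h.squeeze(0).reshape(b, n, -1)
        return feats.permute(0, 2, 1)          # (batch, out_channels, n_windows)
```

Because the recurrence unrolls over the k frames inside each window, the layer performs several steps of computation per extracted feature instead of a single affine map followed by a non-linearity.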
Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks
Automatically assessing emotional valence in human speech has historically
been a difficult task for machine learning algorithms. The subtle changes in
the voice of the speaker that are indicative of positive or negative emotional
states are often "overshadowed" by voice characteristics relating to emotional
intensity or emotional activation. In this work we explore a representation
learning approach that automatically derives discriminative representations of
emotional speech. In particular, we investigate two machine learning strategies
to improve classifier performance: (1) utilization of unlabeled data using a
deep convolutional generative adversarial network (DCGAN), and (2) multitask
learning. In our extensive experiments, we leverage a multitask-annotated emotional corpus as well as a large unlabeled meeting corpus (around 100 hours). Our speaker-independent classification experiments show that the use of unlabeled data in particular improves classifier performance, with both fully supervised baseline approaches outperformed considerably. We improve the classification of emotional valence on a discrete 5-point scale to 43.88% and on a 3-point scale to 49.80%, which is competitive with state-of-the-art performance.
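As a rough illustration of the multitask strategy, the sketch below uses one shared encoder with separate heads for valence and an auxiliary activation task. All layer sizes, the spectrogram input format, and the pairing of valence with activation are assumptions for illustration; the paper's DCGAN component (learning representations from the unlabeled meeting corpus) is omitted here:

```python
import torch
import torch.nn as nn

class MultitaskEmotionNet(nn.Module):
    """Hypothetical multitask sketch: a shared encoder feeding two
    classification heads, so the valence task can borrow signal from
    a related task such as emotional activation."""
    def __init__(self, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden), nn.ReLU(),
        )
        self.valence_head = nn.Linear(hidden, 5)     # discrete 5-point scale
        self.activation_head = nn.Linear(hidden, 5)  # auxiliary task (assumed)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, frames) spectrogram excerpts
        h = self.encoder(spec)
        return self.valence_head(h), self.activation_head(h)
```

During training the two cross-entropy losses would simply be added, so the shared encoder receives gradients from both tasks; this is the mechanism by which auxiliary labels can sharpen the valence representation.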
Evaluating raw waveforms with deep learning frameworks for speech emotion recognition
Speech emotion recognition is a challenging task in the speech processing field. For this reason, the feature extraction process is of crucial importance for representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks, without any feature extraction stage, for the recognition of emotions on six different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely the mel-scale spectrogram and mel-frequency cepstral coefficients, are combined with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques. Support vector machine, decision tree, naive Bayes, and random forest models are evaluated as machine learning algorithms, while majority voting and stacking are assessed as ensemble learning techniques. Moreover, convolutional neural networks, long short-term memory networks, and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with the machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, a comparison with state-of-the-art studies is carried out. Based on the experimental results, the CNN model surpasses existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, thereby setting a new state of the art. In speaker-independent audio categorization, the proposed model achieves 90.34% accuracy on EMO-DB with the CNN model, 90.42% on RAVDESS with the CNN model, 99.48% on TESS with the LSTM model, 69.72% on CREMA with the CNN model, and 85.76% on SAVEE with the CNN model.

Comment: 14 pages, 6 figures, 8 tables
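To make the raw-waveform setup concrete, here is a minimal sketch of a 1D CNN that classifies emotions directly from audio samples, with no spectrogram or MFCC stage; the kernel widths, layer sizes, and seven-class output are illustrative assumptions and do not reproduce the paper's network:

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    """Hypothetical raw-waveform classifier: the wide first kernel acts
    as a learned filterbank applied directly to the audio samples."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(128, 256, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # average over time -> fixed-size vector
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, wav):
        # wav: (batch, 1, num_samples), e.g. a few seconds of 16 kHz mono audio
        h = self.features(wav).squeeze(-1)   # (batch, 256)
        return self.classifier(h)
```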