Speech Emotion Recognition with Dual-Sequence LSTM Architecture
Speech Emotion Recognition (SER) has emerged as a critical component of the
next generation of human-machine interfacing technologies. In this work, we
propose a new dual-level model that predicts emotions based on both MFCC
features and mel-spectrograms produced from raw audio signals. Each utterance
is preprocessed into MFCC features and two mel-spectrograms at different
time-frequency resolutions. A standard LSTM processes the MFCC features, while
a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes
the two mel-spectrograms simultaneously. The outputs are later averaged to
produce a final classification of the utterance. Our proposed model achieves,
on average, a weighted accuracy of 72.7% and an unweighted accuracy of
73.3%, a 6% improvement over current state-of-the-art unimodal models, and is
comparable with multimodal models that leverage textual information as well as
audio signals.
Comment: Accepted by ICASSP 2020
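As a rough illustration of the preprocessing step the abstract describes, the sketch below extracts MFCC features and two mel-spectrograms at different time-frequency resolutions with librosa. The window sizes, hop lengths, and mel-bin counts are illustrative assumptions, not the paper's exact settings.

```python
# Per-utterance preprocessing sketch: MFCCs for the standard LSTM branch,
# plus two mel-spectrograms trading time resolution against frequency
# resolution for the DS-LSTM branch.
import librosa

def preprocess_utterance(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)

    # MFCC features for the standard LSTM.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # Short analysis window: finer in time, coarser in frequency.
    mel_fine_time = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                       hop_length=128, n_mels=64)
    )
    # Long analysis window: coarser in time, finer in frequency.
    mel_fine_freq = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=128, n_mels=128)
    )
    return mfcc, mel_fine_time, mel_fine_freq
```

The MFCC branch would feed the standard LSTM and the two mel-spectrograms the DS-LSTM, with the two classifiers' outputs averaged to produce the final prediction, as the abstract describes.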
Expediting TTS Synthesis with Adversarial Vocoding
Recent approaches in text-to-speech (TTS) synthesis employ neural network
strategies to vocode perceptually-informed spectrogram representations directly
into listenable waveforms. Such vocoding procedures create a computational
bottleneck in modern TTS pipelines. We propose an alternative approach which
utilizes generative adversarial networks (GANs) to learn mappings from
perceptually-informed spectrograms to simple magnitude spectrograms which can
be heuristically vocoded. Through a user study, we show that our approach
significantly outperforms naïve vocoding strategies while being hundreds of
times faster than neural network vocoders used in state-of-the-art TTS systems.
We also show that our method can be used to achieve state-of-the-art results in
unsupervised synthesis of individual words of speech.
Comment: Published as a conference paper at INTERSPEECH 2019
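To make the pipeline concrete, here is a minimal sketch of the decoding path under stated assumptions: `generator` is a hypothetical stand-in for a trained GAN generator mapping a mel-spectrogram to a linear-frequency magnitude spectrogram, and the heuristic vocoding step is Griffin-Lim phase estimation via librosa.

```python
# Sketch of the paper's decoding path: GAN maps a perceptually-informed
# (mel) spectrogram to a magnitude spectrogram, which is then vocoded
# heuristically with Griffin-Lim rather than a neural vocoder.
import librosa

def synthesize(mel, generator, n_fft=1024, hop_length=256, n_iter=60):
    # `generator` is a hypothetical trained GAN generator:
    # mel-spectrogram -> magnitude spectrogram, shape (1 + n_fft // 2, frames).
    magnitude = generator(mel)

    # Heuristic vocoding: estimate phase and invert to a waveform.
    return librosa.griffinlim(magnitude, n_iter=n_iter,
                              hop_length=hop_length, n_fft=n_fft)
```

Griffin-Lim costs only a few FFT passes per iteration, which is the source of the speed advantage the abstract claims over neural vocoders.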
Utilizing Domain Knowledge in End-to-End Audio Processing
End-to-end neural-network-based approaches to audio modelling are generally
outperformed by models trained on high-level data representations. In this
paper we present preliminary work showing that the first layers of a deep
convolutional neural network (CNN) can be trained to learn the commonly-used
log-scaled mel-spectrogram transformation. We then demonstrate that, upon
initializing the first layers of an end-to-end CNN
classifier with the learned transformation, convergence and performance on the
ESC-50 environmental sound classification dataset are similar to a CNN-based
model trained on the highly pre-processed log-scaled mel-spectrogram features.
Comment: Accepted at the ML4Audio workshop at NIPS 2017
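A hedged sketch of the two-step idea, with assumed hyperparameters throughout: first fit a small convolutional front-end to reproduce log-scaled mel-spectrogram targets (computed here with torchaudio), then reuse its weights to initialize the first layers of an end-to-end classifier. `MelFrontEnd` is an illustrative stand-in, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torchaudio

class MelFrontEnd(nn.Module):
    def __init__(self, n_fft=1024, hop=512, n_mels=64):
        super().__init__()
        # Strided conv plays the role of the STFT filterbank; a 1x1 conv
        # approximates the mel projection.
        self.stft_like = nn.Conv1d(1, n_fft // 2 + 1, kernel_size=n_fft,
                                   stride=hop, padding=n_fft // 2, bias=False)
        self.mel_like = nn.Conv1d(n_fft // 2 + 1, n_mels,
                                  kernel_size=1, bias=False)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        x = self.stft_like(wav).abs()
        return torch.log(self.mel_like(x).clamp(min=1e-5))

# Target transformation the front-end is trained to imitate.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024,
                                              hop_length=512, n_mels=64)

frontend = MelFrontEnd()
opt = torch.optim.Adam(frontend.parameters(), lr=1e-3)

wav = torch.randn(8, 1, 16000)                   # stand-in batch of raw audio
target = torch.log(to_mel(wav.squeeze(1)).clamp(min=1e-5))
pred = frontend(wav)

# Regress the learned front-end onto the log-mel targets.
T = min(pred.shape[-1], target.shape[-1])
loss = nn.functional.mse_loss(pred[..., :T], target[..., :T])
loss.backward()
opt.step()
```

After this regression converges, `frontend.state_dict()` would seed the first layers of the end-to-end classifier trained on raw audio, which is the initialization step the abstract evaluates on ESC-50.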
MelHuBERT: A simplified HuBERT on Mel spectrograms
Self-supervised models have had great success in learning speech
representations that can generalize to various downstream tasks. However, most
self-supervised models require a large amount of compute and multiple GPUs to
train, significantly hampering the development of self-supervised learning. In
an attempt to reduce the computation of training, we revisit the training of
HuBERT, a highly successful self-supervised model. We improve and simplify
several key components, including the loss function, input representation, and
training in multiple stages. Our model, MelHuBERT, achieves favorable
performance relative to HuBERT on phone recognition, speaker identification,
and automatic speech recognition, while saving 31.2% of the pre-training time,
or equivalently 33.5% of the MACs per one second of speech. The code and
pre-trained models are available at https://github.com/nervjack2/MelHuBERT.
Comment: ASRU 2023
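For context, this is a minimal sketch of the HuBERT-style masked-prediction objective that MelHuBERT applies to mel spectrograms: replace a span of input frames with a learned mask embedding, encode, and classify each masked frame into a pseudo-label such as a k-means cluster ID. The model size, mask placement, and random targets here are placeholders, not the released configuration.

```python
import torch
import torch.nn as nn

n_mels, n_clusters, d_model = 80, 512, 256

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Linear(n_mels, d_model)
head = nn.Linear(d_model, n_clusters)
mask_emb = nn.Parameter(torch.zeros(d_model))    # learned mask embedding

mel = torch.randn(4, 100, n_mels)                # (batch, frames, mels)
labels = torch.randint(n_clusters, (4, 100))     # frame-level cluster targets

x = embed(mel)
mask = torch.zeros(4, 100, dtype=torch.bool)
mask[:, 40:50] = True                            # mask one span per utterance
x[mask] = mask_emb                               # hide the masked frames

logits = head(encoder(x))
# Cross-entropy only on the masked positions, as in HuBERT-style training.
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()
```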