Speech Emotion Recognition with Dual-Sequence LSTM Architecture
Speech Emotion Recognition (SER) has emerged as a critical component of the
next generation of human-machine interfacing technologies. In this work, we
propose a new dual-level model that predicts emotions based on both MFCC
features and mel-spectrograms produced from raw audio signals. Each utterance
is preprocessed into MFCC features and two mel-spectrograms at different
time-frequency resolutions. A standard LSTM processes the MFCC features, while
a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes
the two mel-spectrograms simultaneously. The outputs are later averaged to
produce a final classification of the utterance. Our proposed model achieves,
on average, a weighted accuracy of 72.7% and an unweighted accuracy of
73.3%, a 6% improvement over current state-of-the-art unimodal models, and is
comparable with multimodal models that leverage textual information as well as
audio signals.
Comment: Accepted by ICASSP 2020
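As a rough illustration of the preprocessing step the abstract describes, the sketch below extracts MFCC features and two mel-spectrograms at different time-frequency resolutions with librosa. The window sizes, hop lengths, and mel-bin counts are illustrative assumptions, not the paper's exact settings.

```python
# Per-utterance preprocessing sketch: MFCCs for the standard LSTM branch,
# plus two mel-spectrograms trading time resolution against frequency
# resolution for the DS-LSTM branch.
import librosa

def preprocess_utterance(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)

    # MFCC features for the standard LSTM.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # Short analysis window: finer in time, coarser in frequency.
    mel_fine_time = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                       hop_length=128, n_mels=64)
    )
    # Long analysis window: coarser in time, finer in frequency.
    mel_fine_freq = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=128, n_mels=128)
    )
    return mfcc, mel_fine_time, mel_fine_freq
```

The MFCC branch would feed the standard LSTM and the two mel-spectrograms the DS-LSTM, with the two classifiers' outputs averaged to produce the final prediction, as the abstract describes.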
Expediting TTS Synthesis with Adversarial Vocoding
Recent approaches in text-to-speech (TTS) synthesis employ neural network
strategies to vocode perceptually-informed spectrogram representations directly
into listenable waveforms. Such vocoding procedures create a computational
bottleneck in modern TTS pipelines. We propose an alternative approach which
utilizes generative adversarial networks (GANs) to learn mappings from
perceptually-informed spectrograms to simple magnitude spectrograms which can
be heuristically vocoded. Through a user study, we show that our approach
significantly outperforms naïve vocoding strategies while being hundreds of
times faster than neural network vocoders used in state-of-the-art TTS systems.
We also show that our method can be used to achieve state-of-the-art results in
unsupervised synthesis of individual words of speech.
Comment: Published as a conference paper at INTERSPEECH 2019
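To make the pipeline concrete, here is a minimal sketch of the decoding path under stated assumptions: `generator` is a hypothetical stand-in for a trained GAN generator mapping a mel-spectrogram to a linear-frequency magnitude spectrogram, and the heuristic vocoding step is Griffin-Lim phase estimation via librosa.

```python
# Sketch of the paper's decoding path: GAN maps a perceptually-informed
# (mel) spectrogram to a magnitude spectrogram, which is then vocoded
# heuristically with Griffin-Lim rather than a neural vocoder.
import librosa

def synthesize(mel, generator, n_fft=1024, hop_length=256, n_iter=60):
    # `generator` is a hypothetical trained GAN generator:
    # mel-spectrogram -> magnitude spectrogram, shape (1 + n_fft // 2, frames).
    magnitude = generator(mel)

    # Heuristic vocoding: estimate phase and invert to a waveform.
    return librosa.griffinlim(magnitude, n_iter=n_iter,
                              hop_length=hop_length, n_fft=n_fft)
```

Griffin-Lim costs only a few FFT passes per iteration, which is the source of the speed advantage the abstract claims over neural vocoders.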
Utilizing Domain Knowledge in End-to-End Audio Processing
End-to-end neural-network-based approaches to audio modelling are generally
outperformed by models trained on high-level data representations. In this
paper we present preliminary work showing that the first layers of a deep
convolutional neural network (CNN) can be trained to learn the commonly-used
log-scaled mel-spectrogram transformation. We then demonstrate that, upon
initializing the first layers of an end-to-end CNN
classifier with the learned transformation, convergence and performance on the
ESC-50 environmental sound classification dataset are similar to a CNN-based
model trained on the highly pre-processed log-scaled mel-spectrogram features.
Comment: Accepted at the ML4Audio workshop at NIPS 2017
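A hedged sketch of the two-step idea, with assumed hyperparameters throughout: first fit a small convolutional front-end to reproduce log-scaled mel-spectrogram targets (computed here with torchaudio), then reuse its weights to initialize the first layers of an end-to-end classifier. `MelFrontEnd` is an illustrative stand-in, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torchaudio

class MelFrontEnd(nn.Module):
    def __init__(self, n_fft=1024, hop=512, n_mels=64):
        super().__init__()
        # Strided conv plays the role of the STFT filterbank; a 1x1 conv
        # approximates the mel projection.
        self.stft_like = nn.Conv1d(1, n_fft // 2 + 1, kernel_size=n_fft,
                                   stride=hop, padding=n_fft // 2, bias=False)
        self.mel_like = nn.Conv1d(n_fft // 2 + 1, n_mels,
                                  kernel_size=1, bias=False)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        x = self.stft_like(wav).abs()
        return torch.log(self.mel_like(x).clamp(min=1e-5))

# Target transformation the front-end is trained to imitate.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024,
                                              hop_length=512, n_mels=64)

frontend = MelFrontEnd()
opt = torch.optim.Adam(frontend.parameters(), lr=1e-3)

wav = torch.randn(8, 1, 16000)                   # stand-in batch of raw audio
target = torch.log(to_mel(wav.squeeze(1)).clamp(min=1e-5))
pred = frontend(wav)

# Regress the learned front-end onto the log-mel targets.
T = min(pred.shape[-1], target.shape[-1])
loss = nn.functional.mse_loss(pred[..., :T], target[..., :T])
loss.backward()
opt.step()
```

After this regression converges, `frontend.state_dict()` would seed the first layers of the end-to-end classifier trained on raw audio, which is the initialization step the abstract evaluates on ESC-50.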
MelHuBERT: A simplified HuBERT on Mel spectrograms
Self-supervised models have had great success in learning speech
representations that can generalize to various downstream tasks. However, most
self-supervised models require a large amount of compute and multiple GPUs to
train, significantly hampering the development of self-supervised learning. In
an attempt to reduce the computation of training, we revisit the training of
HuBERT, a highly successful self-supervised model. We improve and simplify
several key components, including the loss function, input representation, and
training in multiple stages. Our model, MelHuBERT, achieves favorable
performance relative to HuBERT on phone recognition, speaker identification,
and automatic speech recognition, while saving 31.2% of the pre-training time,
or equivalently 33.5% of the MACs per one second of speech. The code and
pre-trained models are available at https://github.com/nervjack2/MelHuBERT.
Comment: ASRU 2023
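For context, this is a minimal sketch of the HuBERT-style masked-prediction objective that MelHuBERT applies to mel spectrograms: replace a span of input frames with a learned mask embedding, encode, and classify each masked frame into a pseudo-label such as a k-means cluster ID. The model size, mask placement, and random targets here are placeholders, not the released configuration.

```python
import torch
import torch.nn as nn

n_mels, n_clusters, d_model = 80, 512, 256

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Linear(n_mels, d_model)
head = nn.Linear(d_model, n_clusters)
mask_emb = nn.Parameter(torch.zeros(d_model))    # learned mask embedding

mel = torch.randn(4, 100, n_mels)                # (batch, frames, mels)
labels = torch.randint(n_clusters, (4, 100))     # frame-level cluster targets

x = embed(mel)
mask = torch.zeros(4, 100, dtype=torch.bool)
mask[:, 40:50] = True                            # mask one span per utterance
x[mask] = mask_emb                               # hide the masked frames

logits = head(encoder(x))
# Cross-entropy only on the masked positions, as in HuBERT-style training.
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()
```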