2 research outputs found
Bandwidth Extension on Raw Audio via Generative Adversarial Networks
Neural network-based methods have recently demonstrated state-of-the-art
results on image synthesis and super-resolution tasks, in particular by using
variants of generative adversarial networks (GANs) with supervised feature
losses. Nevertheless, previous feature loss formulations rely on the
availability of large auxiliary classifier networks, and labeled datasets that
enable such classifiers to be trained. Furthermore, there has been
comparatively little work to explore the applicability of GAN-based methods to
domains other than images and video. In this work we explore a GAN-based method
for audio processing, and develop a convolutional neural network architecture
to perform audio super-resolution. In addition to several new architectural
building blocks for audio processing, a key component of our approach is the
use of an autoencoder-based loss that enables training in the GAN framework,
with feature losses derived from unlabeled data. We explore the impact of our
architectural choices, and demonstrate significant improvements over previous
works in terms of both objective and perceptual quality
Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension
This paper presents a waveform modeling and generation method using
hierarchical recurrent neural networks (HRNN) for speech bandwidth extension
(BWE). Different from conventional BWE methods which predict spectral
parameters for reconstructing wideband speech waveforms, this BWE method models
and predicts waveform samples directly without using vocoders. Inspired by
SampleRNN which is an unconditional neural audio generator, the HRNN model
represents the distribution of each wideband or high-frequency waveform sample
conditioned on the input narrowband waveform samples using a neural network
composed of long short-term memory (LSTM) layers and feed-forward (FF) layers.
The LSTM layers form a hierarchical structure and each layer operates at a
specific temporal resolution to efficiently capture long-span dependencies
between temporal sequences. Furthermore, additional conditions, such as the
bottleneck (BN) features derived from narrowband speech using a deep neural
network (DNN)-based state classifier, are employed as auxiliary input to
further improve the quality of generated wideband speech. The experimental
results of comparing several waveform modeling methods show that the HRNN-based
method can achieve better speech quality and run-time efficiency than the
dilated convolutional neural network (DCNN)-based method and the plain
sample-level recurrent neural network (SRNN)-based method. Our proposed method
also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in
terms of the subjective quality of the reconstructed wideband speech.Comment: Accepted by IEEE Transactions on Audio, Speech and Language
Processin