STFT Spectral Loss for Training a Neural Speech Waveform Model
This paper proposes a new loss using short-time Fourier transform (STFT)
spectra for training a high-performance neural speech waveform model
that predicts raw continuous speech waveform samples directly. Not only
amplitude spectra but also phase spectra obtained from generated speech
waveforms are used to calculate the proposed loss. We also mathematically show
that training of the waveform model on the basis of the proposed loss can be
interpreted as maximum likelihood training under the assumption that the amplitude
and phase spectra of generated speech waveforms follow Gaussian and von Mises
distributions, respectively. Furthermore, this paper presents a simple network
architecture as the speech waveform model, which is composed of uni-directional
long short-term memories (LSTMs) and an auto-regressive structure. Experimental
results showed that the proposed neural model synthesized high-quality speech
waveforms.
Comment: Submitted to the 2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP
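
The maximum likelihood reading in the abstract above can be illustrated with a small sketch. With a fixed variance, a Gaussian negative log-likelihood on the amplitude reduces to a squared error, and with a fixed concentration, a von Mises negative log-likelihood on the phase reduces to a negative cosine of the phase difference (both up to constants). The code below is a minimal illustration under those assumptions, not the paper's exact formulation: the function names `stft` and `stft_spectral_loss`, the Hann window, the frame sizes, and the `phase_weight` parameter are all hypothetical choices made for this example.

```python
import cmath
import math


def stft(signal, frame_len=16, hop=8):
    """Naive STFT with a Hann window; returns a list of frames of complex bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def stft_spectral_loss(generated, target, phase_weight=1.0):
    """Mean amplitude + phase loss over all time-frequency bins.

    Amplitude term: 0.5 * (|G| - |T|)^2   (Gaussian NLL, fixed variance).
    Phase term:     1 - cos(arg G - arg T) (von Mises NLL, fixed concentration).
    Both are assumptions of this sketch, stated up to additive constants.
    """
    amp_loss = 0.0
    phase_loss = 0.0
    count = 0
    for gen_frame, ref_frame in zip(stft(generated), stft(target)):
        for g, t in zip(gen_frame, ref_frame):
            amp_loss += 0.5 * (abs(g) - abs(t)) ** 2
            phase_loss += 1.0 - math.cos(cmath.phase(g) - cmath.phase(t))
            count += 1
    return (amp_loss + phase_weight * phase_loss) / count


if __name__ == "__main__":
    target = [math.sin(2 * math.pi * 3 * n / 64) for n in range(64)]
    generated = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
    print(stft_spectral_loss(target, target))
    print(stft_spectral_loss(generated, target))
```

The cosine form of the phase term is periodic, so phase differences that are multiples of 2π incur no penalty, which a plain squared error on the raw phase would not guarantee.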
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the
goal is to manipulate the timbre of a sound sample from one instrument to match
another instrument while preserving other musical content, such as pitch,
rhythm, and loudness. In principle, one could apply image-based style transfer
techniques to a time-frequency representation of an audio signal, but this
depends on having a representation that allows independent manipulation of
timbre as well as high-quality waveform generation. We introduce TimbreTron, a
method for musical timbre transfer which applies "image" domain style transfer
to a time-frequency representation of the audio signal, and then produces a
high-quality waveform using a conditional WaveNet synthesizer. We show that the
Constant Q Transform (CQT) representation is particularly well-suited to
convolutional architectures due to its approximate pitch equivariance. Based on
human perceptual evaluations, we confirmed that TimbreTron recognizably
transferred the timbre while otherwise preserving the musical content, for both
monophonic and polyphonic samples.
Comment: 17 pages, published as a conference paper at ICLR 201
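
The pitch equivariance mentioned in the abstract above can be made concrete with a small sketch. CQT bins are spaced geometrically, so transposing a tone by a fixed interval moves it by the same number of bins wherever it starts, whereas on a linearly spaced STFT axis the shift grows with frequency. The example below uses idealized pure tones; the helper names and the values of `f_min`, `bins_per_octave`, `sample_rate`, and `n_fft` are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def cqt_bin(freq_hz, f_min=32.70, bins_per_octave=12):
    """Index of the nearest CQT bin, with centers f_min * 2**(k / bins_per_octave)."""
    return round(bins_per_octave * math.log2(freq_hz / f_min))


def stft_bin(freq_hz, sample_rate=16000, n_fft=1024):
    """Index of the nearest linearly spaced STFT bin."""
    return round(freq_hz * n_fft / sample_rate)


def transpose(freq_hz, semitones):
    """Shift a frequency by a number of equal-tempered semitones."""
    return freq_hz * 2 ** (semitones / 12)


if __name__ == "__main__":
    for f in (220.0, 440.0, 880.0):
        shifted = transpose(f, 7)  # up a perfect fifth
        print(
            f,
            "CQT shift:", cqt_bin(shifted) - cqt_bin(f),
            "STFT shift:", stft_bin(shifted) - stft_bin(f),
        )
```

Because the CQT shift is the same at every starting pitch, a convolutional filter that slides along the frequency axis sees a transposed note as a translated copy of the same pattern, which is the property the abstract refers to.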