6,863 research outputs found
A Phase Vocoder based on Nonstationary Gabor Frames
We propose a new algorithm for time stretching music signals based on the
theory of nonstationary Gabor frames (NSGFs). The algorithm extends the
techniques of the classical phase vocoder (PV) by incorporating adaptive
time-frequency (TF) representations and adaptive phase locking. The adaptive TF
representations imply good time resolution for the onsets of attack transients
and good frequency resolution for the sinusoidal components. We estimate the
phase values only at peak channels and the remaining phases are then locked to
the values of the peaks in an adaptive manner. During attack transients we keep
the stretch factor equal to one and we propose a new strategy for determining
which channels are relevant for reinitializing the corresponding phase values.
In contrast to previously published algorithms we use a non-uniform NSGF to
obtain a low redundancy of the corresponding TF representation. We show that
with just three times as many TF coefficients as signal samples, artifacts such
as phasiness and transient smearing can be greatly reduced compared to the
classical PV. The proposed algorithm is tested on both synthetic and real world
signals and compared with state of the art algorithms in a reproducible manner.Comment: 10 pages, 6 figure
Wavenet based low rate speech coding
Traditional parametric coding of speech facilitates low rate but provides
poor reconstruction quality because of the inadequacy of the model used. We
describe how a WaveNet generative speech model can be used to generate high
quality speech from the bit stream of a standard parametric coder operating at
2.4 kb/s. We compare this parametric coder with a waveform coder based on the
same generative model and show that approximating the signal waveform incurs a
large rate penalty. Our experiments confirm the high performance of the WaveNet
based coder and show that the speech produced by the system is able to
additionally perform implicit bandwidth extension and does not significantly
impair recognition of the original speaker for the human listener, even when
that speaker has not been used during the training of the generative model.Comment: 5 pages, 2 figure
- …