Wavenet based low rate speech coding
Traditional parametric coding of speech achieves low bit rates but provides
poor reconstruction quality because of the inadequacy of the model used. We
describe how a WaveNet generative speech model can be used to generate high
quality speech from the bit stream of a standard parametric coder operating at
2.4 kb/s. We compare this parametric coder with a waveform coder based on the
same generative model and show that approximating the signal waveform incurs a
large rate penalty. Our experiments confirm the high performance of the
WaveNet-based coder and show that the system additionally performs implicit
bandwidth extension and does not significantly impair a human listener's
recognition of the original speaker, even when that speaker was not used
during the training of the generative model.
Comment: 5 pages, 2 figure
Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder
In order to efficiently transmit and store speech signals, speech codecs
create a minimally redundant representation of the input signal which is then
decoded at the receiver with the best possible perceptual quality. In this work
we demonstrate that a neural network architecture based on VQ-VAE with a
WaveNet decoder can be used to perform very low bit-rate speech coding with
high reconstruction quality. A prosody-transparent and speaker-independent
model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits
perceptual quality which is around halfway between the MELP codec at 2.4 kbps
and the AMR-WB codec at 23.05 kbps. In addition, when trained on high-quality
recorded speech with the test speaker included in the training set, a model
coding speech at 1.6 kbps produces output of similar perceptual quality to that
generated by AMR-WB at 23.05 kbps.
Comment: ICASSP 201
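To put the bit rates quoted in this abstract side by side, the short sketch below computes how many bytes one minute of coded speech would occupy at each rate, and the ratio between two rates. It assumes 1 kbps = 1000 bits per second; the dictionary labels and function names are illustrative and do not come from the paper.

```python
# Bit rates quoted in the abstract, in kbps (assumed: 1 kbps = 1000 bit/s).
RATES_KBPS = {
    "VQ-VAE + WaveNet": 1.6,
    "MELP": 2.4,
    "AMR-WB": 23.05,
}


def bytes_for(duration_s: float, rate_kbps: float) -> float:
    """Payload size in bytes for `duration_s` seconds at `rate_kbps`."""
    return duration_s * rate_kbps * 1000 / 8


def rate_ratio(high: str, low: str) -> float:
    """How many times larger the bit rate of `high` is than that of `low`."""
    return RATES_KBPS[high] / RATES_KBPS[low]


if __name__ == "__main__":
    for name, rate in RATES_KBPS.items():
        print(f"{name}: {bytes_for(60, rate):.0f} bytes per minute")
    print(f"AMR-WB / VQ-VAE rate ratio: {rate_ratio('AMR-WB', 'VQ-VAE + WaveNet'):.2f}")
```

Under these assumptions the 1.6 kbps model uses roughly one fourteenth of the AMR-WB rate, which is the gap the abstract's quality comparison spans.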
Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been demonstrated in recent work. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such corpus limitations and limited
model capability can easily arise in some speech generation
applications, such as voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity.
Comment: 5 pages, 6 figures. Proc. Interspeech, 201
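As background for the linear predictive coding (LPC) constraint mentioned in this abstract, the sketch below shows generic LPC analysis: estimating predictor coefficients from a signal's autocorrelation with the Levinson-Durbin recursion. It is not the authors' constrained generation technique, which the abstract does not describe in detail; the function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of x (sum of lagged products) for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for a[1..order] such that x[n] is approximated by
    sum_k a[k] * x[n - k], given autocorrelation values r[0..order].
    Returns (coefficients, final prediction error)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


if __name__ == "__main__":
    # Autocorrelation of an idealised first-order process with coefficient 0.5.
    coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
    print(coeffs, err)
```

In a vocoder setting, coefficients like these describe the short-term spectral envelope, which is the kind of information an LPC-based constraint could draw on.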