Speech Enhancement Using Speech Synthesis Techniques
Traditional speech enhancement systems reduce noise by modifying the noisy signal to make it more like a clean signal, an approach that suffers from two problems: under-suppression of noise and over-suppression of speech. Both introduce distortions that hurt the quality of the enhanced signal. We propose instead to use speech synthesis techniques to build a higher-quality speech enhancement system: synthesizing clean speech conditioned on the noisy signal can produce outputs that are both noise-free and high quality. We first show that we can replace the noisy speech with a clean resynthesis drawn from a previously recorded clean-speech dictionary from the same speaker (concatenative resynthesis). Next, we show that with a speech synthesizer (vocoder) we can create a clean resynthesis of the noisy speech for more than one speaker. We term this parametric resynthesis (PR). PR generates better prosody from noisy speech than a TTS system that uses textual information only. Additionally, we can exploit the high-quality speech generation of neural vocoders for better enhancement quality. When trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. Finally, we show that with neural vocoders we can achieve better objective signal and overall quality than state-of-the-art speech enhancement systems, and better subjective quality than an oracle mask-based system.
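The core idea of parametric resynthesis is that a model predicts clean acoustic features (e.g. F0 and amplitude tracks) from the noisy input, and a vocoder then renders audio from those features alone, so no noise survives. The minimal sketch below uses a toy sinusoidal vocoder as a stand-in for the neural vocoders in the abstract; the function name and parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def sinusoidal_vocoder(f0_frames, amp_frames, frame_len=160, sr=16000):
    """Toy vocoder: render per-frame F0 (Hz) and amplitude tracks as a sinusoid.

    In PR, a network would predict these clean features from noisy speech;
    here we just synthesize from given tracks to show the resynthesis step.
    """
    phase = 0.0
    out = []
    for f0, amp in zip(f0_frames, amp_frames):
        t = np.arange(frame_len)
        ph = phase + 2.0 * np.pi * f0 / sr * (t + 1)
        out.append(amp * np.sin(ph))
        phase = ph[-1]  # keep phase continuous across frame boundaries
    return np.concatenate(out)

# 50 frames of a steady 220 Hz tone at amplitude 0.8 (0.5 s at 16 kHz).
y = sinusoidal_vocoder([220.0] * 50, [0.8] * 50)
```

Because the output is synthesized purely from the predicted parameters, its quality is bounded by the vocoder rather than by the noise level of the input.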
Resynthesis of Spatial Room Impulse Response Tails with Anisotropic Multi-Slope Decays
Spatial room impulse responses (SRIRs) capture room acoustics with directional information. SRIRs measured in coupled rooms and in spaces with non-uniform absorption distribution may exhibit anisotropic reverberation decays and multiple decay slopes. However, noisy measurements with low signal-to-noise ratios pose problems for analysis and reproduction in practice. This paper presents a method for resynthesizing the late decay of anisotropic SRIRs, effectively removing noise from SRIR measurements. The method accounts for both multi-slope decays and directional reverberation. A spherical filter bank extracts directionally constrained signals from the Ambisonic input, which are then analyzed and parameterized in terms of multiple exponential decays and a noise floor. The noisy late reverberation is resynthesized from the estimated parameters using modal synthesis, and the restored SRIR is reconstructed as Ambisonic signals. The method is evaluated both numerically and perceptually; the evaluation shows that SRIRs can be denoised with minimal error as long as parts of the decay slope are above the noise level, with signal-to-noise ratios as low as 40 dB in the presented experiment. The method can be used to increase the perceived spatial audio quality of noise-impaired SRIRs.
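The parameterization described above models each directional energy decay as a sum of exponential slopes plus a constant noise floor. A minimal sketch of that decay model follows; the parameter names and the 60 dB-per-reverberation-time convention are standard assumptions, not the paper's exact formulation.

```python
import numpy as np

def edc_model(t, amps, decay_times, noise_floor):
    """Energy decay model: sum of exponential slopes plus a noise floor.

    Each slope with reverberation time T (seconds) decays by 60 dB over T,
    i.e. by a factor of 10**-6, giving the rate ln(1e6) / T.
    Illustrative parameterization; the paper's exact form may differ.
    """
    edc = np.full_like(t, noise_floor, dtype=float)
    for a, T in zip(amps, decay_times):
        edc += a * np.exp(-t * np.log(1e6) / T)
    return edc

# Two slopes (fast 0.3 s, slow 1.2 s) over a -60 dB noise floor.
t = np.linspace(0.0, 2.0, 1000)
curve = edc_model(t, amps=[1.0, 0.1], decay_times=[0.3, 1.2], noise_floor=1e-6)
```

Fitting these parameters per direction, then resynthesizing the tail without the noise-floor term, is what removes the measurement noise while preserving the multi-slope, anisotropic decay.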
Generative De-Quantization for Neural Speech Codec via Latent Diffusion
In low-bitrate speech coding, end-to-end networks aim to learn compact yet expressive features and a powerful decoder within a single model. Such a demanding design increases complexity and degrades speech quality. In this paper, we propose to separate the representation-learning and information-reconstruction tasks. We leverage an end-to-end codec to learn low-dimensional discrete tokens and employ a latent diffusion model to de-quantize the coded features into a high-dimensional continuous space, relieving the decoder of the burden of de-quantizing and upsampling. To mitigate over-smooth generation, we introduce midway-infilling, which applies less noise reduction and stronger conditioning. In ablation studies, we investigate the hyperparameters of midway-infilling and of latent diffusion spaces of different dimensions. Subjective listening tests show that our model outperforms the state of the art at two low bitrates, 1.5 and 3 kbps. Code and samples are available on our webpage.
Comment: Submitted to ICASSP 202
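Midway-infilling, as described, starts the reverse diffusion at an intermediate step rather than from pure noise: the coarse decoded features are noised to that step's level, and denoising proceeds from there, which reduces the total noise removed and keeps the output tied to the conditioning. The DDPM-style sketch below is a generic illustration with a placeholder denoiser; all names and the schedule are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def midway_infill(coarse, denoise_step, betas, t_start):
    """Start reverse diffusion at intermediate step t_start, not pure noise.

    The coarse estimate is noised to step t_start's level (DDPM forward
    marginal), then denoised step by step. `denoise_step` stands in for the
    conditioned denoising network; this is a sketch, not the paper's method.
    """
    alpha_bar = np.cumprod(1.0 - betas)
    # Forward-noise the coarse conditioning signal to level t_start.
    x = (np.sqrt(alpha_bar[t_start]) * coarse
         + np.sqrt(1.0 - alpha_bar[t_start]) * rng.standard_normal(coarse.shape))
    for t in range(t_start, -1, -1):
        x = denoise_step(x, t)  # one reverse step (a network call in practice)
    return x

betas = np.linspace(1e-4, 0.02, 50)
coarse = np.zeros(8)
# Placeholder denoiser that simply shrinks the signal each step.
x = midway_infill(coarse, lambda x, t: 0.9 * x, betas, t_start=25)
```

Starting midway trades off: fewer denoising steps means less smoothing of fine detail, while the stronger conditioning keeps the result faithful to the coded tokens.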
Analysis, Visualization, and Transformation of Audio Signals Using Dictionary-based Methods
- …