End-to-end Recurrent Denoising Autoencoder Embeddings for Speaker Identification
Speech 'in-the-wild' is a handicap for speaker recognition systems due to the
variability induced by real-life conditions, such as environmental noise and
emotions in the speaker. Taking advantage of representation learning, in this
paper we aim to design a recurrent denoising autoencoder that extracts robust
speaker embeddings from noisy spectrograms to perform speaker identification.
The proposed end-to-end architecture uses a feedback loop to encode information
regarding the speaker into low-dimensional representations extracted by a
spectrogram denoising autoencoder. We augment the data by additively
corrupting clean speech with real-life environmental noise and use a
database of real stressed speech. We show that the joint
optimization of the denoiser and the speaker identification module
outperforms both independent optimization of the two modules and hand-crafted
features under stress and noise distortions.
Comment: 8 pages + 2 of references + 5 of images. Submitted on Monday 20th of
July to Elsevier Signal Processing Short Communication
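The additive corruption described in this abstract can be illustrated with a minimal sketch. The abstract does not say how noise levels were chosen, so the target signal-to-noise ratio, the function names `mix_at_snr` and `measured_snr_db`, and the synthetic sine and Gaussian signals below are all assumptions made for illustration, not the authors' pipeline.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean samples so the mixture has the requested SNR (dB).

    Hypothetical helper: the abstract only states that clean speech is
    additively corrupted with environmental noise.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise segment is silent")
    # Choose gain so that p_clean / (gain**2 * p_noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR of a mixture, treating (noisy - clean) as the noise component."""
    p_clean = sum(c * c for c in clean)
    p_resid = sum((y - c) ** 2 for c, y in zip(clean, noisy))
    return 10 * math.log10(p_clean / p_resid)


# Synthetic stand-ins (assumed): a 220 Hz tone as "speech", Gaussian "noise".
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```

In practice the noise would come from recorded environmental audio rather than a random generator, and the SNR would typically be drawn from a range so the model sees varied conditions.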
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
Phase information has a significant impact on speech perceptual quality and
intelligibility. However, existing speech enhancement methods encounter
limitations in explicit phase estimation due to the non-structural nature and
wrapping characteristics of the phase, leading to a bottleneck in enhanced
speech quality. To overcome this issue, we propose
MP-SENet, a novel Speech Enhancement Network which explicitly enhances
Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec
architecture in which the encoder and decoder are bridged by time-frequency
Transformers along both time and frequency dimensions. The encoder aims to
encode time-frequency representations derived from the input distorted
magnitude and phase spectra. The decoder comprises dual-stream magnitude and
phase decoders, directly enhancing magnitude and wrapped phase spectra by
incorporating a magnitude estimation architecture and a phase parallel
estimation architecture, respectively. To train the MP-SENet model effectively,
we define multi-level loss functions, including mean square error and
perceptual metric loss of magnitude spectra, anti-wrapping loss of phase
spectra, as well as mean square error and consistency loss of short-time
complex spectra. Experimental results demonstrate that our proposed MP-SENet
excels in high-quality speech enhancement across multiple tasks, including
speech denoising, dereverberation, and bandwidth extension. Compared to
existing phase-aware speech enhancement methods, it successfully avoids the
bidirectional compensation effect between the magnitude and phase, leading to a
better harmonic restoration. Notably, for the speech denoising task, the
MP-SENet yields state-of-the-art performance with a PESQ of 3.60 on the
public VoiceBank+DEMAND dataset.
Comment: Submitted to IEEE Transactions on Audio, Speech and Language
Processing
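The abstract names an anti-wrapping loss for phase spectra but gives no formula. The sketch below shows one plausible form, under the assumption that the loss measures the phase error modulo 2π so that predictions differing from the target by a full turn are not penalized. The function names `anti_wrap` and `anti_wrapped_phase_loss` are hypothetical, and the paper's actual definition may differ or may also cover quantities derived from the phase.

```python
import math


def anti_wrap(x):
    """Map a phase difference to its principal-value magnitude in [0, pi].

    Assumed form: subtract the nearest multiple of 2*pi, then take the
    absolute value, so that an error of 2*pi counts as zero.
    """
    return abs(x - 2 * math.pi * round(x / (2 * math.pi)))


def anti_wrapped_phase_loss(predicted, target):
    """Mean anti-wrapped absolute error between two lists of phases (radians)."""
    if len(predicted) != len(target):
        raise ValueError("phase lists must have equal length")
    return sum(anti_wrap(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Phases of 3.1 and -3.1 rad are 6.2 rad apart numerically, but only
# 2*pi - 6.2 (about 0.083) rad apart on the unit circle.
example = anti_wrapped_phase_loss([3.1], [-3.1])
```

A plain absolute or squared error on wrapped phases would treat the example pair as far apart, which is the wrapping problem the abstract refers to.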