2 research outputs found

    Significance of Phase in DNN based speech enhancement algorithms

    No full text
    Most speech enhancement algorithms rely on estimating the magnitude spectrum of the clean speech signal from that of the noisy speech signal using either spectral regression or spectral masking. Because of the difficulty of processing the phase of the short-time Fourier transform (STFT), the noisy phase is reused when synthesizing the waveform from the enhanced magnitude spectrum. To demonstrate the significance of phase in speech enhancement, we compare the phase obtained from different reconstruction methods, such as Griffin-Lim and minimum phase, with the gold phase (clean phase). In this work, a spectral magnitude mask (SMM) is estimated using deep neural networks to enhance the magnitude spectrum of the speech signal. The experimental results show that the gold phase outperforms the phase reconstruction methods on all objective measures, illustrating the significance of enhancing the noisy phase in speech enhancement.
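    To make the comparison concrete, the following minimal Python sketch (not the authors' code) shows how an SMM-enhanced magnitude spectrum can be resynthesized with three different phases: the reused noisy phase, a Griffin-Lim estimate, and the oracle clean ("gold") phase. The DNN is replaced by a placeholder estimate_smm function, and the file names, STFT settings, and identity mask are illustrative assumptions; the clean and noisy signals are assumed time-aligned and of equal length.

        # Sketch: compare phases used to resynthesize an SMM-enhanced magnitude spectrum.
        import numpy as np
        import librosa

        N_FFT, HOP = 512, 128

        noisy, sr = librosa.load("noisy.wav", sr=16000)   # assumed input file
        clean, _ = librosa.load("clean.wav", sr=16000)    # reference, used only for the gold phase

        noisy_stft = librosa.stft(noisy, n_fft=N_FFT, hop_length=HOP)
        clean_stft = librosa.stft(clean, n_fft=N_FFT, hop_length=HOP)

        def estimate_smm(noisy_mag):
            """Placeholder for the DNN that regresses the spectral magnitude mask."""
            return np.ones_like(noisy_mag)  # identity mask, for illustration only

        enhanced_mag = estimate_smm(np.abs(noisy_stft)) * np.abs(noisy_stft)

        # 1) Reuse the noisy phase (the common practice the abstract questions).
        y_noisy_phase = librosa.istft(enhanced_mag * np.exp(1j * np.angle(noisy_stft)),
                                      hop_length=HOP)

        # 2) Estimate the phase from the magnitude alone with Griffin-Lim.
        y_griffin_lim = librosa.griffinlim(enhanced_mag, n_iter=32,
                                           hop_length=HOP, n_fft=N_FFT)

        # 3) Use the gold (clean) phase, available only in oracle experiments.
        y_gold_phase = librosa.istft(enhanced_mag * np.exp(1j * np.angle(clean_stft)),
                                     hop_length=HOP)

    Scoring the three resyntheses with objective measures (e.g., PESQ) is what separates the contribution of the phase from that of the enhanced magnitude.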

    Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement

    No full text
    In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN)-based speech enhancement. The parameters of the DNN can be estimated by minimizing a mask loss, but this does not restore the pitch harmonics, especially at higher frequencies. We propose to restore the pitch harmonics in the spectral domain by minimizing a cepstral loss around the pitch peak. Restoring the cepstral pitch peak, in turn, helps restore the pitch harmonics in the enhanced spectrum. The proposed cepstral pitch-peak loss acts as an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The network parameters are estimated using a combination of the mask loss and the cepstral pitch-peak loss, and we show that this combination offers the complementary advantages of enhancing both voiced and unvoiced regions. DNN-based methods rely primarily on the network architecture, and hence prediction accuracy improves with increasing architectural complexity; however, lower-complexity models are essential for real-time processing systems. In this work, we propose a compact model using a sliding-window attention network (SWAN). The SWAN is trained to regress the spectral magnitude mask (SMM) from the noisy speech signal. Our experimental results demonstrate that the proposed approach achieves performance comparable to state-of-the-art noncausal and causal speech enhancement methods with much lower computational complexity. Our three-layered noncausal SWAN achieves 2.99 PESQ on the Valentini database with only 10^9 floating-point operations (FLOPs). © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
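    The combined objective described above can be illustrated with a short PyTorch sketch. This is an assumed implementation, not the paper's code: the per-frame pitch-peak quefrency indices, the SMM targets, the quefrency half-width, and the loss weight alpha are all hypothetical inputs. It computes the real cepstrum of the enhanced and clean magnitude spectra, compares a small quefrency window around the pitch peak, and adds that term to an MSE mask loss.

        # Sketch: mask loss combined with a cepstral pitch-peak loss.
        import torch
        import torch.nn.functional as F

        def cepstral_pitch_peak_loss(enh_mag, clean_mag, peak_idx, half_width=3, n_fft=512):
            """L1 loss between real cepstra of enhanced/clean spectra around the pitch peak.

            enh_mag, clean_mag: (batch, frames, n_fft // 2 + 1) magnitude spectra
            peak_idx:           (batch, frames) long tensor of pitch-peak quefrency indices
            """
            eps = 1e-8
            # Real cepstrum: inverse real FFT of the log-magnitude spectrum.
            enh_cep = torch.fft.irfft(torch.log(enh_mag + eps), n=n_fft, dim=-1)
            cln_cep = torch.fft.irfft(torch.log(clean_mag + eps), n=n_fft, dim=-1)

            # Gather a small quefrency window centred on the pitch peak of each frame.
            offsets = torch.arange(-half_width, half_width + 1, device=enh_mag.device)
            idx = (peak_idx.unsqueeze(-1) + offsets).clamp(0, n_fft - 1)  # (B, T, 2w+1)
            enh_win = torch.gather(enh_cep, dim=-1, index=idx)
            cln_win = torch.gather(cln_cep, dim=-1, index=idx)
            return F.l1_loss(enh_win, cln_win)

        def total_loss(pred_mask, target_mask, enh_mag, clean_mag, peak_idx, alpha=0.5):
            """Mask loss plus weighted cepstral pitch-peak loss (weight alpha is assumed)."""
            mask_loss = F.mse_loss(pred_mask, target_mask)
            return mask_loss + alpha * cepstral_pitch_peak_loss(enh_mag, clean_mag, peak_idx)

    Because the penalized quefrency window tracks the pitch of each voiced frame, the extra term behaves like an adaptive comb filter, while the mask loss alone continues to drive enhancement of unvoiced regions.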