327 research outputs found

    Bidirectional truncated recurrent neural networks for efficient speech denoising

    Get PDF
    We propose a bidirectional truncated recurrent neural network architecture for speech denoising. Recent work showed that deep recurrent neural networks perform well at speech denoising tasks and outperform feed forward architectures [1]. However, recurrent neural networks are difficult to train and their simulation does not allow for much parallelization. Given the increasing availability of parallel computing architectures like GPUs this is disadvantageous. The architecture we propose aims to retain the positive properties of recurrent neural networks and deep learning while remaining highly parallelizable. Unlike a standard recurrent neural network, it processes information from both past and future time steps. We evaluate two variants of this architecture on the Aurora2 task for robust ASR where they show promising results. The models outperform the ETSI2 advanced front end and the SPLICE algorithm under matching noise conditions.We propose a bidirectional truncated recurrent neural network architecture for speech denoising. Recent work showed that deep recurrent neural networks perform well at speech denoising tasks and outperform feed forward architectures [1]. However, recurrent neural networks are difficult to train and their simulation does not allow for much parallelization. Given the increasing availability of parallel computing architectures like GPUs this is disadvantageous. The architecture we propose aims to retain the positive properties of recurrent neural networks and deep learning while remaining highly parallelizable. Unlike a standard recurrent neural network, it processes information from both past and future time steps. We evaluate two variants of this architecture on the Aurora2 task for robust ASR where they show promising results. The models outperform the ETSI2 advanced front end and the SPLICE algorithm under matching noise conditions.P

    High-dimensional sequence transduction

    Full text link
    We investigate the problem of transforming an input sequence into a high-dimensional output sequence in order to transcribe polyphonic audio music into symbolic notation. We introduce a probabilistic model based on a recurrent neural network that is able to learn realistic output distributions given the input and we devise an efficient algorithm to search for the global mode of that distribution. The resulting method produces musically plausible transcriptions even under high levels of noise and drastically outperforms previous state-of-the-art approaches on five datasets of synthesized sounds and real recordings, approximately halving the test error rate

    The estimation and application of unnormalized statistical models

    Get PDF

    On the Effectiveness of Low-Rank Matrix Factorization for LSTM Model Compression

    Get PDF

    Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

    Full text link
    Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we proposed MP-SENet, a novel Speech Enhancement Network which explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by time-frequency Transformers along both time and frequency dimensions. The encoder aims to encode time-frequency representations derived from the input distorted magnitude and phase spectra. The decoder comprises dual-stream magnitude and phase decoders, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude estimation architecture and a phase parallel estimation architecture, respectively. To train the MP-SENet model effectively, we define multi-level loss functions, including mean square error and perceptual metric loss of magnitude spectra, anti-wrapping loss of phase spectra, as well as mean square error and consistency loss of short-time complex spectra. Experimental results demonstrate that our proposed MP-SENet excels in high-quality speech enhancement across multiple tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it successfully avoids the bidirectional compensation effect between the magnitude and phase, leading to a better harmonic restoration. Notably, for the speech denoising task, the MP-SENet yields a state-of-the-art performance with a PESQ of 3.60 on the public VoiceBank+DEMAND dataset.Comment: Submmited to IEEE Transactions on Audio, Speech and Language Processin
    corecore