8,404 research outputs found

    Single-Channel Speech Enhancement Based on Deep Neural Networks

    Get PDF
    Speech enhancement (SE) aims to improve the speech quality of the degraded speech. Recently, researchers have resorted to deep-learning as a primary tool for speech enhancement, which often features deterministic models adopting supervised training. Typically, a neural network is trained as a mapping function to convert some features of noisy speech to certain targets that can be used to reconstruct clean speech. These methods of speech enhancement using neural networks have been focused on the estimation of spectral magnitude of clean speech considering that estimating spectral phase with neural networks is difficult due to the wrapping effect. As an alternative, complex spectrum estimation implicitly resolves the phase estimation problem and has been proven to outperform spectral magnitude estimation. In the first contribution of this thesis, a fully convolutional neural network (FCN) is proposed for complex spectrogram estimation. Stacked frequency-dilated convolution is employed to obtain an exponential growth of the receptive field in frequency domain. The proposed network also features an efficient implementation that requires much fewer parameters as compared with conventional deep neural network (DNN) and convolutional neural network (CNN) while still yielding a comparable performance. Consider that speech enhancement is only useful in noisy conditions, yet conventional SE methods often do not adapt to different noisy conditions. In the second contribution, we proposed a model that provides an automatic "on/off" switch for speech enhancement. It is capable of scaling its computational complexity under different signal-to-noise ratio (SNR) levels by detecting clean or near-clean speech which requires no processing. By adopting information maximizing generative adversarial network (InfoGAN) in a deterministic, supervised manner, we incorporate the functionality of SNR-indicator into the model that adds little additional cost to the system. We evaluate the proposed SE methods with two objectives: speech intelligibility and application to automatic speech recognition (ASR). Experimental results have shown that the CNN-based model is applicable for both objectives while the InfoGAN-based model is more useful in terms of speech intelligibility. The experiments also show that SE for ASR may be more challenging than improving the speech intelligibility, where a series of factors, including training dataset and neural network models, would impact the ASR performance

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Get PDF
    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that stills remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure
    corecore