
    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at the delay that corresponds to multiple pitch periods. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are employed by a `speech fragment decoder' which applies `missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments across different conditions, which results in significantly better recognition accuracy.
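The correlogram described above is, per channel, a running autocorrelation: a periodic source with pitch period P produces a peak at lag P in every channel it dominates. A minimal sketch of that per-channel computation (not the authors' implementation; the framing, normalisation, and lag range are illustrative assumptions):

```python
import numpy as np

def correlogram_channel(frames, max_lag):
    """Per-frame normalised autocorrelation for one auditory channel.

    frames:  (n_frames, frame_len) array of windowed samples from one
             filterbank channel.
    Returns: (n_frames, max_lag) array; a periodic source with pitch
             period P samples yields a peak near lag P.
    """
    n_frames, frame_len = frames.shape
    ac = np.zeros((n_frames, max_lag))
    for t in range(n_frames):
        x = frames[t] - frames[t].mean()
        full = np.correlate(x, x, mode="full")[frame_len - 1:]  # lags 0..N-1
        if full[0] > 0:
            full = full / full[0]  # normalise so lag 0 == 1
        ac[t] = full[:max_lag]
    return ac

# Toy check: a 100 Hz tone at 8 kHz has a pitch period of 80 samples.
fs = 8000
t = np.arange(fs // 10) / fs
sig = np.sin(2 * np.pi * 100 * t).reshape(1, -1)
ac = correlogram_channel(sig, 120)
peak = int(np.argmax(ac[0, 40:]) + 40)  # peak lag near 80 (= fs / 100)
```

Grouping then amounts to collecting, at each frame, the channels whose autocorrelations peak at a common lag, which is the pitch-related structure the paper exploits.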

    Speech enhancement using auditory filterbank.

    This thesis presents a novel subband noise reduction technique for speech enhancement, termed Adaptive Subband Wiener Filtering (ASWF), based on a critical-band gammatone filterbank. The ASWF is derived from a generalized Subband Wiener Filtering (SWF) equation and reduces noise according to the estimated signal-to-noise ratio (SNR) in each auditory channel and in each time frame. The design of a subband noise estimator, suitable for some real-life noise environments, is also presented. This denoising technique would be beneficial for some auditory-based speech and audio applications, e.g. to enhance the robustness of sound processing in cochlear implants. Comprehensive objective and subjective tests demonstrated that the proposed technique is effective in improving the perceptual quality of the enhanced speech. This technique offers a time-domain noise reduction scheme using a linear filterbank structure and can be combined with other filterbank algorithms (such as for speech recognition and coding) as a front-end processing step immediately after the analysis filterbank, to increase the robustness of the respective application.
    Dept. of Electrical and Computer Engineering. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .G85. Source: Masters Abstracts International, Volume: 44-03, page: 1452. Thesis (M.A.Sc.)--University of Windsor (Canada), 2005
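The core of any subband Wiener scheme is the per-channel, per-frame gain G = SNR / (1 + SNR) applied to each filterbank output. A minimal sketch under simple assumptions (the thesis's actual ASWF formula, noise estimator, and gammatone analysis are not reproduced here; `noise_power` is assumed given):

```python
import numpy as np

def wiener_gain(snr):
    """Classic Wiener gain G = SNR / (1 + SNR): ~0 in noise-dominated
    channel/frame cells, ~1 in speech-dominated ones."""
    return snr / (1.0 + snr)

def subband_wiener_gains(subband_power, noise_power):
    """Per-channel, per-frame gains for a subband Wiener filter.

    subband_power: (n_channels, n_frames) short-time power of each
                   (e.g. gammatone) channel of the noisy signal.
    noise_power:   (n_channels,) estimated noise power per channel.
    """
    clean_est = np.maximum(subband_power - noise_power[:, None], 0.0)
    snr = clean_est / np.maximum(noise_power[:, None], 1e-12)
    return wiener_gain(snr)

# Gain behaviour at 0, unity, and high SNR.
g = wiener_gain(np.array([0.0, 1.0, 9.0]))  # -> 0.0, 0.5, 0.9
```

Because each channel is attenuated rather than re-synthesised in the spectral domain, the scheme stays a time-domain linear-filterbank operation, which is what lets it sit directly after the analysis filterbank of another application.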

    DNN-Based Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement

    Multi-frame approaches for single-microphone speech enhancement, e.g., the multi-frame minimum-variance-distortionless-response (MVDR) filter, are able to exploit speech correlations across neighboring time frames. In contrast to single-frame approaches such as the Wiener gain, it has been shown that multi-frame approaches achieve a substantial noise reduction with hardly any speech distortion, provided that an accurate estimate of the correlation matrices and especially the speech interframe correlation (IFC) vector is available. Typical estimation procedures for the correlation matrices and the IFC vector require an estimate of the speech presence probability (SPP) in each time-frequency bin. In this paper, we propose to use a bi-directional long short-term memory deep neural network (DNN) to estimate a speech mask and a noise mask for each time-frequency bin, from which two different SPP estimates are derived. Aiming at a robust performance, the DNN is trained on various noise types and signal-to-noise ratios. Experimental results show that the multi-frame MVDR in combination with the proposed data-driven SPP estimator yields an increased speech quality compared to a state-of-the-art model-based estimator.
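Given the estimated correlation matrix and IFC vector, the multi-frame MVDR filter for one time-frequency bin has the standard closed form w = R⁻¹γ / (γᴴ R⁻¹γ), which minimises output noise power subject to the distortionless constraint wᴴγ = 1. A sketch of just that final step (the mask/SPP estimation that the paper actually contributes is not shown; `noise_cov` and `gamma` are assumed to be already estimated):

```python
import numpy as np

def multiframe_mvdr(noise_cov, gamma):
    """Multi-frame MVDR filter for one time-frequency bin.

    noise_cov: (N, N) noise correlation matrix over N consecutive frames.
    gamma:     (N,) speech interframe correlation (IFC) vector, with
               gamma[0] == 1 by convention.
    Returns the length-N filter w with w^H gamma = 1 (distortionless),
    applied to the stacked noisy coefficients of the N frames.
    """
    r_inv_g = np.linalg.solve(noise_cov, gamma)  # R^{-1} gamma
    return r_inv_g / (gamma.conj() @ r_inv_g)

# Sanity check: with white noise and no interframe speech correlation,
# the filter degenerates to passing the current frame only.
N = 4
gamma = np.array([1.0, 0.0, 0.0, 0.0])
w = multiframe_mvdr(np.eye(N), gamma)
```

With a non-trivial IFC vector the remaining N-1 taps become non-zero, which is where the extra noise reduction over a single-frame Wiener gain comes from, and why an accurate SPP-driven estimate of γ matters.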