7,463 research outputs found

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

    Blind Normalization of Speech From Different Channels

    We show how to construct a channel-independent representation of speech that has propagated through a noisy reverberant channel. This is done by blindly rescaling the cepstral time series by a non-linear function, with the form of this scale function being determined by previously encountered cepstra from that channel. The rescaled form of the time series is an invariant property of it in the following sense: it is unaffected if the time series is transformed by any time-independent invertible distortion. Because a linear channel with stationary noise and impulse response transforms cepstra in this way, the new technique can be used to remove the channel dependence of a cepstral time series. In experiments, the method achieved greater channel-independence than cepstral mean normalization, and it was comparable to the combination of cepstral mean normalization and spectral subtraction, despite the fact that no measurements of channel noise or reverberations were required (unlike spectral subtraction). Comment: 25 pages, 7 figures
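For orientation, the baseline named in this abstract, cepstral mean normalization, can be sketched in a few lines. This is not the blind rescaling method the abstract describes. It covers only the simplest case, where a noise-free linear time-invariant channel adds the same offset to every cepstral frame. The toy frames and the `channel_offset` values below are made up for illustration.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over time from a list of cepstral frames."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy data (hypothetical): 4 frames of 3 cepstral coefficients each.
clean = [[1.0, 0.5, -0.2], [0.8, 0.1, 0.3], [1.2, -0.4, 0.0], [0.6, 0.2, -0.1]]

# Assumption: a noise-free linear time-invariant channel, which shifts every
# cepstral frame by the same constant vector.
channel_offset = [0.7, -0.3, 0.25]
distorted = [[c + h for c, h in zip(f, channel_offset)] for f in clean]

norm_clean = cepstral_mean_normalize(clean)
norm_distorted = cepstral_mean_normalize(distorted)

# Largest difference between the two normalized series; it should be
# zero up to floating-point rounding.
max_diff = max(
    abs(a - b)
    for fa, fb in zip(norm_clean, norm_distorted)
    for a, b in zip(fa, fb)
)
print(max_diff)
```

Mean subtraction cancels a constant offset but not additive noise or more general distortions, which is the gap the abstract says the blind rescaling addresses.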

    Impact of single-microphone dereverberation on DNN-based meeting transcription systems

    Over the past few decades, a range of front-end techniques have been proposed to improve the robustness of automatic speech recognition systems against environmental distortion. While these techniques are effective for small tasks consisting of carefully designed data sets, especially when used with a classical acoustic model, there has been limited evidence that they are useful for a state-of-the-art system with large-scale realistic data. This paper focuses on reverberation as a type of distortion and investigates the degree to which dereverberation processing can improve the performance of various forms of acoustic models based on deep neural networks (DNNs) in a challenging meeting transcription task using a single distant microphone. Experimental results show that dereverberation improves the recognition performance regardless of the acoustic model structure and the type of the feature vectors input into the neural networks, providing additional relative improvements of 4.7% and 4.1% to our best configured speaker-independent and speaker-adaptive DNN-based systems, respectively. Xie Chen was funded by Toshiba Research Europe Ltd, Cambridge Research Lab. This is the accepted manuscript of a paper published in the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-9 May 2014, written by Yoshioka, T.; Xie Chen; Gales, M.J.F.

    On the effect of SNR and superdirective beamforming in speaker diarisation in meetings

    This paper examines the effect of sensor performance on speaker diarisation in meetings and investigates the use of more advanced beamforming techniques, beyond the typically employed delay-sum beamformer, for mitigating the effects of poorer sensor performance. We present superdirective beamforming and investigate how different time difference of arrival (TDOA) smoothing and beamforming techniques influence the performance of state-of-the-art diarisation systems. We produced and transcribed a new corpus of meetings recorded in the instrumented meeting room using a high SNR analogue and a newly developed low SNR digital MEMS microphone array (DMMA.2). This research demonstrates that TDOA smoothing has a significant effect on the diarisation error rate and that simple noise reduction and beamforming schemes suffice to overcome audio signal degradation due to the lower SNR of modern MEMS microphones. Index Terms: speaker diarisation in meetings, digital MEMS microphone array, time difference of arrival (TDOA), superdirective beamforming
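As a rough illustration of the delay-sum baseline mentioned in this abstract, the sketch below aligns each microphone channel by its TDOA and averages the result. It does not implement superdirective beamforming or TDOA smoothing. The signal and the integer sample delays are hypothetical, and the channels are assumed to be pure, noise-free delayed copies of the source.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its integer sample delay, then average across channels."""
    # Only keep the samples for which every aligned channel has data.
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]


# Hypothetical source signal and TDOAs (in samples, relative to microphone 0).
source = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, 0.0, 0.0]
delays = [0, 2, 1]

# Assumption: each microphone records a pure delayed copy of the source.
channels = [[0.0] * d + source for d in delays]

output = delay_and_sum(channels, delays)
print(output)
```

With correct delays the aligned copies add coherently and the source is recovered. With real recordings, uncorrelated sensor noise would be partly averaged out, and errors in the TDOA estimates would misalign the channels.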