10 research outputs found

    Non-Intrusive Speech Intelligibility Prediction

    Get PDF

    Nonintrusive Speech Intelligibility Prediction Using Convolutional Neural Networks

    Get PDF

    Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering

    Full text link
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, independently for each frequency band. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-relation method, using the recursive least squares criterion. Instead of the complex-valued CTF convolution model, we use a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude; this is only a coarse approximation of the former model, but is shown to be more robust against CTF perturbations. Based on this nonnegative model, we propose an online STFT magnitude inverse filtering method. The inverse filters of the CTF magnitude are formulated based on the multiple-input/output inverse theorem (MINT), and adaptively estimated based on the gradient descent criterion. Finally, the inverse filtering is applied to the STFT magnitude of the microphone signals, yielding an estimate of the STFT magnitude of the source signal. Experiments on both speech enhancement and automatic speech recognition demonstrate that the proposed method can effectively suppress reverberation, even in the difficult case of a moving speaker. Comment: Paper submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing. IEEE Signal Processing Letters, 201
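The magnitude inverse-filtering step can be sketched with a small NumPy example. Given the per-channel CTF magnitudes of one frequency band, MINT seeks inverse filters whose channel convolutions sum to a unit impulse; a batch least-squares solve is used below in place of the paper's adaptive gradient-descent update, and the filter lengths are illustrative assumptions:

```python
import numpy as np

def conv_matrix(h, g_len):
    """Toeplitz matrix C such that C @ g == np.convolve(h, g) for len(g) == g_len."""
    C = np.zeros((len(h) + g_len - 1, g_len))
    for j in range(g_len):
        C[j:j + len(h), j] = h
    return C

def mint_magnitude_inverse(ctf_mags, g_len):
    """MINT-style inverse filters for the CTF magnitudes of M channels:
    choose g_1..g_M so that sum_m conv(|H_m|, g_m) approximates a unit impulse.
    (Batch least-squares stand-in; the paper estimates the filters adaptively,
    frame by frame, by gradient descent.)"""
    C = np.hstack([conv_matrix(h, g_len) for h in ctf_mags])
    d = np.zeros(C.shape[0])
    d[0] = 1.0                        # target: perfect dereverberation (unit impulse)
    g, *_ = np.linalg.lstsq(C, d, rcond=None)
    return np.split(g, len(ctf_mags)), d - C @ g
```

Applying the resulting filters to the STFT magnitudes of the microphone signals, one frequency band at a time, then yields the source-magnitude estimate.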

    Máscaras tempo-frequência para a redução de ruído aditivo em implantes cocleares

    Get PDF
Doctoral thesis (tese de doutorado), Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia Elétrica, Florianópolis, 2019.
Abstract: Cochlear implants (CIs) are devices that partially restore hearing in subjects with profound deafness through electrical stimulation of the auditory nerve. Even though the information they provide is limited by poor time and frequency resolution, cochlear implant users may score up to 80% in speech intelligibility experiments. However, this performance is significantly reduced in the presence of noise, which characterizes most everyday acoustic scenarios. Signal processing techniques for noise reduction are therefore an alternative for improving the acoustic perception of cochlear implant users. The main techniques proposed for noise reduction in cochlear implants consist of time-frequency masks, notably the binary mask (BM) and the Wiener filter (WF) and its variants (parametric and constrained). In this work, a new unified theory of time-frequency masks is presented. By setting two parameters, different suppression functions may be realized, comprising well-established masks such as the BM and the WF. Another advantage of the proposed theory is that the masks derived from it are optimal in some sense, unlike heuristic proposals such as the parametric Wiener filter (WP). Moreover, the proposed mask can be adjusted over a wider range of suppression functions than the WP.
Extensive numerical simulations show that the proposed mask and the WP may improve speech perception by CI users in noisy environments. Nevertheless, these masks do not take specific device characteristics into account. Most CI devices present only the signal's temporal-envelope information to the user, discarding phase information entirely. In this context, a new time-domain filter is proposed to estimate the temporal envelope of each speech sub-band. Numerical simulations show that this second proposed filter leads to better estimates of the speech envelope than the WF. Psychoacoustic experiments, both with normal-hearing subjects using a CI simulator and with actual CI users, indicate that the proposed envelope estimator leads to better intelligibility than the WF, mainly for signals corrupted at SNR < −5 dB.
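The thesis's exact two-parameter family is not given in the abstract, so the sketch below uses a common parametric Wiener-type gain, G(ξ) = (ξ/(ξ+α))^β on the a-priori SNR ξ, purely to illustrate how two parameters span a range of suppression functions: α = β = 1 recovers the Wiener gain, and hard thresholding of the SNR gives the binary mask.

```python
import numpy as np

def parametric_mask(snr_prior, alpha=1.0, beta=1.0):
    """Two-parameter suppression gain G = (xi / (xi + alpha))**beta applied to a
    time-frequency map of the a-priori SNR xi (linear scale). alpha shifts the
    suppression knee, beta controls its steepness; alpha = beta = 1 gives the
    classical Wiener gain xi / (xi + 1). Illustrative family only, not the
    thesis's exact parameterization."""
    xi = np.asarray(snr_prior, dtype=float)
    return (xi / (xi + alpha)) ** beta

def binary_mask(snr_prior, threshold=1.0):
    """Ideal binary mask: keep a T-F cell iff its local SNR exceeds the threshold."""
    return (np.asarray(snr_prior) > threshold).astype(float)
```

Either gain is applied by pointwise multiplication with the noisy STFT magnitude before resynthesis (or, in a CI, before envelope extraction per electrode channel).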

    A non-intrusive method for estimating binaural speech intelligibility from noise-corrupted signals captured by a pair of microphones

    Get PDF
A non-intrusive method is introduced to predict binaural speech intelligibility in noise directly from signals captured by a pair of microphones. The approach combines blind source separation and localisation techniques with an intrusive objective intelligibility measure (OIM). Unlike classic intrusive OIMs, this method therefore requires neither a clean reference speech signal nor knowledge of the source locations. The proposed approach is able to estimate intelligibility in stationary and fluctuating noises, when the noise masker is presented as a point or diffuse source and is spatially separated from the target speech source on a horizontal plane. The performance of the proposed method was evaluated in two rooms. When predicting subjective intelligibility measured as word recognition rate, the method showed reasonable predictive accuracy, with correlation coefficients above 0.82, comparable to that of a reference intrusive OIM in most conditions. The proposed approach offers a solution for fast binaural intelligibility prediction and therefore has practical potential to be deployed in situations where on-site speech intelligibility is a concern.
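Predictive accuracy here is reported as a correlation coefficient between predicted scores and measured word recognition rates; a minimal NumPy version of that evaluation step:

```python
import numpy as np

def pearson_r(predicted, measured):
    """Pearson correlation between predicted intelligibility scores and
    subjectively measured word recognition rates."""
    p = np.asarray(predicted, dtype=float)
    m = np.asarray(measured, dtype=float)
    p = p - p.mean()
    m = m - m.mean()
    return float(p @ m / np.sqrt((p @ p) * (m @ m)))
```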

    Speech Intelligibility Prediction for Hearing Aid Systems

    Get PDF

    Non-intrusive speech quality prediction using modulation energies and LSTM-network

    Get PDF
Many signal processing algorithms have been proposed to improve the quality of speech recorded in the presence of noise and reverberation. Perceptual measures, i.e., listening tests, are usually considered the most reliable way to evaluate the quality of speech processed by such algorithms, but they are costly and time-consuming. Consequently, speech enhancement algorithms are often evaluated using signal-based measures, which can be either intrusive or non-intrusive. As the computation of intrusive measures requires a reference signal, only non-intrusive measures can be used in applications for which the clean speech signal is not available. However, many existing non-intrusive measures correlate poorly with perceived speech quality, particularly when applied over a wide range of algorithms or acoustic conditions. In this paper, we propose a novel non-intrusive measure of the quality of processed speech that combines modulation energy features and a recurrent neural network using long short-term memory cells. We collected a dataset of perceptually evaluated signals representing several acoustic conditions and algorithms and used this dataset to train and evaluate the proposed measure. Results show that the proposed measure yields higher correlation with perceptual speech quality than benchmark intrusive and non-intrusive measures when considering various categories of algorithms. Although the proposed measure is sensitive to mismatch between training and testing, results show that it is a useful approach for evaluating specific algorithms over a wide range of acoustic conditions and may, thus, become particularly useful for real-time selection of speech enhancement algorithm settings.
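A minimal stand-in for the modulation-energy features (the paper's exact feature set is not specified in the abstract): the temporal envelope is taken from the analytic signal, and the energy of its spectrum is summed over a few modulation-frequency bands, whose edges below are illustrative. A recurrent network such as the paper's LSTM would then consume frame-wise sequences of such features.

```python
import numpy as np

def modulation_energies(x, fs, edges=(2, 4, 8, 16, 32)):
    """Energy of the temporal-envelope spectrum in a few modulation-frequency
    bands (Hz). Band edges are illustrative, not the paper's."""
    n = len(x)
    # FFT-based Hilbert transform -> analytic signal -> envelope
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    env = np.abs(np.fft.ifft(X * h))
    env = env - env.mean()                        # remove DC before the modulation FFT
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])
```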

    Blind MultiChannel Identification and Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function

    Get PDF
This paper addresses the problems of blind channel identification and multichannel equalization for speech dereverberation and noise reduction. The time-domain cross-relation method is not suitable for blind room impulse response identification, due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse responses are approximately represented by the convolutive transfer functions (CTFs) with far fewer coefficients. The CTFs suffer from common zeros caused by the oversampled STFT. We propose to identify the CTFs based on the STFT with oversampled signals and critically sampled CTFs, which is a good compromise between the frequency aliasing of the signals and the common-zeros problem of the CTFs. In addition, a normalization of the CTFs is proposed to remove the gain ambiguity across sub-bands. In the STFT domain, the identified CTFs are used for multichannel equalization, in which the sparsity of speech signals is exploited. We propose to perform inverse filtering by minimizing the ℓ1-norm of the source signal, with the relaxed ℓ2-norm fitting error between the microphone signals and the convolution of the estimated source signal and the CTFs used as a constraint. This method is advantageous in that the noise can be reduced by relaxing the ℓ2-norm to a tolerance corresponding to the noise power, and the tolerance can be set automatically. The experiments confirm the efficiency of the proposed method even under conditions with high reverberation levels and intense noise. Comment: 13 pages, 5 figures, 5 tables
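The cross-relation identity behind the identification step, conv(x1, h2) = conv(x2, h1) for a two-channel system, can be sketched in the time domain with NumPy (the paper applies the same idea per STFT band to the CTFs): the stacked filter vector spans the null space of a data matrix and is recovered here by SVD, up to a common scale.

```python
import numpy as np

def conv_matrix(x, L):
    """Toeplitz matrix T such that T @ h == np.convolve(x, h) for len(h) == L."""
    T = np.zeros((len(x) + L - 1, L))
    for j in range(L):
        T[j:j + len(x), j] = x
    return T

def cross_relation_identify(x1, x2, L):
    """Blind two-channel identification: since conv(x1, h2) == conv(x2, h1),
    the stacked vector [h2; h1] lies in the null space of [T(x1), -T(x2)].
    Returns (h1, h2) up to a common scale factor (sign included)."""
    A = np.hstack([conv_matrix(x1, L), -conv_matrix(x2, L)])
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    v = Vt[-1]                 # right singular vector of the smallest singular value
    return v[L:], v[:L]        # the null vector is ordered [h2; h1]
```

With noisy signals the smallest singular vector gives a least-squares estimate; the scale (and sign) ambiguity is inherent to any blind method.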

    SkipConvGAN: Monaural Speech Dereverberation using Generative Adversarial Networks via Complex Time-Frequency Masking

    Full text link
With the advancements in deep learning, the performance of speech enhancement systems in the presence of background noise has improved significantly. However, improving robustness against reverberation is still a work in progress, as reverberation tends to cause loss of formant structure due to smearing effects in time and frequency. A wide range of deep learning-based systems either enhance the magnitude response and reuse the distorted phase, or enhance the complex spectrogram using a complex time-frequency mask. Though these approaches have demonstrated satisfactory performance, they do not directly address the formant structure lost to reverberation. We believe that retrieving the formant structure can help improve the efficiency of existing systems. In this study, we propose SkipConvGAN, an extension of our prior work SkipConvNet. The proposed system's generator network tries to estimate an efficient complex time-frequency mask, while the discriminator network aids in driving the generator to restore the lost formant structure. We evaluate the performance of our proposed system on simulated and real recordings of reverberant speech from the single-channel task of the REVERB challenge corpus. The proposed system shows a consistent improvement across multiple room configurations over other deep learning-based generative adversarial frameworks. Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 30)
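Complex time-frequency masking amounts to one complex multiplication per T-F cell, which corrects magnitude and phase jointly (unlike a real-valued magnitude mask). A toy NumPy illustration with an oracle mask; the numeric values below are made up:

```python
import numpy as np

def apply_complex_mask(stft, mask):
    """Apply a complex T-F mask: each cell's magnitude AND phase are modified
    by a single complex multiplication."""
    return stft * mask

# Toy 2x2 "spectrograms": Y is reverberant, S is the clean target (made-up values).
Y = np.array([[1 + 1j, 2 - 1j], [0.5j, -1 + 0.5j]])
S = np.array([[0.5 + 0.2j, 1.0 + 0j], [0.1j, -0.5 + 0j]])
M = S / Y    # oracle complex mask that maps Y exactly onto S (illustration only)
```

In SkipConvGAN the generator must estimate such a mask from Y alone; the oracle division above is only the target it is trained towards.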

    Updating the SRMR-CI Metric for Improved Intelligibility Prediction for Cochlear Implant Users

    No full text