
    Responses in left inferior frontal gyrus are altered for speech‐in‐noise processing, but not for clear speech in autism

    Introduction: Autistic individuals often have difficulties recognizing what another person is saying in noisy conditions, such as in a crowded classroom or a restaurant. The underlying neural mechanisms of this speech perception difficulty are unclear. In typically developed individuals, three cerebral cortex regions are particularly related to speech-in-noise perception: the left inferior frontal gyrus (IFG), the right insula, and the left inferior parietal lobule (IPL). Here, we tested whether responses in these cerebral cortex regions are altered during speech-in-noise perception in autism.
    Methods: Seventeen autistic adults and 17 typically developed controls (matched pairwise on age, sex, and IQ) performed an auditory-only speech recognition task during functional magnetic resonance imaging (fMRI). Speech was presented either with noise (noise condition) or without noise (no-noise condition, i.e., clear speech).
    Results: In the left IFG, blood-oxygenation-level-dependent (BOLD) responses were higher in the control group than in the autism group for recognizing speech in noise compared to clear speech. For this contrast, both groups had similar response magnitudes in the right insula and left IPL. Additionally, we replicated previous findings that BOLD responses in speech-related and auditory brain regions (including bilateral superior temporal sulcus and Heschl's gyrus) for clear speech were similar in both groups, and that voice identity recognition was impaired for clear and noisy speech in autism.
    Discussion: Our findings show that in autism, the processing of speech is particularly reduced under noisy conditions in the left IFG, a dysfunction that might be important in explaining restricted speech comprehension in noisy environments.

    Speech denoising using nonnegative matrix factorization and neural networks

    The main goal of this research is to perform source separation of single-channel mixed signals so that we obtain a clean representation of each source. In our case, we are specifically concerned with separating a speaker's speech from background noise as the other source. We therefore deal with single-channel mixtures of speech with stationary, semi-stationary, and non-stationary noise types; this is what we define as speech denoising. Our goal is to build a system that takes a noisy speech signal as input and outputs the clean speech with as little distortion and as few artifacts as possible. The model requires no prior information about the speaker or the background noise, and the separation is done in real time, since the input signal can be fed in on a frame-by-frame basis. This model can be used in speech recognition systems to improve recognition accuracy in noisy environments. Two main methods were adopted for this purpose: nonnegative matrix factorization (NMF) and neural networks. Experiments were conducted to compare the performance of these two methods for speech denoising. For each method, we compared performance when prior information about both the speaker and the noise was available against using only a general speech dictionary. Further experiments compared different architectures and parameters within each approach.
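
    As a rough illustration of the NMF side of such a system, the sketch below factorizes a noisy magnitude spectrogram against a fixed dictionary of speech and noise bases learned beforehand, then reconstructs a speech estimate with a Wiener-style mask. This is a minimal sketch with plain multiplicative updates and hypothetical inputs (speech_mag, noise_mag, mix_mag stand in for real |STFT| spectrograms); it is not the thesis's actual system or parameter choices.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, W_fixed=None, seed=0):
    """Basic NMF with multiplicative updates for squared error: V ~= W @ H.
    If W_fixed is given, the dictionary is kept and only the activations H are updated."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) if W_fixed is None else W_fixed
    H = rng.random((rank, n_frames))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if W_fixed is None:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical magnitude spectrograms (frequency bins x frames), e.g. |STFT| values.
rng = np.random.default_rng(0)
speech_mag = np.abs(rng.normal(size=(257, 400)))   # clean speech used for training
noise_mag  = np.abs(rng.normal(size=(257, 400)))   # noise used for training
mix_mag    = np.abs(rng.normal(size=(257, 200)))   # noisy mixture to denoise

# 1) Learn spectral dictionaries for speech and noise separately.
W_s, _ = nmf(speech_mag, rank=40)
W_n, _ = nmf(noise_mag, rank=20)

# 2) Decompose the mixture against the fixed, concatenated dictionary.
W_mix = np.concatenate([W_s, W_n], axis=1)
_, H_mix = nmf(mix_mag, rank=W_mix.shape[1], W_fixed=W_mix)

# 3) Wiener-style mask built from the speech part of the reconstruction.
n_s = W_s.shape[1]
speech_part = W_s @ H_mix[:n_s]
noise_part  = W_n @ H_mix[n_s:]
mask = speech_part / (speech_part + noise_part + 1e-9)
speech_estimate = mask * mix_mag   # combine with the mixture phase to invert the STFT
```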

    A Comparative Study of Computational Models of Auditory Peripheral System

    A detailed study of computational models of the auditory peripheral system from three research groups, Carney, Meddis, and Hemmert, is presented here. The aim is to find out which model fits the data best and which properties of the models are relevant for speech recognition. As a first approximation, different tests with tones were performed on seven models. We then evaluated the models in the presence of speech: two of them were studied in depth through an automatic speech recognition (ASR) system, in clean and noisy backgrounds and over a range of sound levels. The post-stimulus time histogram shows how the models with improved offset adaptation exhibit the "dead time". In turn, the synchronization evaluation for tones and modulated signals highlighted better results for the models with offset adaptation. Finally, tuning curves and Q10dB (together with the ASR results) indicated, on the contrary, that frequency selectivity is not a property needed for speech recognition. The evaluation of the models with ASR also demonstrated that models with offset adaptation outperform the others and that the choice between cat and human tuning is of little consequence for speech recognition. With these results, we conclude that the model that best fits the data is the one described by Zilany et al. (2009), and that the indispensable property for speech recognition is good offset adaptation, which offers better synchronization and better ASR results. For the ASR system it makes little difference whether offset adaptation comes from a shift of the auditory nerve response or from power-law adaptation in the synapse.
    Vendrell Llopis, N. (2010). A Comparative Study of Computational Models of Auditory Peripheral System. http://hdl.handle.net/10251/20433
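
    For readers unfamiliar with the metrics named above, the sketch below shows one conventional way such quantities can be computed from model output: a post-stimulus time histogram from spike times, and Q10dB as characteristic frequency divided by the tuning-curve bandwidth 10 dB above the tip threshold. The synthetic V-shaped tuning curve and bin width are assumptions for illustration, not values from the study.

```python
import numpy as np

def psth(spike_times_s, duration_s, bin_s=1e-3):
    """Post-stimulus time histogram: spike counts in fixed bins after stimulus onset."""
    edges = np.arange(0.0, duration_s + bin_s, bin_s)
    counts, _ = np.histogram(spike_times_s, bins=edges)
    return counts, edges

def q10db(freqs_hz, thresholds_db):
    """Tuning-curve sharpness: characteristic frequency divided by the bandwidth
    measured 10 dB above the threshold at the tip (assumes a single V-shaped tip)."""
    cf_idx = np.argmin(thresholds_db)
    cf = freqs_hz[cf_idx]
    level = thresholds_db[cf_idx] + 10.0
    within = np.where(thresholds_db <= level)[0]
    bandwidth = freqs_hz[within[-1]] - freqs_hz[within[0]]
    return cf / bandwidth

# Hypothetical V-shaped tuning curve around a 1 kHz characteristic frequency.
freqs = np.linspace(200, 5000, 200)
thresholds = 20 + 30 * np.abs(np.log2(freqs / 1000.0))
print("Q10dB ~", q10db(freqs, thresholds))
```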

    Speech‑derived haptic stimulation enhances speech recognition in a multi‑talker background

    Published: 03 October 2023
    Speech understanding, while effortless in quiet conditions, is challenging in noisy environments. Previous studies have revealed that a feasible approach to supplement speech-in-noise (SiN) perception consists in presenting speech-derived signals as haptic input. In the current study, we investigated whether the presentation of a vibrotactile signal derived from the speech temporal envelope can improve SiN intelligibility in a multi-talker background for untrained, normal-hearing listeners. We also determined whether vibrotactile sensitivity, evaluated using vibrotactile detection thresholds, modulates the extent of audio-tactile SiN improvement. In practice, we measured participants' speech recognition in multi-talker noise without (audio-only) and with (audio-tactile) concurrent vibrotactile stimulation delivered in three schemes: to the left palm, to the right palm, or to both. Averaged across the three delivery schemes, the vibrotactile stimulation led to a significant improvement of 0.41 dB in SiN recognition compared to the audio-only condition, with no significant differences between the delivery schemes. In addition, the audio-tactile SiN benefit was significantly predicted by participants' vibrotactile threshold levels and unimodal (audio-only) SiN performance. The extent of the improvement afforded by speech-envelope-derived vibrotactile stimulation was in line with previously reported vibrotactile enhancements of SiN perception in untrained listeners with no known hearing impairment. Overall, these results highlight the potential of concurrent vibrotactile stimulation to improve SiN recognition, especially in individuals with poor SiN perception abilities, and tentatively more so with increasing tactile sensitivity. Moreover, they lend support to multimodal accounts of speech perception and to research on tactile speech aid devices.
    I. Sabina Răutu is supported by the Fonds pour la formation à la recherche dans l'industrie et l'agriculture (FRIA), Fonds de la Recherche Scientifique (FRS-FNRS), Brussels, Belgium. Xavier De Tiège is Clinical Researcher at the FRS-FNRS. This research project has been supported by the Fonds Erasme (Research convention "Les Voies du Savoir 2", Brussels, Belgium).
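
    A minimal sketch of how a speech-derived vibrotactile drive signal might be obtained, assuming the signal is simply the lowpass-filtered Hilbert envelope of the audio, downsampled to an actuator rate; the study's actual processing chain may differ, and the function name and parameters (cutoff_hz, actuator_fs) are illustrative.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample_poly

def speech_envelope_for_haptics(audio, fs, cutoff_hz=50.0, actuator_fs=500):
    """Extract a low-frequency temporal envelope from speech, suitable in principle
    for driving a vibrotactile actuator."""
    envelope = np.abs(hilbert(audio))                     # instantaneous amplitude
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")   # keep slow amplitude fluctuations
    envelope = filtfilt(b, a, envelope)
    envelope = resample_poly(envelope, actuator_fs, fs)   # downsample to the actuator rate
    envelope = np.clip(envelope, 0.0, None)
    return envelope / (envelope.max() + 1e-12)            # normalize to a [0, 1] drive level

# Usage with a stand-in for a 2-second mono recording sampled at 16 kHz.
fs = 16000
audio = np.random.randn(fs * 2)
drive = speech_envelope_for_haptics(audio, fs)
```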

    Long-term musical experience and auditory and visual perceptual abilities under adverse conditions

    Musicians have been shown to have enhanced speech-perception-in-noise skills. It is unclear whether these improvements are limited to the auditory modality, as no research has examined musicians' visual perceptual abilities under degraded conditions. The current study examined associations between long-term musical experience and visual perception under noisy or degraded conditions. The performance of 11 musicians and 11 age-matched nonmusicians was compared on several auditory and visual perception-in-noise measures. Auditory perception tests included speech-in-noise tests and an environmental-sounds-in-noise test. Visual perception tasks included a fragmented-sentences task, an object recognition task, and a lip-reading measure. Participants' vocabulary knowledge and nonverbal reasoning abilities were also assessed. Musicians outperformed nonmusicians on the speech-perception-in-noise measures as well as on the visual fragmented-sentences task. Musicians also displayed better vocabulary knowledge than nonmusicians. Associations were found between perception of speech and perception of visually degraded text. The findings show that long-term musical experience is associated with modality-general improvements in perceptual abilities. Possible systems supporting musicians' perceptual abilities are discussed.

    Single-Channel Speech Enhancement Based on Deep Neural Networks

    Speech enhancement (SE) aims to improve the quality of degraded speech. Recently, researchers have turned to deep learning as a primary tool for speech enhancement, often using deterministic models trained in a supervised manner. Typically, a neural network is trained as a mapping function that converts features of noisy speech into targets from which clean speech can be reconstructed. These neural-network-based speech enhancement methods have focused on estimating the spectral magnitude of clean speech, since estimating the spectral phase with neural networks is difficult due to the wrapping effect. As an alternative, complex spectrum estimation implicitly resolves the phase estimation problem and has been shown to outperform spectral magnitude estimation. In the first contribution of this thesis, a fully convolutional neural network (FCN) is proposed for complex spectrogram estimation. Stacked frequency-dilated convolutions are employed to obtain exponential growth of the receptive field in the frequency domain. The proposed network also features an efficient implementation that requires far fewer parameters than conventional deep neural networks (DNN) and convolutional neural networks (CNN) while still yielding comparable performance. Considering that speech enhancement is only useful in noisy conditions, yet conventional SE methods often do not adapt to different noise conditions, in the second contribution we propose a model that provides an automatic "on/off" switch for speech enhancement. It is capable of scaling its computational complexity under different signal-to-noise ratio (SNR) levels by detecting clean or near-clean speech that requires no processing. By adopting an information-maximizing generative adversarial network (InfoGAN) in a deterministic, supervised manner, we incorporate an SNR indicator into the model at little additional cost to the system. We evaluate the proposed SE methods with respect to two objectives: speech intelligibility and application to automatic speech recognition (ASR). Experimental results show that the CNN-based model is applicable to both objectives, while the InfoGAN-based model is more useful in terms of speech intelligibility. The experiments also show that SE for ASR may be more challenging than improving speech intelligibility, as a series of factors, including the training dataset and the neural network models, impact ASR performance.
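
    The sketch below illustrates the general idea of stacking frequency-dilated convolutions so that the receptive field along the frequency axis doubles with each layer. It is an illustration of the technique under assumed layer counts, channel sizes, and input shapes, not the thesis's actual FCN architecture.

```python
import torch
import torch.nn as nn

class FreqDilatedStack(nn.Module):
    """Stack of 2-D convolutions whose dilation doubles along the frequency axis
    at each layer, so the frequency receptive field grows exponentially with depth."""
    def __init__(self, channels=32, n_layers=4, kernel=(3, 3)):
        super().__init__()
        layers = []
        in_ch = 1
        for i in range(n_layers):
            dilation = (2 ** i, 1)                                   # dilate frequency, not time
            padding = (dilation[0] * (kernel[0] // 2), kernel[1] // 2)  # keep the feature map size
            layers += [nn.Conv2d(in_ch, channels, kernel, padding=padding, dilation=dilation),
                       nn.ReLU()]
            in_ch = channels
        # Two output channels: real and imaginary parts of the estimated complex spectrogram.
        layers.append(nn.Conv2d(in_ch, 2, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_spec):            # (batch, 1, freq, time)
        return self.net(noisy_spec)

# Hypothetical usage on a batch of noisy spectrogram features (257 bins, 100 frames).
model = FreqDilatedStack()
x = torch.randn(4, 1, 257, 100)
out = model(x)                                # (4, 2, 257, 100): real and imaginary estimates
```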

    Robust Neural Machine Translation for Clean and Noisy Speech Transcripts

    Neural machine translation models have been shown to achieve high quality when trained and fed with well-structured and punctuated input texts. Unfortunately, the latter condition is not met in spoken language translation, where the input is generated by an automatic speech recognition (ASR) system. In this paper, we study how to adapt a strong NMT system to make it robust to typical ASR errors. As in our application scenarios transcripts might be post-edited by human experts, we propose adaptation strategies to train a single system that can translate either clean or noisy input with no supervision on the input type. Our experimental results on a public speech translation data set show that adapting a model on a significant amount of parallel data including ASR transcripts is beneficial for test data of the same type, but produces a small degradation when translating clean text. Adapting on both clean and noisy variants of the same data leads to the best results on both input types.
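
    As a loose sketch of how a mixed clean/noisy fine-tuning set could be assembled when real ASR output is unavailable, the code below simulates ASR-style corruption (lowercasing, punctuation removal, occasional word drops) and pairs each target sentence with both source variants. This is a common generic technique and not necessarily the paper's adaptation recipe; the noise model, function names, and example sentence pair are all illustrative assumptions.

```python
import random
import string

def asr_like(sentence, p_drop=0.05, seed=None):
    """Crudely simulate ASR-style output: lowercase, strip punctuation,
    occasionally drop a word. Real ASR errors are richer than this."""
    rng = random.Random(seed)
    table = str.maketrans("", "", string.punctuation)
    words = sentence.lower().translate(table).split()
    kept = [w for w in words if rng.random() > p_drop] or words
    return " ".join(kept)

def build_mixed_corpus(parallel_pairs, seed=0):
    """Pair each target sentence with both its clean source and a noisy variant,
    so a single model sees both input types during fine-tuning."""
    mixed = []
    for i, (src, tgt) in enumerate(parallel_pairs):
        mixed.append((src, tgt))                            # clean source
        mixed.append((asr_like(src, seed=seed + i), tgt))   # simulated noisy source
    return mixed

# Hypothetical parallel pair for illustration only.
pairs = [("The meeting starts at nine, right?", "La riunione inizia alle nove, vero?")]
for src, tgt in build_mixed_corpus(pairs):
    print(src, "->", tgt)
```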