173 research outputs found

    Single-Microphone Speech Enhancement Inspired by Auditory System

    Get PDF
    Enhancing quality of speech in noisy environments has been an active area of research due to the abundance of applications dealing with human voice and dependence of their performance on this quality. While original approaches in the field were mostly addressing this problem in a pure statistical framework in which the goal was to estimate speech from its sum with other independent processes (noise), during last decade, the attention of the scientific community has turned to the functionality of human auditory system. A lot of effort has been put to bridge the gap between the performance of speech processing algorithms and that of average human by borrowing the models suggested for the sound processing in the auditory system. In this thesis, we will introduce algorithms for speech enhancement inspired by two of these models i.e. the cortical representation of sounds and the hypothesized role of temporal coherence in the auditory scene analysis. After an introduction to the auditory system and the speech enhancement framework we will first show how traditional speech enhancement technics such as wiener-filtering can benefit on the feature extraction level from discriminatory capabilities of spectro-temporal representation of sounds in the cortex i.e. the cortical model. We will next focus on the feature processing as opposed to the extraction stage in the speech enhancement systems by taking advantage of models hypothesized for human attention for sound segregation. We demonstrate a mask-based enhancement method in which the temporal coherence of features is used as a criterion to elicit information about their sources and more specifically to form the masks needed to suppress the noise. Lastly, we explore how the two blocks for feature extraction and manipulation can be merged into one in a manner consistent with our knowledge about auditory system. We will do this through the use of regularized non-negative matrix factorization to optimize the feature extraction and simultaneously account for temporal dynamics to separate noise from speech

    Representation of speech in the primary auditory cortex and its implications for robust speech processing

    Get PDF
    Speech has evolved as a primary form of communication between humans. This most used means of communication has been the subject of intense study for years, but there is still a lot that we do not know about it. It is an oft repeated fact, that even the performance of the best speech processing algorithms still lags far behind that of the average human, It seems inescapable that unless we know more about the way the brain performs this task, our machines can not go much further. This thesis focuses on the question of speech representation in the brain, both from a physiological and technological perspective. We explore the representation of speech through the encoding of its smallest elements - phonemic features - in the primary auditory cortex. We report on how population of neurons with diverse tuning properties respond discriminately to phonemes resulting in explicit encoding of their parameters. Next, we show that this sparse encoding of the phonemic features is a simple consequence of the linear spectro-temporal properties of the auditory cortical neurons and that a Spectro-Temporal receptive field model can predict similar patterns of activation. This is an important step toward the realization of systems that operate based on the same principles as the cortex. Using an inverse method of reconstruction, we shall also explore the extent to which phonemic features are preserved in the cortical representation of noisy speech. The results suggest that the cortical responses are more robust to noise and that the important features of phonemes are preserved in the cortical representation even in noise. Finally, we explain how a model of this cortical representation can be used for speech processing and enhancement applications to improve their robustness and performance

    Differentiable Time-Frequency Scattering on GPU

    Full text link
    Joint time-frequency scattering (JTFS) is a convolutional operator in the time-frequency domain which extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and thus may serve as a biological plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside of the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue down to three limitations: differentiability, speed, and flexibility. In this paper, we present an implementation of time-frequency scattering in Python. Unlike prior implementations, ours accommodates NumPy, PyTorch, and TensorFlow as backends and is thus portable on both CPU and GPU. We demonstrate the usefulness of JTFS via three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.Comment: 8 pages, 6 figures. Submitted to the International Conference on Digital Audio Effects (DAFX) 202

    Enhanced amplitude modulations contribute to the Lombard intelligibility benefit: Evidence from the Nijmegen Corpus of Lombard Speech

    No full text
    Speakers adjust their voice when talking in noise, which is known as Lombard speech. These acoustic adjustments facilitate speech comprehension in noise relative to plain speech (i.e., speech produced in quiet). However, exactly which characteristics of Lombard speech drive this intelligibility benefit in noise remains unclear. This study assessed the contribution of enhanced amplitude modulations to the Lombard speech intelligibility benefit by demonstrating that (1) native speakers of Dutch in the Nijmegen Corpus of Lombard Speech (NiCLS) produce more pronounced amplitude modulations in noise vs. in quiet; (2) more enhanced amplitude modulations correlate positively with intelligibility in a speech-in-noise perception experiment; (3) transplanting the amplitude modulations from Lombard speech onto plain speech leads to an intelligibility improvement, suggesting that enhanced amplitude modulations in Lombard speech contribute towards intelligibility in noise. Results are discussed in light of recent neurobiological models of speech perception with reference to neural oscillators phase-locking to the amplitude modulations in speech, guiding the processing of speech

    Distinct higher-order representations of natural sounds in human and ferret auditory cortex

    Get PDF
    Little is known about how neural representations of natural sounds differ across species. For example, speech and music play a unique role in human hearing, yet it is unclear how auditory representations of speech and music differ between humans and other animals. Using functional ultrasound imaging, we measured responses in ferrets to a set of natural and spectrotemporally matched synthetic sounds previously tested in humans. Ferrets showed similar lower-level frequency and modulation tuning to that observed in humans. But while humans showed substantially larger responses to natural vs. synthetic speech and music in non-primary regions, ferret responses to natural and synthetic sounds were closely matched throughout primary and non-primary auditory cortex, even when tested with ferret vocalizations. This finding reveals that auditory representations in humans and ferrets diverge sharply at late stages of cortical processing, potentially driven by higher-order processing demands in speech and music
    • …
    corecore