
    A physiologically inspired model for solving the cocktail party problem.

    At a cocktail party, we can broadly monitor the entire acoustic scene to detect important cues (e.g., our names being called, or the fire alarm going off), or selectively listen to a target sound source (e.g., a conversation partner). It has recently been observed that individual neurons in the avian field L (analogous to the mammalian auditory cortex) can display broad spatial tuning to single targets and selective tuning to a target embedded in spatially distributed sound mixtures. Here, we describe a model inspired by these experimental observations and apply it to process mixtures of human speech sentences. This processing is realized in the neural spiking domain. It converts binaural acoustic inputs into cortical spike trains using a multi-stage model composed of a cochlear filter-bank, a midbrain spatial-localization network, and a cortical network. The output spike trains of the cortical network are then converted back into an acoustic waveform using a stimulus-reconstruction technique, and the intelligibility of the reconstructed output is quantified with an objective measure of speech intelligibility. We apply the algorithm to single- and multi-talker speech to demonstrate that it achieves intelligible reconstruction of an "attended" target sentence embedded in two other non-attended masker sentences. The algorithm is also robust to masker level and displays performance trends comparable to humans. The ideas from this work may help improve the performance of hearing-assistive devices (e.g., hearing aids and cochlear implants), speech-recognition technology, and computational algorithms for processing natural scenes cluttered with spatially distributed acoustic objects. (Funding: R01 DC000100 - NIDCD NIH HHS. Published version.)
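
    As a rough illustration of the multi-stage structure described above, here is a toy, non-spiking sketch in Python. The stage names follow the abstract, but every implementation detail (an FFT-mask filterbank, a cross-correlation ITD estimate, selecting bands by attended direction) is a stand-in assumption; the actual model operates on spike trains and reconstructs the waveform from cortical spiking activity.

    ```python
    import numpy as np

    # Toy stand-in for the pipeline: cochlear filterbank -> midbrain
    # spatial localization -> cortex-like selection of the attended
    # direction. The real model is spiking; these stages are illustrative.

    def cochlear_filterbank(x, sr, num_bands=32):
        """Split a waveform into log-spaced bandpass channels via FFT masking."""
        spectrum = np.fft.rfft(x)
        freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
        edges = np.logspace(np.log10(100.0), np.log10(sr / 2.0), num_bands + 1)
        return np.array([np.fft.irfft(spectrum * ((freqs >= lo) & (freqs < hi)),
                                      n=len(x))
                         for lo, hi in zip(edges[:-1], edges[1:])])

    def estimate_itd(left, right, sr, max_lag=40):
        """Midbrain-like spatial cue: per-band interaural time difference,
        taken as the cross-correlation lag with the largest response."""
        lags = list(range(-max_lag, max_lag + 1))
        xcorr = [np.dot(left[max(0, -l):len(left) - max(0, l)],
                        right[max(0, l):len(right) - max(0, -l)]) for l in lags]
        return lags[int(np.argmax(xcorr))] / sr

    def attend_direction(left, right, sr, target_itd, tol=2e-4):
        """Cortex-like selection: keep bands whose ITD matches the attended
        direction, then sum them back into a single output waveform."""
        lb, rb = cochlear_filterbank(left, sr), cochlear_filterbank(right, sr)
        kept = [b for b, r in zip(lb, rb)
                if abs(estimate_itd(b, r, sr) - target_itd) < tol]
        return np.sum(kept, axis=0) if kept else np.zeros_like(left)
    ```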

    Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation

    We propose a novel deep learning model, which supports permutation invariant training (PIT), for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem. Unlike most prior art, which treats speech separation as a multi-class regression problem, and the deep clustering technique, which treats it as a segmentation (or clustering) problem, our model optimizes for the separation regression error while ignoring the order of the mixing sources. This strategy solves the long-standing label permutation problem that has hindered progress on deep-learning-based techniques for speech separation. Experiments on the equal-energy mixing setup of a Danish corpus confirm the effectiveness of PIT. We believe improvements built upon PIT can eventually solve the cocktail-party problem and enable real-world adoption of, e.g., automatic meeting transcription and multi-party human-computer interaction, where overlapping speech is common. (Comment: 5 pages.)
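
    A minimal sketch of the PIT criterion: score every assignment of model outputs to reference speakers and train on the best one. The mean-squared-error criterion and the function name here are illustrative choices, not the paper's exact objective.

    ```python
    import itertools
    import numpy as np

    def pit_loss(estimates, targets):
        """Permutation invariant training loss.

        estimates, targets: arrays of shape (num_speakers, num_samples).
        Returns the minimum mean-squared error over all output-to-target
        assignments, plus the permutation that achieves it, so the network
        is never penalized for emitting the sources in a different order.
        """
        best_loss, best_perm = np.inf, None
        for perm in itertools.permutations(range(targets.shape[0])):
            loss = np.mean((estimates - targets[list(perm)]) ** 2)
            if loss < best_loss:
                best_loss, best_perm = loss, perm
        return best_loss, best_perm
    ```

    Evaluating all S! assignments is cheap for the two- and three-talker mixtures typically studied, and the chosen permutation is simply discarded at inference time.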

    Toward the pre-cocktail party problem with TasTas++

    Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proven very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual} and TasTas \cite{shi2020speech}. In this paper, we propose two improvements to TasTas \cite{shi2020speech} for end-to-end monaural speech separation in pre-cocktail party problems: 1) generating new training data from the original training batches in real time, and 2) training each module of TasTas separately. The new approach, called TasTas++, takes a mixed utterance of five speakers and maps it to five separated utterances, each containing only one speaker's voice. For the objective, we train the network by directly optimizing the utterance-level scale-invariant signal-to-distortion ratio (SI-SDR) in a permutation invariant training (PIT) style, as sketched below. Our experiments on the public WSJ0-5mix corpus yield an 11.14 dB SDR improvement, showing that the proposed networks improve performance on the speaker-separation task. We have open-sourced our re-implementation of DPRNN-TasNet at https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation; TasTas++ is built on this implementation, and we believe the results in this paper can be reproduced with ease. (Comment: arXiv admin note: substantial text overlap with arXiv:2001.08998, arXiv:1902.04891, arXiv:1902.00651, arXiv:2008.0314)
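
    For reference, a small sketch of the SI-SDR objective named above (the function name and epsilon guard are my assumptions); training maximizes this quantity per utterance, with PIT resolving which estimate is scored against which speaker.

    ```python
    import numpy as np

    def si_sdr(estimate, reference, eps=1e-8):
        """Scale-invariant signal-to-distortion ratio in dB.

        The reference is rescaled by its projection onto the estimate, so
        multiplying the estimate by any constant leaves the score unchanged.
        """
        estimate = estimate - estimate.mean()      # zero-mean both signals
        reference = reference - reference.mean()
        alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = alpha * reference                 # scaled reference ("signal")
        noise = estimate - target                  # residual ("distortion")
        return 10.0 * np.log10((np.dot(target, target) + eps)
                               / (np.dot(noise, noise) + eps))
    ```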

    Active inference, selective attention, and the cocktail party problem

    In this paper, we introduce a new generative model for an active inference account of preparatory and selective attention, in the context of a classic ‘cocktail party’ paradigm. In this setup, pairs of words are presented simultaneously to the left and right ears, and an instructive spatial cue directs attention to the left or right. We use this generative model to test competing hypotheses about the way human listeners direct preparatory and selective attention. We show that assigning low precision to words at attended locations, relative to unattended ones, can explain why a listener reports words from a competing sentence. Under this model, temporal changes in sensory precision were not needed to account for faster reaction times with longer cue-target intervals, but were necessary to explain ramping effects on event-related potentials (ERPs), resembling the contingent negative variation (CNV), during the preparatory interval. These simulations reveal that different processes are likely to underlie the improvement in reaction times and the ramping of ERPs that are associated with spatial cueing.
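
    To make the precision manipulation concrete, here is a minimal, hypothetical sketch of a precision-weighted Bayesian update; it is not the paper's full active-inference generative model, only an illustration of why low precision at the attended location lets words from the competing stream intrude.

    ```python
    import numpy as np

    def precision_weighted_posterior(prior, likelihood, precision):
        """Posterior over word hypotheses when the sensory log-likelihood is
        scaled by an attentional precision (precision=1 is ordinary Bayes;
        precision near 0 discounts the evidence toward the prior)."""
        log_post = np.log(prior) + precision * np.log(likelihood)
        post = np.exp(log_post - log_post.max())   # numerically stable softmax
        return post / post.sum()

    prior = np.array([0.5, 0.5])                   # attended word vs. competitor
    likelihood = np.array([0.9, 0.1])              # acoustics favor the attended word
    print(precision_weighted_posterior(prior, likelihood, 1.0))   # ~[0.90, 0.10]
    print(precision_weighted_posterior(prior, likelihood, 0.1))   # ~[0.55, 0.45]: intrusions
    ```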

    The genetic contribution to solving the cocktail-party problem

    Communicating in everyday situations requires solving the cocktail-party problem: segregating the acoustic mixture into its constituent sounds and attending to those of most interest. Humans show dramatic variation in this ability, leading some to experience real-world problems irrespective of whether they meet criteria for clinical hearing loss. Here, we estimated the genetic contribution to cocktail-party listening by measuring speech-reception thresholds (SRTs) in 425 people from large families, ranging in age from 18 to 91 years. Roughly half the variance of SRTs was explained by genes (h² = 0.567). The genetic correlation between SRTs and hearing thresholds (HTs) was moderate (ρG = 0.392), suggesting that the genetic factors influencing cocktail-party listening are partially distinct from those influencing sound sensitivity. Aging and socioeconomic status also strongly influenced SRTs. These findings may represent a first step toward identifying genes for hidden hearing loss, i.e., hearing problems in people with normal HTs.
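
    For readers unfamiliar with the notation, the two statistics above have standard quantitative-genetics definitions (sketched here in the usual variance-component form; the paper's family-based model may differ in detail):

    ```latex
    h^2 = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_E^2},
    \qquad
    \rho_G = \frac{\operatorname{cov}_A(\mathrm{SRT}, \mathrm{HT})}
                  {\sigma_A(\mathrm{SRT})\,\sigma_A(\mathrm{HT})}
    ```

    Read this way, h² = 0.567 says additive genetic variance accounts for roughly 57% of the phenotypic variance in SRTs, and ρG = 0.392 says the genetic influences on SRTs and HTs overlap only partially.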