2,450 research outputs found

    ARSTREAM: A Neural Network Model of Auditory Scene Analysis and Source Segregation

    Full text link
    Multiple sound sources often contain harmonics that overlap and may be degraded by environmental noise. The auditory system is capable of teasing apart these sources into distinct mental objects, or streams. Such an "auditory scene analysis" enables the brain to solve the cocktail party problem. A neural network model of auditory scene analysis, called the AIRSTREAM model, is presented to propose how the brain accomplishes this feat. The model clarifies how the frequency components that correspond to a give acoustic source may be coherently grouped together into distinct streams based on pitch and spatial cues. The model also clarifies how multiple streams may be distinguishes and seperated by the brain. Streams are formed as spectral-pitch resonances that emerge through feedback interactions between frequency-specific spectral representaion of a sound source and its pitch. First, the model transforms a sound into a spatial pattern of frequency-specific activation across a spectral stream layer. The sound has multiple parallel representations at this layer. A sound's spectral representation activates a bottom-up filter that is sensitive to harmonics of the sound's pitch. The filter activates a pitch category which, in turn, activate a top-down expectation that allows one voice or instrument to be tracked through a noisy multiple source environment. Spectral components are suppressed if they do not match harmonics of the top-down expectation that is read-out by the selected pitch, thereby allowing another stream to capture these components, as in the "old-plus-new-heuristic" of Bregman. Multiple simultaneously occuring spectral-pitch resonances can hereby emerge. These resonance and matching mechanisms are specialized versions of Adaptive Resonance Theory, or ART, which clarifies how pitch representations can self-organize durin learning of harmonic bottom-up filters and top-down expectations. The model also clarifies how spatial location cues can help to disambiguate two sources with similar spectral cures. Data are simulated from psychophysical grouping experiments, such as how a tone sweeping upwards in frequency creates a bounce percept by grouping with a downward sweeping tone due to proximity in frequency, even if noise replaces the tones at their interection point. Illusory auditory percepts are also simulated, such as the auditory continuity illusion of a tone continuing through a noise burst even if the tone is not present during the noise, and the scale illusion of Deutsch whereby downward and upward scales presented alternately to the two ears are regrouped based on frequency proximity, leading to a bounce percept. Since related sorts of resonances have been used to quantitatively simulate psychophysical data about speech perception, the model strengthens the hypothesis the ART-like mechanisms are used at multiple levels of the auditory system. Proposals for developing the model to explain more complex streaming data are also provided.Air Force Office of Scientific Research (F49620-01-1-0397, F49620-92-J-0225); Office of Naval Research (N00014-01-1-0624); Advanced Research Projects Agency (N00014-92-J-4015); British Petroleum (89A-1204); National Science Foundation (IRI-90-00530); American Society of Engineering Educatio

    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    Get PDF
    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture with the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located on the delay that corresponds to multiple pitch periods. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are employed by a `speech fragment decoder' which employs `missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments over different conditions, which results in significantly better recognition accuracy

    ARSTREAM: A Neural Network Model of Auditory Scene Analysis and Source Segregation

    Full text link
    Multiple sound sources often contain harmonics that overlap and may be degraded by environmental noise. The auditory system is capable of teasing apart these sources into distinct mental objects, or streams. Such an "auditory scene analysis" enables the brain to solve the cocktail party problem. A neural network model of auditory scene analysis, called the AIRSTREAM model, is presented to propose how the brain accomplishes this feat. The model clarifies how the frequency components that correspond to a give acoustic source may be coherently grouped together into distinct streams based on pitch and spatial cues. The model also clarifies how multiple streams may be distinguishes and seperated by the brain. Streams are formed as spectral-pitch resonances that emerge through feedback interactions between frequency-specific spectral representaion of a sound source and its pitch. First, the model transforms a sound into a spatial pattern of frequency-specific activation across a spectral stream layer. The sound has multiple parallel representations at this layer. A sound's spectral representation activates a bottom-up filter that is sensitive to harmonics of the sound's pitch. The filter activates a pitch category which, in turn, activate a top-down expectation that allows one voice or instrument to be tracked through a noisy multiple source environment. Spectral components are suppressed if they do not match harmonics of the top-down expectation that is read-out by the selected pitch, thereby allowing another stream to capture these components, as in the "old-plus-new-heuristic" of Bregman. Multiple simultaneously occuring spectral-pitch resonances can hereby emerge. These resonance and matching mechanisms are specialized versions of Adaptive Resonance Theory, or ART, which clarifies how pitch representations can self-organize durin learning of harmonic bottom-up filters and top-down expectations. The model also clarifies how spatial location cues can help to disambiguate two sources with similar spectral cures. Data are simulated from psychophysical grouping experiments, such as how a tone sweeping upwards in frequency creates a bounce percept by grouping with a downward sweeping tone due to proximity in frequency, even if noise replaces the tones at their interection point. Illusory auditory percepts are also simulated, such as the auditory continuity illusion of a tone continuing through a noise burst even if the tone is not present during the noise, and the scale illusion of Deutsch whereby downward and upward scales presented alternately to the two ears are regrouped based on frequency proximity, leading to a bounce percept. Since related sorts of resonances have been used to quantitatively simulate psychophysical data about speech perception, the model strengthens the hypothesis the ART-like mechanisms are used at multiple levels of the auditory system. Proposals for developing the model to explain more complex streaming data are also provided.Air Force Office of Scientific Research (F49620-01-1-0397, F49620-92-J-0225); Office of Naval Research (N00014-01-1-0624); Advanced Research Projects Agency (N00014-92-J-4015); British Petroleum (89A-1204); National Science Foundation (IRI-90-00530); American Society of Engineering Educatio

    A Neural Network Model of Auditory Scene Anaysis and Source Segregation

    Full text link
    In environments with multiple sound sources, the auditory system is capable of teasing apart the impinging jumbled signal into different mental objects, or streams, as in its ability to solve the cocktail party problem. A neural network model of auditory scene analysis, called the ARTSTREAM model, is presented that groups different frequency components based on pitch and spatial location cues, and selectively allocates the components to different streams. The grouping is accomplished through a resonance that develops between a given object's pitch, its harmonic spectral components, and (to a lesser extent) its spatial location. Those spectral components that are not reinforced by being rnatched with the top-down prototype read-out by the selected object's pitch representation are suppressed, thereby allowing another stream to capture these components, as in the "old-plus-new heuristic" of Bregman. These resonance and matching mechanisms are specialized versions of Adaptive Resonance Theory, or ART, mechanisms. The model is used to simulate data from psychophysical grouping experiments, such as how a. tone sweeping upwards in frequency creates a bounce percept by grouping with a downward sweeping tone clue to proximity in frequency, even if noise replaces the tones at their intersection point. The model also simulates illusory auditory percepts such as the auditory continuity illusion of a tone continuing through a noise burst even if the tone is not present during the noise, and the scale illusion of Deutsch whereby downward and upward scales presented alternately to the two ears are regrouped based on frequency proximity, leading to a bounce percept. The stream resonances provide the coherence that allows one voice or instrument to be tracked through a multiple source environment.Air Force Office of Scientific Research (F49620-92-J-0225, F49620-92-J-0225); National Science Foundation (IRI-90-00530); Office of Naval Research (N00014-92-J-4015); British Petroleum (89A-1204

    Understanding concurrent earcons: applying auditory scene analysis principles to concurrent earcon recognition

    Get PDF
    Two investigations into the identification of concurrently presented, structured sounds, called earcons were carried out. One of the experiments investigated how varying the number of concurrently presented earcons affected their identification. It was found that varying the number had a significant effect on the proportion of earcons identified. Reducing the number of concurrently presented earcons lead to a general increase in the proportion of presented earcons successfully identified. The second experiment investigated how modifying the earcons and their presentation, using techniques influenced by auditory scene analysis, affected earcon identification. It was found that both modifying the earcons such that each was presented with a unique timbre, and altering their presentation such that there was a 300 ms onset-to-onset time delay between each earcon were found to significantly increase identification. Guidelines were drawn from this work to assist future interface designers when incorporating concurrently presented earcons

    Single-Microphone Speech Separation: The use of Speech Models

    Get PDF

    A computational framework for sound segregation in music signals

    Get PDF
    Tese de doutoramento. Engenharia Electrotécnica e de Computadores. Faculdade de Engenharia. Universidade do Porto. 200
    • 

    corecore