
    ARSTREAM: A Neural Network Model of Auditory Scene Analysis and Source Segregation

    Multiple sound sources often contain harmonics that overlap and may be degraded by environmental noise. The auditory system is capable of teasing apart these sources into distinct mental objects, or streams. Such an "auditory scene analysis" enables the brain to solve the cocktail party problem. A neural network model of auditory scene analysis, called the ARSTREAM model, is presented to propose how the brain accomplishes this feat. The model clarifies how the frequency components that correspond to a given acoustic source may be coherently grouped together into distinct streams based on pitch and spatial cues. The model also clarifies how multiple streams may be distinguished and separated by the brain. Streams are formed as spectral-pitch resonances that emerge through feedback interactions between the frequency-specific spectral representation of a sound source and its pitch. First, the model transforms a sound into a spatial pattern of frequency-specific activation across a spectral stream layer. The sound has multiple parallel representations at this layer. A sound's spectral representation activates a bottom-up filter that is sensitive to harmonics of the sound's pitch. The filter activates a pitch category which, in turn, activates a top-down expectation that allows one voice or instrument to be tracked through a noisy multiple-source environment. Spectral components are suppressed if they do not match harmonics of the top-down expectation that is read out by the selected pitch, thereby allowing another stream to capture these components, as in the "old-plus-new heuristic" of Bregman. Multiple simultaneously occurring spectral-pitch resonances can thereby emerge. These resonance and matching mechanisms are specialized versions of Adaptive Resonance Theory, or ART, which clarifies how pitch representations can self-organize during learning of harmonic bottom-up filters and top-down expectations.
The model also clarifies how spatial location cues can help to disambiguate two sources with similar spectral cues. Data are simulated from psychophysical grouping experiments, such as how a tone sweeping upwards in frequency creates a bounce percept by grouping with a downward sweeping tone due to proximity in frequency, even if noise replaces the tones at their intersection point. Illusory auditory percepts are also simulated, such as the auditory continuity illusion of a tone continuing through a noise burst even if the tone is not present during the noise, and the scale illusion of Deutsch whereby downward and upward scales presented alternately to the two ears are regrouped based on frequency proximity, leading to a bounce percept. Since related sorts of resonances have been used to quantitatively simulate psychophysical data about speech perception, the model strengthens the hypothesis that ART-like mechanisms are used at multiple levels of the auditory system. Proposals for developing the model to explain more complex streaming data are also provided. Air Force Office of Scientific Research (F49620-01-1-0397, F49620-92-J-0225); Office of Naval Research (N00014-01-1-0624); Advanced Research Projects Agency (N00014-92-J-4015); British Petroleum (89A-1204); National Science Foundation (IRI-90-00530); American Society of Engineering Educatio

    Investigating the effect of visual phonetic cues on the auditory N1 & P2

    Studies have shown that the N1 and P2 auditory event-related potentials (ERPs) that occur to a speech sound when the talker can be seen (i.e., auditory-visual speech) occur earlier and are reduced in amplitude compared to when the talker cannot be seen (auditory-only speech). An explanation for why seeing the talker changes the brain’s response to sound is that visual speech provides information about the upcoming auditory speech event. This information reduces uncertainty about when the sound will occur and about what the event will be (resulting in a smaller N1 and P2, which are markers associated with auditory processing). It has yet to be determined whether form information alone can influence the amplitude or timing of either the N1 or P2. We tested this by conducting two separate EEG experiments. In Experiment 1, we compared the N1 and P2 peaks of the ERPs to auditory speech when preceded by a visual speech cue (auditory-visual speech) or by a static neutral face. In Experiment 2, we compared the N1 and P2 peaks of the ERPs to auditory speech preceded by print cues presenting reliable information about their content (written “ba” or “da” shown before these spoken syllables) or by control cues (meaningless printed symbols). The results of Experiment 1 confirmed that the presentation of visual speech produced the expected amplitude suppression of the N1, but the opposite of the expected latency facilitation occurred (the N1 was faster for auditory-only speech than for auditory-visual speech). For Experiment 2, no difference in the amplitude or timing of the N1 or P2 ERPs to the reliable print versus the control cues was found. The unexpectedly slower N1 latency to auditory-visual speech stimuli found in Experiment 1 may be accounted for by attentional differences induced by the experimental design. The null effect of print cues in Experiment 2 indicates the importance of the temporal relationship between visual and auditory events.

    Embodied & Situated Language Processing

    Sound for Fantasy and Freedom

    Sound is an integral part of our everyday lives. Sound tells us about physical events in the environment, and we use our voices to share ideas and emotions through sound. When navigating the world on a day-to-day basis, most of us use a balanced mix of stimuli from our eyes, ears and other senses to get along. We do this totally naturally and without effort. In the design of computer game experiences, traditionally, most attention has been given to vision rather than to this balanced mix of stimuli. The risk is that this emphasis neglects types of interaction with the game needed to create an immersive experience. This chapter summarizes the relationship between sound properties, GameFlow and immersive experience and discusses two projects in which Interactive Institute, Sonic Studio has balanced perceptual stimuli and game mechanics to inspire and create new game concepts that liberate users and their imagination.