
    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at the delay that corresponds to multiple pitch periods. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a `speech fragment decoder', which applies `missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments over different conditions, which results in significantly better recognition accuracy.
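    The correlogram described above is, at its core, a per-channel running autocorrelation of a filterbank output, whose peaks across channels line up at candidate pitch periods. A minimal sketch of that computation follows (function names, lag ranges, and the summary-correlogram pitch estimate are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def correlogram(channels, max_lag):
    """Per-channel autocorrelation (a correlogram frame).

    channels: (n_channels, n_samples) array of rectified
    cochlear-filterbank outputs for one time frame.
    Returns an (n_channels, max_lag) array of autocorrelations.
    """
    n_ch, n = channels.shape
    acg = np.zeros((n_ch, max_lag))
    for c in range(n_ch):
        x = channels[c]
        for lag in range(max_lag):
            # Unbiased correlation of the signal with itself at this lag.
            acg[c, lag] = np.dot(x[:n - lag], x[lag:]) / (n - lag)
    return acg

def pitch_lag(acg, min_lag=20):
    """Dominant pitch period: peak of the summary correlogram
    (channels summed), ignoring very small lags."""
    summary = acg.sum(axis=0)
    return int(min_lag + np.argmax(summary[min_lag:]))
```

    Grouping then amounts to assigning each channel to the pitch lag at which its own autocorrelation peaks, which is the frame-level step the paper builds its pitch tracks from.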

    Modulation-frequency acts as a primary cue for auditory stream segregation

    In our surrounding acoustic world, sounds are produced by different sources and interfere with each other before arriving at the ears. A key function of the auditory system is to provide consistent and robust descriptions of the coherent sound groupings and sequences (auditory objects), which likely correspond to the various sound sources in the environment. This function has been termed auditory stream segregation. In the current study we tested the effects of separation in the frequency of amplitude modulation on the segregation of concurrent sound sequences in the auditory stream-segregation paradigm (van Noorden 1975). The aim of the study was to assess 1) whether differential amplitude modulation would help in separating concurrent sound sequences and 2) whether this cue would interact with previously studied static cues (carrier frequency and location difference) in segregating concurrent streams of sound. We found that amplitude modulation difference is utilized as a primary cue for stream segregation, and that it interacts with other primary cues such as frequency and location difference.
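    The stimulus manipulation can be illustrated with a short sketch that builds the A-B-A- triplet sequences of the van Noorden paradigm, with the A and B tones sharing a carrier and differing only in amplitude-modulation rate (all parameter values here are illustrative assumptions, not the study's actual stimuli):

```python
import numpy as np

def am_tone(carrier_hz, mod_hz, dur_s, fs=16000):
    """A sinusoidally amplitude-modulated tone."""
    t = np.arange(int(dur_s * fs)) / fs
    env = 0.5 * (1.0 + np.sin(2 * np.pi * mod_hz * t))  # envelope in [0, 1]
    return env * np.sin(2 * np.pi * carrier_hz * t)

def aba_sequence(mod_a, mod_b, n_triplets=5, tone_s=0.1, fs=16000):
    """A-B-A- triplets: same 1 kHz carrier, different AM rates,
    with a silent gap where the second A of a quadruplet would fall."""
    gap = np.zeros(int(tone_s * fs))
    a = am_tone(1000.0, mod_a, tone_s, fs)
    b = am_tone(1000.0, mod_b, tone_s, fs)
    triplet = np.concatenate([a, b, a, gap])
    return np.tile(triplet, n_triplets)
```

    With a large modulation-rate difference (e.g. `aba_sequence(20.0, 80.0)`), listeners tend to hear the A and B tones split into two streams even though their carrier frequency is identical, which is the effect the study measures.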

    Computational Models of Auditory Scene Analysis: A Review

    Auditory scene analysis (ASA) refers to the process(es) of parsing the complex acoustic input into auditory perceptual objects representing either physical sources or temporal sound patterns, such as melodies, which contributed to the sound waves reaching the ears. A number of new computational models accounting for some of the perceptual phenomena of ASA have been published recently. Here we provide a theoretically motivated review of these computational models, aiming to relate their guiding principles to the central issues of the theoretical framework of ASA. Specifically, we ask how they achieve the grouping and separation of sound elements and whether they implement some form of competition between alternative interpretations of the sound input. We consider the extent to which they include predictive processes, as important current theories suggest that perception is inherently predictive, and also how they have been evaluated. We conclude that current computational models of ASA are fragmentary in the sense that rather than providing general competing interpretations of ASA, they focus on assessing the utility of specific processes (or algorithms) for finding the causes of the complex acoustic signal. This leaves open the possibility of integrating complementary aspects of the models into a more comprehensive theory of ASA.

    Dissociable mechanisms of concurrent speech identification in noise at cortical and subcortical levels

    When two vowels with different fundamental frequencies (F0s) are presented concurrently, listeners often hear two voices producing different vowels on different pitches. Parsing of this simultaneous speech can also be affected by the signal-to-noise ratio (SNR) in the auditory scene. The extraction and interaction of F0 and SNR cues may occur at multiple levels of the auditory system. The major aims of this dissertation are to elucidate the neural mechanisms and time course of concurrent speech perception in clean and in degraded listening conditions and its behavioral correlates. In two complementary experiments, electrical brain activity (EEG) was recorded at cortical (EEG Study #1) and subcortical (FFR Study #2) levels while participants heard double-vowel stimuli whose fundamental frequencies (F0s) differed by zero and four semitones (STs), presented in either clean or noise-degraded (+5 dB SNR) conditions. Behaviorally, listeners were more accurate in identifying both vowels for larger F0 separations (i.e., 4 ST; with pitch cues), and this F0-benefit was more pronounced at more favorable SNRs. Time-frequency analysis of cortical EEG oscillations (i.e., brain rhythms) revealed a dynamic time course for concurrent speech processing that depended on both extrinsic (SNR) and intrinsic (pitch) acoustic factors. Early high-frequency activity reflected pre-perceptual encoding of acoustic features (~200 ms) and the quality (i.e., SNR) of the speech signal (~250-350 ms), whereas later-evolving low-frequency rhythms (~400-500 ms) reflected post-perceptual, cognitive operations that covaried with listening effort and task demands. Analysis of subcortical responses indicated that while FFRs provided a high-fidelity representation of double-vowel stimuli and the spectro-temporal nonlinear properties of the peripheral auditory system, FFR activity largely reflected the neural encoding of stimulus features (exogenous coding) rather than perceptual outcomes, although timbre (F1) cues predicted response speed in noise conditions. Taken together, results of this dissertation suggest that subcortical auditory processing reflects mostly exogenous (acoustic) feature encoding, in stark contrast to cortical activity, which reflects perceptual and cognitive aspects of concurrent speech perception. By studying multiple brain indices underlying an identical task, these studies provide a more comprehensive window into the hierarchy of brain mechanisms and time course of concurrent speech processing.
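    The time-frequency analysis of EEG oscillations described above can be approximated with a sliding-window spectral estimate that tracks power in a chosen frequency band over time. The sketch below is a simplified stand-in for the wavelet or multitaper analyses typically used in such studies; the window length, hop size, and band edges are assumptions:

```python
import numpy as np

def band_power(eeg, fs, fmin, fmax, win=256, hop=128):
    """Time course of spectral power in [fmin, fmax] Hz for one EEG
    channel, via a Hann-windowed sliding FFT.

    Returns one mean band-power value per analysis window.
    """
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    power = []
    for start in range(0, len(eeg) - win + 1, hop):
        frame = eeg[start:start + win] * np.hanning(win)
        spec = np.abs(np.fft.rfft(frame)) ** 2
        power.append(spec[band].mean())
    return np.array(power)
```

    Comparing such band-power time courses across conditions (clean vs. noise, 0 vs. 4 ST) is the kind of contrast that separates early stimulus-driven activity from later low-frequency rhythms tied to listening effort.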

    Neuromorphic model for sound source segregation

    While humans can easily segregate and track a speaker's voice in a loud noisy environment, most modern speech recognition systems still perform poorly in loud background noise. The computational principles behind auditory source segregation in humans are not yet fully understood. In this dissertation, we develop a computational model for source segregation inspired by auditory processing in the brain. To support the key principles behind the computational model, we conduct a series of electroencephalography (EEG) experiments using both simple tone-based stimuli and more natural speech stimuli. Most source segregation algorithms utilize some form of prior information about the target speaker or use more than one simultaneous recording of the noisy speech mixtures. Other methods develop models of the noise characteristics. Source segregation of simultaneous speech mixtures with a single microphone recording and no knowledge of the target speaker is still a challenge. Using the principle of temporal coherence, we develop a novel computational model that exploits the difference in the temporal evolution of features that belong to different sources to perform unsupervised monaural source segregation. While using no prior information about the target speaker, this method can gracefully incorporate knowledge about the target speaker to further enhance the segregation. Through a series of EEG experiments, we collect neurological evidence to support the principle behind the model. Aside from its unusual structure and computational innovations, the proposed model provides testable hypotheses of the physiological mechanisms of the remarkable perceptual ability of humans to segregate acoustic sources, and of its psychophysical manifestations in navigating complex sensory environments. Results from the EEG experiments provide further insights into the assumptions behind the model and motivate future single-unit studies that can provide more direct evidence for the principle of temporal coherence.
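    The temporal-coherence principle can be illustrated in a few lines: feature channels whose slow envelopes rise and fall together are assigned to the same source. The sketch below (the correlation threshold, function name, and anchor-channel scheme are assumptions for illustration, not the dissertation's model) groups channels by envelope correlation with an anchor channel:

```python
import numpy as np

def coherence_mask(envelopes, anchor, threshold=0.7):
    """Group channels temporally coherent with an anchor channel.

    envelopes: (n_channels, n_frames) array of slow channel envelopes.
    anchor: index of a channel assumed to belong to the target source.
    Returns a boolean mask over channels grouped with the anchor.
    """
    ref = envelopes[anchor]
    # Pearson correlation of each channel's envelope with the anchor's.
    corr = np.array([np.corrcoef(ref, e)[0, 1] for e in envelopes])
    return corr >= threshold
```

    Because only the anchor choice encodes any knowledge of the target, the same grouping step runs unsupervised; supplying a speaker-specific anchor is one way prior knowledge could be folded in, as the abstract suggests.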