
    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at the delays that correspond to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a `speech fragment decoder', which employs `missing data' techniques with clean speech models to search simultaneously for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments over different conditions, which results in significantly better recognition accuracy.
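
    The correlogram computation at the heart of the first stage is straightforward to illustrate. Below is a minimal sketch (not the authors' implementation) of a per-channel short-time autocorrelation plus the summary peak-picking often used for local pitch estimates; the gammatone filterbank output `channels`, the sample rate `fs`, and the pitch search range are assumed inputs.

```python
import numpy as np

def correlogram(channels, start, max_lag):
    """Short-time autocorrelation of one frame in every frequency channel."""
    acg = np.empty((channels.shape[0], max_lag))
    for c in range(channels.shape[0]):
        seg = channels[c, start : start + 2 * max_lag]
        for lag in range(max_lag):
            acg[c, lag] = np.dot(seg[:max_lag], seg[lag : lag + max_lag])
    return acg

def summary_pitch(acg, fs, fmin=80.0, fmax=400.0):
    """Lag of the summary-correlogram peak inside a plausible pitch range."""
    summary = acg.sum(axis=0)                 # pool evidence across channels
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag bounds for the pitch range
    lag = lo + int(np.argmax(summary[lo:hi]))
    return fs / lag                           # pitch estimate in Hz
```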

    Source separation with one ear : proposition for an anthropomorphic approach

    We present an example of an anthropomorphic approach, in which auditory-based cues are combined with temporal correlation to implement a source separation system. The auditory features are based on spectral amplitude modulation and energy information obtained through 256 cochlear filters. Segmentation and binding of auditory objects are performed with a two-layered spiking neural network. The first layer performs the segmentation of the auditory images into objects, while the second layer binds the auditory objects belonging to the same source. The binding is further used to generate a mask (binary gain) to suppress the undesired sources from the original signal. Results are presented for a double-voiced (2 speakers) speech segment and for sentences corrupted with different noise sources. Comparative results are also given using PESQ (perceptual evaluation of speech quality) scores. The spiking neural network is fully adaptive and unsupervised.
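
    The final masking step is easy to illustrate. Here is a minimal sketch (not the paper's spiking network, which is what produces the mask) of applying a binary time-frequency gain to suppress undesired sources; the 0/1 array `mask`, matching the STFT grid of the mixture, is a hypothetical input.

```python
from scipy.signal import stft, istft

def apply_binary_mask(mixture, mask, fs, nperseg=512):
    """Suppress non-target sources by zeroing masked time-frequency bins."""
    f, t, Z = stft(mixture, fs=fs, nperseg=nperseg)
    Z_masked = Z * mask                # binary gain: 1 keeps a bin, 0 removes it
    _, target = istft(Z_masked, fs=fs, nperseg=nperseg)
    return target[: len(mixture)]
```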

    Contributions of local speech encoding and functional connectivity to audio-visual speech perception

    Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. At high acoustic SNR, speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while at low SNR strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker’s face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.
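
    As an illustration of what quantifying entrainment can look like, here is a minimal sketch (a common approach, not necessarily the authors' MEG pipeline) of speech-brain coherence between the acoustic envelope and one sensor time course; equal-length arrays `speech` and `meg` sampled at rate `fs` are assumed.

```python
import numpy as np
from scipy.signal import coherence, hilbert

def speech_brain_coherence(speech, meg, fs, nperseg=1024):
    """Magnitude-squared coherence between speech envelope and a neural signal."""
    envelope = np.abs(hilbert(speech))        # broadband amplitude envelope
    f, cxy = coherence(envelope, meg, fs=fs, nperseg=nperseg)
    band = (f >= 1) & (f <= 8)                # delta/theta range typical for entrainment
    return f[band], cxy[band]
```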

    Audio-coupled video content understanding of unconstrained video sequences

    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities and environment in a given video clip using both audio and video information. Traditionally, audio and video information has not been applied together for solving such a complex task, and for the first time we propose, develop, implement and test a new framework of multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework for studying the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first one is based on image sequence analysis alone, and uses a range of colour, shape, texture and statistical features from image regions with a trained classifier to recognise the identity of objects, activities and environment present. The second module uses audio information only, and recognises activities and environment. Both of these approaches are preceded by detailed pre-processing to ensure that correct video segments containing both audio and video content are present, and that the developed system can be made robust to changes in camera movement, illumination, random object behaviour, etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification such that difficult classification tasks can be decomposed into simpler and smaller tasks. When combining both modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines the advantages of both feature-level and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. Finally, we propose a decision correction algorithm which shows that combining multi-modal classification information with semantic knowledge generates the best possible results.
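
    For context, the well-established SFFS family mentioned above builds on plain sequential forward selection, sketched minimally below (the thesis's faster PCFS variant is not specified in this summary, so it is not reproduced); `score` is any cross-validated figure of merit defined for feature subsets, including the empty one.

```python
def forward_select(features, score, k):
    """Greedy sequential forward selection; `score` rates a feature subset."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                      # no candidate improves the current subset
        selected.append(best)
        remaining.remove(best)
    return selected
```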

    Neural oscillatory signatures of auditory and audiovisual illusions

    Questions of the relationship between human perception and brain activity can be approached from different perspectives: in the first, the brain is mainly regarded as a recipient and processor of sensory data. The corresponding research objective is to establish mappings of neural activity patterns and external stimuli. Alternatively, the brain can be regarded as a self-organized dynamical system, whose constantly changing state affects how incoming sensory signals are processed and perceived. The research reported in this thesis can chiefly be located in the second framework, and investigates the relationship between oscillatory brain activity and the perception of ambiguous stimuli. Oscillations are here considered as a mechanism for the formation of transient neural assemblies, which allows efficient information transfer. While the relevance of activity in distinct frequency bands for auditory and audiovisual perception is well established, different functional architectures of sensory integration can be derived from the literature. This dissertation therefore aims to further clarify the role of oscillatory activity in the integration of sensory signals towards unified perceptual objects, using illusion paradigms as tools of study. In study 1, we investigate the role of low-frequency power modulations and phase alignment in auditory object formation. We provide evidence that auditory restoration is associated with a power reduction, while the registration of an additional object is reflected by an increase in phase locking. In study 2, we analyze oscillatory power as a predictor of auditory influence on visual perception in the sound-induced flash illusion. We find that increased beta-/gamma-band power over occipitotemporal electrodes shortly before stimulus onset predicts the illusion, suggesting a facilitation of processing in polymodal circuits. In study 3, we address the question of whether visual influence on auditory perception in the ventriloquist illusion is reflected in primary sensory or higher-order areas. We establish an association between reduced theta-band power in mediofrontal areas and the occurrence of the illusion, which indicates a top-down influence on sensory decision-making. These findings broaden our understanding of the functional relevance of neural oscillations by showing that different processing modes, which are reflected in specific spatiotemporal activity patterns, operate in different instances of sensory integration.
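
    A standard way to quantify the phase alignment discussed in study 1 is inter-trial phase coherence; the sketch below (illustrative, not the studies' exact pipeline) computes it in a chosen frequency band from epoched data `trials` of shape (n_trials, n_samples).

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_locking(trials, fs, f_lo, f_hi):
    """Inter-trial phase coherence per sample in the band [f_lo, f_hi] Hz."""
    b, a = butter(4, [f_lo, f_hi], btype="band", fs=fs)
    phases = np.angle(hilbert(filtfilt(b, a, trials, axis=1), axis=1))
    return np.abs(np.mean(np.exp(1j * phases), axis=0))  # 1 = perfect alignment
```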

    The uncoupling limit of identical Hopf bifurcations with an application to perceptual bistability

    We study the dynamics arising when two identical oscillators are coupled near a Hopf bifurcation, where we assume a parameter ϵ uncouples the system at ϵ = 0. Using a normal form for N = 2 identical systems undergoing a Hopf bifurcation, we explore the dynamical properties. Matching the normal form coefficients to a coupled Wilson-Cowan oscillator network gives an understanding of the different types of behaviour that arise in a model of perceptual bistability. Notably, we find bistability between in-phase and anti-phase solutions, demonstrating the feasibility of synchronisation acting as the mechanism by which periodic inputs can be segregated (rather than via strong inhibitory coupling, as in existing models). Using numerical continuation we confirm our theoretical analysis for small coupling strength and explore the bifurcation diagrams for large coupling strength, where the normal form approximation breaks down.
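
    For orientation, a generic truncated normal form for two identical, weakly coupled Hopf oscillators takes the following shape (schematic; the paper's exact coefficients and truncation differ):

```latex
% Schematic truncated normal form for two identical Hopf oscillators with
% weak coupling strength \epsilon (coefficients a, b, c are illustrative):
\begin{align}
  \dot{z}_1 &= z_1\bigl(\lambda + i\omega + a\,|z_1|^2 + b\,|z_2|^2\bigr) + \epsilon c\, z_2, \\
  \dot{z}_2 &= z_2\bigl(\lambda + i\omega + a\,|z_2|^2 + b\,|z_1|^2\bigr) + \epsilon c\, z_1,
\end{align}
% with z_1, z_2 \in \mathbb{C}: setting \epsilon = 0 uncouples the two cells, and
% in-phase (z_1 = z_2) and anti-phase (z_1 = -z_2) periodic solutions can coexist.
```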

    Neural signatures of the processing of temporal patterns in sound

    The ability to detect regularities in sound (i.e., recurring structure) is critical for effective perception, enabling, for example, change detection and prediction. Two seemingly unconnected lines of research concern the neural operations involved in processing regularities: one investigates how neural activity synchronizes with temporal regularities (e.g., frequency modulation; FM) in sounds, whereas the other focuses on increases in sustained activity during stimulation with repeating tone-frequency patterns. In three electroencephalography studies with male and female human participants, we investigated whether neural synchronization and sustained neural activity are dissociable, or whether they are functionally interdependent. Experiment I demonstrated that neural activity synchronizes with temporal regularity (FM) in sounds, and that sustained activity increases concomitantly. In Experiment II, phase coherence of FM in sounds was parametrically varied. Although neural synchronization was more sensitive to changes in FM coherence, such changes led to a systematic modulation of both neural synchronization and sustained activity, with magnitude increasing as coherence increased. In Experiment III, participants either performed a duration categorization task on the sounds, or a visual object tracking task to distract attention. Neural synchronization was observed regardless of task, whereas the sustained response was observed only when attention was on the auditory task, not under (visual) distraction. The results suggest that neural synchronization and sustained activity levels are functionally linked: both are sensitive to regularities in sounds. However, neural synchronization might reflect a more sensory-driven response to regularity, compared with sustained activity, which may be influenced by attentional, contextual, or other experiential factors.
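
    The two signatures can be read off a trial-averaged response in a simple way; the sketch below (illustrative, not the studies' pipeline) takes synchronization as the spectral amplitude at the FM rate and sustained activity as the mean baseline-corrected amplitude, with `erp`, `fs`, `fm_rate` and the baseline boundary `n_baseline` as assumed inputs.

```python
import numpy as np

def eeg_signatures(erp, fs, fm_rate, n_baseline):
    """Synchronization (amplitude at the FM rate) and sustained response."""
    stim = erp[n_baseline:]                            # post-onset samples
    spectrum = np.abs(np.fft.rfft(stim)) / len(stim)
    freqs = np.fft.rfftfreq(len(stim), d=1.0 / fs)
    sync = spectrum[np.argmin(np.abs(freqs - fm_rate))]
    sustained = stim.mean() - erp[:n_baseline].mean()  # baseline-corrected DC shift
    return sync, sustained
```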

    Sound Source Separation

    This is the author's accepted pre-print of the article, first published as: G. Evangelista, S. Marchand, M. D. Plumbley and E. Vincent. Sound source separation. In U. Zölzer (ed.), DAFX: Digital Audio Effects, 2nd edition, Chapter 14, pp. 551-588. John Wiley & Sons, March 2011. ISBN 9781119991298. DOI: 10.1002/9781119991298.ch14.

    Change blindness: eradication of gestalt strategies

    Arrays of eight texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task where there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval, and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al., 2003, Vision Research 43, 149-164). Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest (i) that Gestalt grouping is not used as a strategy in these tasks, and (ii) gives further weight to the argument that objects may be stored in and retrieved from a pre-attentional store during this task.
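
    The spoke manipulation is compact enough to sketch; the helper below (hypothetical, not the study's code) shifts a rectangle's position radially by ±1 degree of eccentricity while preserving its polar angle about fixation.

```python
import numpy as np

def shift_along_spoke(x, y, delta_deg=1.0):
    """Move a point radially by +/- delta_deg; x, y in degrees from fixation."""
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    r_new = r + np.random.choice([-delta_deg, delta_deg])
    return r_new * np.cos(theta), r_new * np.sin(theta)
```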