9 research outputs found

    Visually Guided Sound Source Separation using Cascaded Opponent Filter Network

    The objective of this paper is to recover the original component signals from an audio mixture with the aid of visual cues of the sound sources. This task is usually referred to as visually guided sound source separation. The proposed Cascaded Opponent Filter (COF) framework consists of multiple stages, which recursively refine the source separation. A key element in COF is a novel opponent filter module that identifies and relocates residual components between sources. The system is guided by the appearance and motion of the source, and, for this purpose, we study different representations based on video frames, optical flows, dynamic images, and their combinations. Finally, we propose a Sound Source Location Masking (SSLM) technique, which, together with COF, produces a pixel-level mask of the source location. The entire system is trained end-to-end using a large set of unlabelled videos. We compare COF with recent baselines and obtain state-of-the-art performance on three challenging datasets (MUSIC, A-MUSIC, and A-NATURAL). Project page: https://ly-zhu.github.io/cof-net
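    A minimal PyTorch sketch of the opponent-filter idea described above, under our own assumptions; the class and layer names (OpponentFilter, transfer, feat_dim) are illustrative, not the authors' code. One stage predicts, per time-frequency bin, how much of one source's current estimate belongs to the other and moves that residual across:

        import torch
        import torch.nn as nn

        class OpponentFilter(nn.Module):
            """One refinement stage: relocate residual components between
            two source estimates, conditioned on a visual feature of the
            opponent source (hypothetical sketch, not the published model)."""

            def __init__(self, feat_dim: int):
                super().__init__()
                # Predicts a per-bin transfer mask in [0, 1] from both
                # spectrogram estimates plus the broadcast visual feature.
                self.transfer = nn.Sequential(
                    nn.Conv2d(2 + feat_dim, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
                )

            def forward(self, spec_a, spec_b, vis_b):
                # spec_a, spec_b: (B, 1, F, T) magnitude estimates of the two sources
                # vis_b: (B, feat_dim) visual feature of source B
                v = vis_b[:, :, None, None].expand(-1, -1, spec_a.size(2), spec_a.size(3))
                m = self.transfer(torch.cat([spec_a, spec_b, v], dim=1))
                moved = m * spec_a  # portion of B's signal currently assigned to A
                return spec_a - moved, spec_b + moved

    A cascade would apply such stages recursively, in both directions, so that each pass refines the separation produced by the previous one.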

    Stabilizing and Enhancing Learning for Deep Complex and Real Neural Networks

    This thesis presents a set of original contributions in the form of three chapters on real- and complex-valued deep neural networks. We address both theoretical issues and practical challenges related to the training of both real- and complex-valued neural networks. First, we investigate the design of appropriate building blocks for deep complex-valued neural networks, such as initialization methods, normalization techniques, and elementwise activation functions. We apply our theoretical insights to design building blocks for the construction of deep complex-valued architectures, and we use them to perform various tasks in computer vision, music transcription, speech spectrum prediction, signal retrieval, and audio source separation. We also analyze the usefulness of orthogonality for the hidden transition matrix in a real-valued recurrent neural network. Each of the three chapters is dedicated to methods designed to address sources of training instability, among them the notorious problem of vanishing and exploding gradients during the training of deep neural networks. Throughout this manuscript we show the usefulness of the proposed methods in the context of well-known challenges and clearly identifiable objectives. We provide below a summary of the contributions within each chapter. At present, the vast majority of building blocks, techniques, and architectures for training deep neural networks are based on real-valued computations and representations. However, representations based on complex numbers have started to receive increased attention. Despite their compelling properties, complex-valued deep neural networks have been neglected, due in part to the absence of the building blocks required to design and train this type of network; the lack of such a framework represents a noticeable gap in deep learning tooling.
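    As a concrete illustration of such building blocks, here is a minimal NumPy sketch of two of them: an elementwise activation that rectifies the real and imaginary parts separately, and a Glorot-style initializer that draws complex weights with Rayleigh-distributed magnitudes and uniform phases. Both are simplified assumptions on our part, not code from the thesis:

        import numpy as np

        def crelu(z: np.ndarray) -> np.ndarray:
            """Apply ReLU separately to the real and imaginary parts
            (one common elementwise activation for complex-valued nets)."""
            return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

        def complex_glorot_init(fan_in: int, fan_out: int) -> np.ndarray:
            """Draw complex weights with Rayleigh magnitude and uniform phase.
            With |W| ~ Rayleigh(sigma), E[|W|^2] = 2*sigma^2, so choosing
            sigma = 1/sqrt(fan_in + fan_out) gives Var(W) = 2/(fan_in + fan_out),
            the Glorot criterion (the uniform phase makes E[W] = 0)."""
            rng = np.random.default_rng()
            sigma = 1.0 / np.sqrt(fan_in + fan_out)
            magnitude = rng.rayleigh(scale=sigma, size=(fan_in, fan_out))
            phase = rng.uniform(-np.pi, np.pi, size=(fan_in, fan_out))
            return magnitude * np.exp(1j * phase)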

    NeuroHeed: Neuro-Steered Speaker Extraction using EEG Signals

    Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the brain's elicited neuronal activity, the latter of which can be measured using affordable and non-intrusive electroencephalography (EEG) devices. In this study, we present NeuroHeed, a speaker extraction model that leverages EEG signals to establish a neuronal attractor that is temporally associated with the speech stimulus, facilitating the extraction of the attended speech signal in a cocktail-party scenario. We propose both an offline and an online NeuroHeed, the latter designed for real-time inference. In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor that retains the attentional momentum over time. Online NeuroHeed then extracts the current window of the speech signal with guidance from both attractors. Experimental results demonstrate that NeuroHeed effectively extracts brain-attended speech signals, achieving high signal quality, excellent perceptual quality, and intelligibility in a two-speaker scenario.
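    A hypothetical PyTorch sketch of the online loop described above; module names and dimensions (eeg_enc, spk_enc, 64 EEG channels) are our assumptions rather than the published architecture. An EEG encoder yields the neuronal attractor, an autoregressive encoder over previously extracted speech yields the auditory attractor, and both condition extraction of the current window:

        import torch
        import torch.nn as nn

        class OnlineExtractor(nn.Module):
            """Sketch of online, attractor-guided speaker extraction."""

            def __init__(self, dim: int = 128):
                super().__init__()
                self.eeg_enc = nn.GRU(64, dim, batch_first=True)  # EEG -> neuronal attractor
                self.spk_enc = nn.GRU(1, dim, batch_first=True)   # past speech -> auditory attractor
                self.separator = nn.GRU(1 + 2 * dim, dim, batch_first=True)
                self.mask_head = nn.Linear(dim, 1)

            def forward(self, mix_win, eeg_win, past_speech):
                # mix_win: (B, T, 1) mixture samples for the current window
                # eeg_win: (B, T_eeg, 64) time-aligned EEG channels
                # past_speech: (B, T_past, 1) speech extracted so far (self-enrollment)
                _, h_eeg = self.eeg_enc(eeg_win)               # (1, B, dim)
                _, h_spk = self.spk_enc(past_speech)           # (1, B, dim)
                attractors = torch.cat([h_eeg, h_spk], dim=-1) # (1, B, 2*dim)
                cond = attractors.transpose(0, 1).expand(-1, mix_win.size(1), -1)
                h, _ = self.separator(torch.cat([mix_win, cond], dim=-1))
                return torch.sigmoid(self.mask_head(h)) * mix_win  # masked estimate

    In a streaming setting, past_speech is simply the concatenation of the model's own outputs for earlier windows, which is how the self-enrollment retains the attentional momentum over time.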

    Proceedings of the 19th Sound and Music Computing Conference

    Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France). https://smc22.grame.f