Visually Guided Sound Source Separation using Cascaded Opponent Filter Network
The objective of this paper is to recover the original component signals from
an audio mixture with the aid of visual cues from the sound sources. This task
is usually referred to as visually guided sound source separation. The proposed
Cascaded Opponent Filter (COF) framework consists of multiple stages, which
recursively refine the source separation. A key element in COF is a novel
opponent filter module that identifies and relocates residual components
between sources. The system is guided by the appearance and motion of the
source, and, for this purpose, we study different representations based on
video frames, optical flows, dynamic images, and their combinations. Finally,
we propose a Sound Source Location Masking (SSLM) technique, which, together
with COF, produces a pixel-level mask of the source location. The entire system
is trained end-to-end using a large set of unlabelled videos. We compare COF
with recent baselines and achieve state-of-the-art performance on three
challenging datasets (MUSIC, A-MUSIC, and A-NATURAL). Project page:
https://ly-zhu.github.io/cof-net
Comment: main paper 14 pages, ref 3 pages, and supp 7 pages. Revised argument in section 3 and
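The actual opponent filter in COF is a learned network module operating over multiple cascaded stages; the toy function below is only an illustrative sketch of the core idea of relocating residual components between two source estimates. All names, and the idea of using a precomputed gate in place of a learned filter output, are our assumptions, not the paper's formulation.

```python
import numpy as np

def opponent_filter(mask_a, mask_b, residual_gate):
    """Toy opponent-filter step: move a gated fraction of each source's
    separation mask to the other source, so components that leaked into
    the wrong estimate are relocated.

    mask_a, mask_b: (F, T) soft spectrogram masks for the two sources.
    residual_gate: (F, T) values in [0, 1]; 1 means "this time-frequency
    bin belongs to the opposite source". In COF this role is played by a
    learned module, not a fixed array.
    """
    transfer_a = mask_a * residual_gate   # part of A's mask handed to B
    transfer_b = mask_b * residual_gate   # part of B's mask handed to A
    refined_a = mask_a - transfer_a + transfer_b
    refined_b = mask_b - transfer_b + transfer_a
    return refined_a, refined_b

# Two random soft masks that sum to 1 in every time-frequency bin.
rng = np.random.default_rng(0)
mask_a = rng.uniform(size=(4, 5))
mask_b = 1.0 - mask_a
gate = rng.uniform(size=(4, 5))

ref_a, ref_b = opponent_filter(mask_a, mask_b, gate)
# The exchange only relocates mass: total mask energy per bin is conserved.
assert np.allclose(ref_a + ref_b, mask_a + mask_b)
```

Cascading such steps, as the abstract describes, lets each stage correct residual leakage left by the previous one.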
Stabilizing and Enhancing Learning for Deep Complex and Real Neural Networks
In this thesis we propose a set of original contributions in the form of three articles on real- and complex-valued deep neural networks. We address both theoretical and practical problems related to their training. The three articles deal with methods designed to address the instability observed during network training, notably the notorious problem of vanishing and exploding gradients when training deep neural networks. We first propose the design of appropriate training modules, referred to as building blocks, for complex-valued deep neural networks. Our proposal includes initialization and normalization methods as well as activation functions for the neural units. The designed modules are then used to specify complex-valued deep architectures dedicated to various tasks, including computer vision, music transcription, speech spectrum prediction, signal retrieval, and audio source separation. Finally, we carry out a detailed analysis of the usefulness of the restrictive orthogonality assumption commonly adopted to parameterize the transition matrix across the layers of real-valued recurrent neural networks.
ABSTRACT: This thesis presents a set of original contributions in the form of three chapters on real and complex-valued deep neural networks. We address both theoretical issues and practical
challenges related to the training of both real and complex-valued neural networks. First, we investigate the design of appropriate building blocks for deep complex-valued neural networks, such as initialization methods, normalization techniques, and elementwise activation functions. We apply our theoretical insights to design building blocks for the construction of deep complex-valued architectures. We use them to perform various tasks in computer vision, music transcription, speech spectrum prediction, signal retrieval, and audio source separation. We also analyze the usefulness of orthogonality for the hidden transition matrix in a real-valued recurrent neural network. Each of the three chapters is dedicated to methods designed to provide solutions to problems causing training
instability, among them the notorious problem of vanishing and exploding gradients during the training of deep neural networks. Throughout this manuscript we show the usefulness of the proposed methods in the context of well-known challenges and clearly identifiable objectives. We provide below a summary of the contributions within each chapter.
At present, the vast majority of building blocks, techniques, and architectures for training deep neural networks are based on real-valued computations and representations. However, representations based on complex numbers have started to receive increased attention. Despite
their compelling properties, complex-valued deep neural networks have been neglected, due in part to the absence of the building blocks required to design and train this type of network. The lack of such a framework represents a noticeable gap in deep learning tooling.
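Two elementwise complex activation functions commonly discussed in this line of work, CReLU (applying ReLU independently to real and imaginary parts) and modReLU (shrinking the magnitude while preserving the phase), can be sketched in NumPy. This is a minimal illustration; the thesis's actual formulations and trainable parameters are not reproduced here, and `b` below would be learned in practice.

```python
import numpy as np

def crelu(z):
    """CReLU: apply ReLU separately to the real and imaginary parts."""
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

def modrelu(z, b):
    """modReLU: shrink the magnitude by a bias b but keep the phase.
    If |z| + b <= 0 the unit is silenced entirely."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / (mag + 1e-9)
    return scale * z

z = np.array([1 + 2j, -1 + 0.5j, -0.3 - 0.7j])
assert np.allclose(crelu(z), [1 + 2j, 0.5j, 0])
# modReLU preserves the phase of every surviving element.
out = modrelu(z, -0.1)
assert np.allclose(np.angle(out), np.angle(z))
```

The key design question these blocks answer is how to define a nonlinearity that respects the algebraic structure of complex numbers rather than treating real and imaginary parts as unrelated channels.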
NeuroHeed: Neuro-Steered Speaker Extraction using EEG Signals
Humans possess the remarkable ability to selectively attend to a single
speaker amidst competing voices and background noise, known as selective
auditory attention. Recent studies in auditory neuroscience indicate a strong
correlation between the attended speech signal and the brain's elicited
neuronal activities, the latter of which can be measured using affordable
and non-intrusive electroencephalography (EEG) devices. In this study, we
present NeuroHeed, a speaker extraction model that leverages EEG signals to
establish a neuronal attractor which is temporally associated with the speech
stimulus, facilitating the extraction of the attended speech signal in a
cocktail party scenario. We propose both an offline and an online NeuroHeed,
with the latter designed for real-time inference. In the online NeuroHeed, we
additionally propose an autoregressive speaker encoder, which accumulates past
extracted speech signals for self-enrollment of the attended speaker
information into an auditory attractor that retains the attentional momentum
over time. Online NeuroHeed extracts the current window of the speech signal
with guidance from both attractors. Experimental results demonstrate that
NeuroHeed effectively extracts brain-attended speech signals, achieving high
signal quality, excellent perceptual quality, and intelligibility in a
two-speaker scenario.
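The self-enrollment idea can be sketched as a running estimate of the attended speaker's embedding, where past extracted windows keep contributing so attention carries over between windows. The exponential-moving-average formulation, the `momentum` parameter, and the stand-in encoder below are our assumptions for illustration, not NeuroHeed's actual architecture.

```python
import numpy as np

def update_attractor(attractor, new_embedding, momentum=0.9):
    """Accumulate speaker embeddings of the windows extracted so far
    into an auditory attractor, so attention to the enrolled speaker
    ("attentional momentum") carries over between windows."""
    if attractor is None:
        return new_embedding
    return momentum * attractor + (1.0 - momentum) * new_embedding

def embed(window):
    # Stand-in for a learned speaker encoder: mean feature vector.
    return window.mean(axis=0)

rng = np.random.default_rng(1)
attractor = None
for _ in range(5):  # five successive windows of extracted speech
    window = rng.normal(size=(10, 8))  # 10 frames x 8 features
    attractor = update_attractor(attractor, embed(window))

assert attractor.shape == (8,)
```

In the online system described above, this speech-derived attractor is used alongside the EEG-derived neuronal attractor to guide extraction of the current window.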
Proceedings of the 19th Sound and Music Computing Conference
Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France).
https://smc22.grame.f