15 research outputs found
Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
We tackle the problem of audiovisual scene analysis for weakly-labeled data.
To this end, we build upon our previous audiovisual representation learning
framework to perform object classification in noisy acoustic environments and
integrate audio source enhancement capability. This is made possible by a novel
use of non-negative matrix factorization for the audio modality. Our approach
is founded on the multiple instance learning paradigm. Its effectiveness is
established through experiments over a challenging dataset of music instrument
performance videos. We also show encouraging visual object localization
results
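As an illustration of the NMF-based audio enhancement mentioned above, the sketch below (not the paper's implementation; the component selection and dimensions are placeholders) decomposes a magnitude spectrogram with scikit-learn's NMF and reconstructs a target source from a chosen subset of components via Wiener-like soft masking.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_enhance(mag_spec, n_components=16, keep=None, eps=1e-8):
    """mag_spec: (freq, time) non-negative magnitude spectrogram."""
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = model.fit_transform(mag_spec)      # (freq, n_components) spectral patterns
    H = model.components_                  # (n_components, time) activations
    # Placeholder: in the weakly supervised setting, the kept components would be
    # those attributed to the target object, not an arbitrary fixed subset.
    keep = keep if keep is not None else list(range(n_components // 2))
    target = W[:, keep] @ H[keep, :]       # partial reconstruction of the target source
    mask = target / (W @ H + eps)          # Wiener-like soft mask
    return mask * mag_spec                 # enhanced magnitude spectrogram

# Example: enhanced = nmf_enhance(np.abs(stft_matrix), keep=[0, 3, 5])
```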
Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
We tackle the task of environmental event classification by drawing
inspiration from the transformer neural network architecture used in machine
translation. We modify this attention-based feedforward structure so that the
resulting model can use audio as well as video to compute sound
event predictions. We perform extensive experiments with these adapted
transformers on an audiovisual data set, obtained by appending relevant visual
information to an existing large-scale weakly labeled audio collection. The
employed multi-label data contains clip-level annotation indicating the
presence or absence of 17 classes of environmental sounds, and does not include
temporal information. We show that the proposed modified transformers strongly
improve upon previously introduced models and in fact achieve state-of-the-art
results. We also make a compelling case for devoting more attention to research
in multimodal audiovisual classification by proving the usefulness of visual
information for the task at hand, namely audio event recognition. In addition,
we visualize internal attention patterns of the audiovisual transformers and in
doing so demonstrate their potential for performing multimodal synchronization.
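As a rough illustration only (the paper's exact architecture, dimensions and fusion strategy are not specified here), the PyTorch sketch below projects audio and video frame embeddings, passes them jointly through a transformer encoder, and produces clip-level logits for 17 sound classes, to be trained with a multi-label loss such as BCEWithLogitsLoss.

```python
import torch
import torch.nn as nn

class AVTransformerClassifier(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, d_model=256,
                 n_heads=4, n_layers=2, n_classes=17):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # project audio frame features
        self.video_proj = nn.Linear(video_dim, d_model)   # project video frame features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)         # clip-level class logits

    def forward(self, audio, video):
        # audio: (batch, T_audio, audio_dim), video: (batch, T_video, video_dim)
        tokens = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=1)
        encoded = self.encoder(tokens)        # joint self-attention across both modalities
        return self.head(encoded.mean(dim=1)) # pooled multi-label logits

# Example: logits = AVTransformerClassifier()(torch.randn(2, 100, 128), torch.randn(2, 10, 512))
```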
Representation learning for robust audio-visual scene analysis (Apprentissage de représentations pour l'analyse robuste de scènes audiovisuelles)
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence. For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches. To identify and extract many commonly encountered objects, we leverage appearance-audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust audio-visual event classification, visual object localization, audio event detection and source separation. We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
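To make the motion-audio correlation idea concrete, here is a hedged numpy/scikit-learn sketch (not the thesis code; the feature choices and scoring rule are illustrative): NMF activations are regressed from per-frame motion features, and components whose activations are well predicted by the visible motion are attributed to the on-screen source.

```python
import numpy as np
from sklearn.linear_model import Ridge

def motion_component_scores(H_audio, motion_feats, alpha=1.0):
    """H_audio: (n_components, T) NMF activations; motion_feats: (T, d) per-frame motion features."""
    scores = []
    for h in H_audio:                                  # one temporal activation per NMF component
        reg = Ridge(alpha=alpha).fit(motion_feats, h)  # cross-modal regression: motion -> activation
        h_pred = reg.predict(motion_feats)
        scores.append(np.corrcoef(h, h_pred)[0, 1])    # agreement between audio and motion
    return np.array(scores)                            # higher score = more motion-correlated component

# Components with high scores can then be grouped and reconstructed as the visually associated source.
```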
Improving audio retrieval through loudness profile categorization
Paper presented at the 2016 IEEE International Symposium on Multimedia, held 11-13 December 2016 in San José, California. The increasing popularity of audio content sharing on online platforms requires the development of techniques to better organize and retrieve this data. In this paper we look at how to improve similarity search through content categorization in the context of Freesound, a popular online sound sharing site. We focus on organization based on morphological description. In particular, we propose to improve search results by incorporating information about the query sound's loudness profile. This is performed within a thresholding-based framework and can be generalized to structure information about the temporal evolution of other sound attributes. We perform a subjective evaluation to demonstrate the practical relevance of our method.
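The following is a minimal sketch of a thresholding-based loudness-profile categorization of the kind described above; the frame sizes, dB thresholds and category names are illustrative assumptions, not Freesound's actual taxonomy.

```python
import numpy as np

def loudness_profile(samples, frame=2048, hop=1024, rise_db=6.0, fall_db=-6.0):
    """Return a coarse category for the temporal loudness evolution of a sound.

    Assumes a mono clip longer than one analysis frame.
    """
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) + 1e-12 for f in frames])
    db = 20.0 * np.log10(rms)                # frame-wise loudness in dB
    q = max(1, len(db) // 4)
    trend = db[-q:].mean() - db[:q].mean()   # end-of-clip vs start-of-clip loudness
    if trend > rise_db:
        return "increasing"
    if trend < fall_db:
        return "decreasing"
    return "stable"

# Query and candidate sounds sharing the same profile category can then be ranked
# higher during similarity search.
```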
Continuous emotion transfer using kernels
Style transfer is a central problem of machine learning with numerous successful applications. In this work, we present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces. We consider style transfer as a functional output regression task where the goal is to transform the input objects to a continuum of styles. The learnt mapping is governed by the choice of two kernels, one on the object space and one on the style space, providing flexibility to the approach. We instantiate the idea in emotion transfer, where facial landmarks play the role of objects and styles correspond to emotions. The proposed approach provides a principled way to gain explicit control over the continuous style space, allowing landmarks to be transformed to emotions not seen during the training phase. We demonstrate the efficiency of the technique on popular facial emotion benchmarks, achieving low reconstruction cost.
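The two-kernel idea can be illustrated with a small numpy sketch (an illustration of the principle, not the paper's infinite-task-learning solver): a product of an object kernel on input landmarks and a style kernel on a continuous emotion parameter is used in kernel ridge regression, so the trained model can be queried at style values never seen during training.

```python
import numpy as np

def rbf(A, B, gamma):
    """Gaussian kernel matrix between row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_predict(X_tr, S_tr, Y_tr, X_te, S_te, gx=1.0, gs=10.0, lam=1e-3):
    """X: (n, d) input landmarks, S: (n, 1) emotion intensity, Y: (n, d) target landmarks."""
    K = rbf(X_tr, X_tr, gx) * rbf(S_tr, S_tr, gs)     # product (joint) kernel on object x style
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), Y_tr)
    K_te = rbf(X_te, X_tr, gx) * rbf(S_te, S_tr, gs)  # cross-kernel for test queries
    return K_te @ alpha                               # predicted landmarks at requested styles

# Querying the same face X_te at a grid of S_te values yields a continuum of emotion intensities.
```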
Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF
This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network’s decision. We demonstrate our method’s applicability on popular benchmarks, including a real-world multi-label classification task.
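A minimal sketch of the final interpretation step, as it can be inferred from the description above (the interpreter network itself is omitted and the soft-masking formula is an assumption): given the pre-learnt NMF dictionary W and the activations the interpreter deems relevant, reconstruct and enhance the corresponding part of the input signal.

```python
import numpy as np

def interpretation_audio(mag_spec, W, H_relevant, eps=1e-8):
    """mag_spec: (freq, T) input magnitude; W: (freq, K) NMF dictionary; H_relevant: (K, T)."""
    relevant = W @ H_relevant                            # spectrogram of the relevant components
    mask = np.minimum(relevant / (mag_spec + eps), 1.0)  # soft mask clipped to [0, 1]
    return mask * mag_spec  # enhanced magnitude; combine with the input phase to listen

# H_relevant would come from the trained interpreter module, restricted to the NMF
# components it associates with the class decision being explained.
```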