15 research outputs found
Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
We tackle the problem of audiovisual scene analysis for weakly-labeled data.
To this end, we build upon our previous audiovisual representation learning
framework to perform object classification in noisy acoustic environments and
integrate audio source enhancement capability. This is made possible by a novel
use of non-negative matrix factorization for the audio modality. Our approach
is founded on the multiple instance learning paradigm. Its effectiveness is
established through experiments over a challenging dataset of music instrument
performance videos. We also show encouraging visual object localization
results
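As an illustration of the NMF-based audio enhancement mentioned above, the sketch below (not the paper's implementation; the component selection and dimensions are placeholders) decomposes a magnitude spectrogram with scikit-learn's NMF and reconstructs a target source from a chosen subset of components via Wiener-like soft masking.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_enhance(mag_spec, n_components=16, keep=None, eps=1e-8):
    """mag_spec: (freq, time) non-negative magnitude spectrogram."""
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = model.fit_transform(mag_spec)      # (freq, n_components) spectral patterns
    H = model.components_                  # (n_components, time) activations
    # Placeholder: in the weakly supervised setting, the kept components would be
    # those attributed to the target object, not an arbitrary fixed subset.
    keep = keep if keep is not None else list(range(n_components // 2))
    target = W[:, keep] @ H[keep, :]       # partial reconstruction of the target source
    mask = target / (W @ H + eps)          # Wiener-like soft mask
    return mask * mag_spec                 # enhanced magnitude spectrogram

# Example: enhanced = nmf_enhance(np.abs(stft_matrix), keep=[0, 3, 5])
```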
Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
We tackle the task of environmental event classification by drawing
inspiration from the transformer neural network architecture used in machine
translation. We modify this attention-based feedforward structure so that the
resulting model can use audio as well as video to compute sound
event predictions. We perform extensive experiments with these adapted
transformers on an audiovisual data set, obtained by appending relevant visual
information to an existing large-scale weakly labeled audio collection. The
employed multi-label data contains clip-level annotation indicating the
presence or absence of 17 classes of environmental sounds, and does not include
temporal information. We show that the proposed modified transformers strongly
improve upon previously introduced models and in fact achieve state-of-the-art
results. We also make a compelling case for devoting more attention to research
in multimodal audiovisual classification by proving the usefulness of visual
information for the task at hand, namely audio event recognition. In addition,
we visualize internal attention patterns of the audiovisual transformers and in
doing so demonstrate their potential for performing multimodal synchronization.
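As a rough illustration only (the paper's exact architecture, dimensions and fusion strategy are not specified here), the PyTorch sketch below projects audio and video frame embeddings, passes them jointly through a transformer encoder, and produces clip-level logits for 17 sound classes, to be trained with a multi-label loss such as BCEWithLogitsLoss.

```python
import torch
import torch.nn as nn

class AVTransformerClassifier(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, d_model=256,
                 n_heads=4, n_layers=2, n_classes=17):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # project audio frame features
        self.video_proj = nn.Linear(video_dim, d_model)   # project video frame features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)         # clip-level class logits

    def forward(self, audio, video):
        # audio: (batch, T_audio, audio_dim), video: (batch, T_video, video_dim)
        tokens = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=1)
        encoded = self.encoder(tokens)        # joint self-attention across both modalities
        return self.head(encoded.mean(dim=1)) # pooled multi-label logits

# Example: logits = AVTransformerClassifier()(torch.randn(2, 100, 128), torch.randn(2, 10, 512))
```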
Representation learning for robust audio-visual scene analysis (Apprentissage de représentations pour l'analyse robuste de scènes audiovisuelles)
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence. For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches. To identify and extract many commonly encountered objects, we leverage appearance-audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust audio-visual event classification, visual object localization, audio event detection and source separation. We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
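To make the motion-audio correlation idea concrete, here is a hedged numpy/scikit-learn sketch (not the thesis code; the feature choices and scoring rule are illustrative): NMF activations are regressed from per-frame motion features, and components whose activations are well predicted by the visible motion are attributed to the on-screen source.

```python
import numpy as np
from sklearn.linear_model import Ridge

def motion_component_scores(H_audio, motion_feats, alpha=1.0):
    """H_audio: (n_components, T) NMF activations; motion_feats: (T, d) per-frame motion features."""
    scores = []
    for h in H_audio:                                  # one temporal activation per NMF component
        reg = Ridge(alpha=alpha).fit(motion_feats, h)  # cross-modal regression: motion -> activation
        h_pred = reg.predict(motion_feats)
        scores.append(np.corrcoef(h, h_pred)[0, 1])    # agreement between audio and motion
    return np.array(scores)                            # higher score = more motion-correlated component

# Components with high scores can then be grouped and reconstructed as the visually associated source.
```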
Improving audio retrieval through loudness profile categorization
Paper presented at the 2016 IEEE International Symposium on Multimedia, held 11-13 December 2016 in San José, California. The increasing popularity of audio content sharing on online platforms requires the development of techniques to better organize and retrieve this data. In this paper we look at how to improve similarity search through content categorization in the context of Freesound, a popular online sound sharing site. We focus on organization based on morphological description. In particular, we propose to improve search results by incorporating information about the query sound's loudness profile. This is performed within a thresholding-based framework and can be generalized to structure information about the temporal evolution of other sound attributes. We perform a subjective evaluation to demonstrate the practical relevance of our method.
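The following is a minimal sketch of a thresholding-based loudness-profile categorization of the kind described above; the frame sizes, dB thresholds and category names are illustrative assumptions, not Freesound's actual taxonomy.

```python
import numpy as np

def loudness_profile(samples, frame=2048, hop=1024, rise_db=6.0, fall_db=-6.0):
    """Return a coarse category for the temporal loudness evolution of a sound.

    Assumes a mono clip longer than one analysis frame.
    """
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) + 1e-12 for f in frames])
    db = 20.0 * np.log10(rms)                # frame-wise loudness in dB
    q = max(1, len(db) // 4)
    trend = db[-q:].mean() - db[:q].mean()   # end-of-clip vs start-of-clip loudness
    if trend > rise_db:
        return "increasing"
    if trend < fall_db:
        return "decreasing"
    return "stable"

# Query and candidate sounds sharing the same profile category can then be ranked
# higher during similarity search.
```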
Continuous emotion transfer using kernels
Style transfer is a central problem of machine learning with numerous successful applications. In this work, we present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces. We consider style transfer as a functional output regression task where the goal is to transform the input objects to a continuum of styles. The learnt mapping is governed by the choice of two kernels, one on the object space and one on the style space, providing flexibility to the approach. We instantiate the idea in emotion transfer, where facial landmarks play the role of objects and styles correspond to emotions. The proposed approach provides a principled way to gain explicit control over the continuous style space, allowing landmarks to be transformed to emotions not seen during the training phase. We demonstrate the efficiency of the technique on popular facial emotion benchmarks, achieving low reconstruction cost.
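The two-kernel idea can be illustrated with a small numpy sketch (an illustration of the principle, not the paper's infinite-task-learning solver): a product of an object kernel on input landmarks and a style kernel on a continuous emotion parameter is used in kernel ridge regression, so the trained model can be queried at style values never seen during training.

```python
import numpy as np

def rbf(A, B, gamma):
    """Gaussian kernel matrix between row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_predict(X_tr, S_tr, Y_tr, X_te, S_te, gx=1.0, gs=10.0, lam=1e-3):
    """X: (n, d) input landmarks, S: (n, 1) emotion intensity, Y: (n, d) target landmarks."""
    K = rbf(X_tr, X_tr, gx) * rbf(S_tr, S_tr, gs)     # product (joint) kernel on object x style
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), Y_tr)
    K_te = rbf(X_te, X_tr, gx) * rbf(S_te, S_tr, gs)  # cross-kernel for test queries
    return K_te @ alpha                               # predicted landmarks at requested styles

# Querying the same face X_te at a grid of S_te values yields a continuum of emotion intensities.
```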
Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF
This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network’s decision. We demonstrate our method’s applicability on popular benchmarks, including a real-world multi-label classification task.
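A minimal sketch of the final interpretation step, as it can be inferred from the description above (the interpreter network itself is omitted and the soft-masking formula is an assumption): given the pre-learnt NMF dictionary W and the activations the interpreter deems relevant, reconstruct and enhance the corresponding part of the input signal.

```python
import numpy as np

def interpretation_audio(mag_spec, W, H_relevant, eps=1e-8):
    """mag_spec: (freq, T) input magnitude; W: (freq, K) NMF dictionary; H_relevant: (K, T)."""
    relevant = W @ H_relevant                            # spectrogram of the relevant components
    mask = np.minimum(relevant / (mag_spec + eps), 1.0)  # soft mask clipped to [0, 1]
    return mask * mag_spec  # enhanced magnitude; combine with the input phase to listen

# H_relevant would come from the trained interpreter module, restricted to the NMF
# components it associates with the class decision being explained.
```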