Learning to Separate Object Sounds by Watching Unlabeled Video
Perceiving a scene most fully requires all the senses. Yet modeling how
objects look and sound is challenging: most natural scenes and events contain
multiple objects, and the audio track mixes all the sound sources together. We
propose to learn audio-visual object models from unlabeled video, then exploit
the visual context to perform audio source separation in novel videos. Our
approach relies on a deep multi-instance multi-label learning framework to
disentangle the audio frequency bases that map to individual visual objects,
even without observing/hearing those objects in isolation. We show how the
recovered disentangled bases can be used to guide audio source separation to
obtain better-separated, object-level sounds. Our work is the first to learn
audio source separation from large-scale "in the wild" videos containing
multiple audio sources per video. We obtain state-of-the-art results on
visually-aided audio source separation and audio denoising. Our video results:
http://vision.cs.utexas.edu/projects/separating_object_sounds/
Comment: Published in ECCV 2018; Project Page: http://vision.cs.utexas.edu/projects/separating_object_sounds
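To make the basis-disentanglement idea concrete, here is a minimal sketch (not the paper's pipeline): non-negative matrix factorization decomposes a mixture spectrogram into frequency bases, and a hand-picked subset of bases, standing in for one visual object's learned assignment, drives a soft separation mask. The spectrogram is random stand-in data and the basis indices are hypothetical.
    # Minimal sketch: NMF bases of a mixture spectrogram, then a soft mask
    # built from the subset of bases assumed to belong to one visual object.
    # In the paper the basis-to-object assignment comes from a MIML network;
    # here it is given by hand, and the spectrogram is random stand-in data.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = rng.random((513, 200))              # |STFT| of the mixture (freq x time)

    nmf = NMF(n_components=16, init="random", max_iter=400, random_state=0)
    W = nmf.fit_transform(V)                # frequency bases, shape (513, 16)
    H = nmf.components_                     # activations, shape (16, 200)

    guitar_bases = [0, 3, 7]                # hypothetical: bases mapped to "guitar"
    V_guitar = W[:, guitar_bases] @ H[guitar_bases, :]
    mask = V_guitar / (W @ H + 1e-8)        # soft ratio mask for that object's sound
    separated = mask * V                    # apply the mask to the mixture magnitude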
The Sound of Motions
Sounds originate from object motions and vibrations of surrounding air.
Inspired by the fact that humans are capable of interpreting sound sources from
how objects move visually, we propose a novel system that explicitly captures
such motion cues for the task of sound localization and separation. Our system
is composed of an end-to-end learnable model called Deep Dense Trajectory
(DDT), and a curriculum learning scheme. It exploits the inherent coherence of audio-visual signals in large quantities of unlabeled videos. Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, our motion-based system improves performance in
separating musical instrument sounds. Furthermore, it separates sound
components from duets of the same category of instruments, a challenging
problem that has not been addressed before.
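As a rough illustration of the motion cue, the sketch below accumulates dense optical flow into per-pixel trajectories; OpenCV's Farneback flow is only a hand-crafted stand-in for the paper's learned Deep Dense Trajectory module, and the frames are random placeholders.
    # Minimal sketch of the motion cue: dense optical flow accumulated into
    # per-pixel trajectories (simplified: flow is accumulated at fixed grid
    # locations rather than resampled along each track).
    import numpy as np
    import cv2

    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 255, (120, 160), dtype=np.uint8) for _ in range(8)]

    h, w = frames[0].shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    traj = [np.stack([xs, ys], axis=-1)]    # trajectory start: every pixel location
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        traj.append(traj[-1] + flow)        # advect points by the estimated flow

    motion_features = np.stack(traj, axis=0)  # (T, H, W, 2) trajectory positions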
Learning to Detect and Retrieve Objects from Unlabeled Videos
Learning an object detector or a retrieval model requires a large data set with manual annotations. Such data sets are expensive and time-consuming to create
and therefore difficult to obtain on a large scale. In this work, we propose to
exploit the natural correlation in narrations and the visual presence of
objects in video, to learn an object detector and retrieval without any manual
labeling involved. We pose the problem as weakly supervised learning with noisy
labels, and propose a novel object detection paradigm under these constraints.
We handle the background rejection by using contrastive samples and confront
the high level of label noise with a new clustering score. Our evaluation is
based on a set of 11 manually annotated objects in over 5000 frames. We show
a comparison to a weakly-supervised approach as a baseline and provide a strongly labeled upper bound.
Comment: ICCV 2019 Workshop on Multi-modal Video Analysis and Moments in Time Challenge
VisualEchoes: Spatial Image Representation Learning through Echolocation
Several animal species (e.g., bats, dolphins, and whales) and even visually
impaired humans have the remarkable ability to perform echolocation: a
biological sonar used to perceive spatial layout and locate objects in the
world. We explore the spatial cues contained in echoes and how they can benefit
vision tasks that require spatial reasoning. First we capture echo responses in
photo-realistic 3D indoor scene environments. Then we propose a novel
interaction-based representation learning framework that learns useful visual
features via echolocation. We show that the learned image features are useful
for multiple downstream vision tasks requiring spatial reasoning---monocular
depth estimation, surface normal estimation, and visual navigation---with
results comparable to or even better than heavily supervised pre-training. Our
work opens a new path for representation learning for embodied agents, where
supervision comes from interacting with the physical world.
Comment: Appears in ECCV 2020
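A minimal sketch of the echolocation pretext task, under the assumption that echo responses are available as regression targets: a toy image encoder is trained to predict the received echo, which forces its features to encode spatial layout. The encoder, feature sizes, and random tensors are illustrative, not the paper's architecture.
    # Pretext task sketch: regress echo features from the RGB image so that
    # the visual encoder must pick up spatial layout. All shapes are toy sizes.
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )
    echo_head = nn.Linear(64, 2 * 257)      # predict a flattened binaural echo feature

    image = torch.rand(8, 3, 128, 128)      # batch of RGB views of the scene
    echo = torch.rand(8, 2 * 257)           # stand-in for measured echo spectra

    pred = echo_head(encoder(image))
    loss = nn.functional.mse_loss(pred, echo)
    loss.backward()                         # encoder features now carry spatial cues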
Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
We tackle the problem of audiovisual scene analysis for weakly-labeled data.
To this end, we build upon our previous audiovisual representation learning
framework to perform object classification in noisy acoustic environments and
integrate audio source enhancement capability. This is made possible by a novel
use of non-negative matrix factorization for the audio modality. Our approach
is founded on the multiple instance learning paradigm. Its effectiveness is
established through experiments over a challenging dataset of music instrument
performance videos. We also show encouraging visual object localization
results.
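The multiple instance learning paradigm mentioned above can be sketched as follows: each video is a bag of proposals with only a video-level label, and instance scores are pooled into a bag-level prediction. The max pooling, scorer, and random features are illustrative assumptions.
    # MIL sketch: score each proposal, pool over the bag, supervise with the
    # video-level (weak) label only. Shapes and features are stand-ins.
    import torch
    import torch.nn as nn

    num_classes = 10
    scorer = nn.Linear(512, num_classes)     # scores each proposal independently

    proposals = torch.rand(4, 20, 512)       # 4 videos x 20 proposals x 512-d features
    video_labels = torch.randint(0, 2, (4, num_classes)).float()

    instance_scores = scorer(proposals)               # (4, 20, num_classes)
    bag_scores, _ = instance_scores.max(dim=1)        # MIL max pooling over proposals
    loss = nn.functional.binary_cross_entropy_with_logits(bag_scores, video_labels)
    loss.backward()                                   # weak video-level supervision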
A Simple Baseline for Audio-Visual Scene-Aware Dialog
The recently proposed audio-visual scene-aware dialog task paves the way to a
more data-driven way of learning virtual assistants, smart speakers and car
navigation systems. However, very little is known to date about how to
effectively extract meaningful information from a plethora of sensors that
pound the computational engine of those devices. Therefore, in this paper, we
provide and carefully analyze a simple baseline for audio-visual scene-aware
dialog which is trained end-to-end. Our method differentiates in a data-driven
manner useful signals from distracting ones using an attention mechanism. We
evaluate the proposed approach on the recently introduced and challenging
audio-visual scene-aware dialog dataset, and demonstrate the key features that permit it to outperform the current state-of-the-art by more than 20% on CIDEr.
Comment: Accepted to CVPR 2019
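A minimal sketch of the attention idea, with assumed dimensions and random tensors: a query derived from the dialog scores each modality's features, and a softmax-weighted sum keeps the useful signals while down-weighting distracting ones.
    # Attention-based fusion sketch: weight audio/video/caption/history features
    # by their relevance to the current question before fusing them.
    import torch
    import torch.nn.functional as F

    query = torch.rand(8, 256)               # e.g. encoding of the current question
    modalities = torch.rand(8, 4, 256)       # 4 modality features per example

    scores = torch.einsum("bd,bmd->bm", query, modalities) / 256 ** 0.5
    weights = F.softmax(scores, dim=1)       # how much to trust each modality
    fused = torch.einsum("bm,bmd->bd", weights, modalities)  # attended summary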
Unsupervised Representation Learning by Sorting Sequences
We present an unsupervised representation learning approach using videos
without semantic labels. We leverage the temporal coherence as a supervisory
signal by formulating representation learning as a sequence sorting task. We
take temporally shuffled frames (i.e., in non-chronological order) as inputs
and train a convolutional neural network to sort the shuffled sequences.
Similar to comparison-based sorting algorithms, we propose to extract features
from all frame pairs and aggregate them to predict the correct order. As sorting a shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representations. We validate the
effectiveness of the learned representation using our method as pre-training on
high-level recognition problems. The experimental results show that our method
compares favorably against state-of-the-art methods on action recognition,
image classification and object detection tasks.
Comment: ICCV 2017. Project page: http://vllab1.ucmerced.edu/~hylee/OPN
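A minimal sketch of the sorting pretext task, with a toy encoder and random frames standing in for real video: per-frame features are combined pairwise and a classifier predicts which of the possible orders the shuffled 4-frame tuple came from.
    # Sequence-sorting sketch: pairwise features from a 4-frame tuple feed an
    # order classifier; forward and backward orders are treated as equivalent.
    import itertools
    import torch
    import torch.nn as nn

    frame_encoder = nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
    )
    pair_fc = nn.Linear(256, 64)
    n_orders = len(list(itertools.permutations(range(4)))) // 2   # 12 order classes
    classifier = nn.Linear(6 * 64, n_orders)                      # 6 pairs from 4 frames

    frames = torch.rand(8, 4, 3, 80, 80)          # batch of shuffled 4-frame tuples
    feats = frame_encoder(frames.flatten(0, 1)).view(8, 4, 128)
    pairs = [pair_fc(torch.cat([feats[:, i], feats[:, j]], dim=1))
             for i, j in itertools.combinations(range(4), 2)]
    logits = classifier(torch.cat(pairs, dim=1))  # predict the original order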
Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures
Separating an audio scene into isolated sources is a fundamental problem in
computer audition, analogous to image segmentation in visual scene analysis.
Source separation systems based on deep learning are currently the most
successful approaches for solving the underdetermined separation problem, where
there are more sources than channels. Traditionally, such systems are trained
on sound mixtures where the ground truth decomposition is already known. Since
most real-world recordings do not have such a decomposition available, this
limits the range of mixtures one can train on, and the range of mixtures the
learned models may successfully separate. In this work, we use a simple blind
spatial source separation algorithm to generate estimated decompositions of
stereo mixtures. These estimates, together with a weighting scheme in the
time-frequency domain, based on confidence in the separation quality, are used
to train a deep learning model that can be used for single-channel separation,
where no source direction information is available. This demonstrates how a simple cue, such as a source's direction of origin, can be used to bootstrap a separation model that can then be applied in situations where that cue is not available.
Comment: 5 pages, 2 figures
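A minimal sketch of the bootstrapping idea, assuming random audio in place of a real stereo recording: time-frequency bins are clustered by an interaural phase cue, and the resulting masks, together with a simple stand-in confidence weight, would serve as pseudo-targets for a single-channel separator.
    # Blind spatial clustering sketch: cluster time-frequency bins of a stereo
    # mixture by interaural phase difference; the masks become noisy labels.
    import numpy as np
    from scipy.signal import stft
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    left, right = rng.standard_normal((2, 16000))     # stand-in stereo mixture, 1 s

    _, _, L = stft(left, fs=16000, nperseg=512)
    _, _, R = stft(right, fs=16000, nperseg=512)

    ipd = np.angle(L * np.conj(R))                    # interaural phase difference
    feats = np.stack([np.cos(ipd), np.sin(ipd)], -1).reshape(-1, 2)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    masks = labels.reshape(L.shape)                   # per-bin source assignment

    # Simple stand-in confidence: trust bins where the mixture has more energy.
    weight = np.abs(L) / (np.abs(L).max() + 1e-8)
    # (masks, weight) would supervise a single-channel model fed only |L|.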
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Visual events are usually accompanied by sounds in our daily lives. However,
can the machines learn to correlate the visual scene and sound, as well as
localize the sound source only by observing them like humans? To investigate
its empirical learnability, in this work we first present a novel unsupervised
algorithm to address the problem of localizing sound sources in visual scenes.
In order to achieve this goal, a two-stream network structure which handles
each modality with attention mechanism is developed for sound source
localization. The network naturally reveals the localized response in the scene
without human annotation. In addition, a new sound source dataset is developed
for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. We further show that such false conclusions cannot be fixed without human prior knowledge, owing to the well-known mismatch between correlation and causation. To address this issue, we extend our network to supervised and semi-supervised settings via a simple modification, made possible by the general architecture of our two-stream network. We show that the false conclusions can be effectively
corrected even with a small amount of supervision, i.e., semi-supervised setup.
Furthermore, we demonstrate the versatility of the learned audio and visual embeddings on cross-modal content alignment, and we extend the proposed algorithm to a new application: sound-saliency-based automatic camera view panning in 360° videos.
Comment: To appear in TPAMI. arXiv admin note: substantial text overlap with arXiv:1803.0384
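The localization mechanism of the two-stream network can be sketched roughly as follows, with encoders omitted and random embeddings in place of real ones: the sound embedding is compared against every location of the visual feature map, yielding a response map that both localizes the source and pools the visual features.
    # Two-stream attention sketch: cosine similarity between an audio embedding
    # and each spatial location of the visual feature map gives a heat map.
    import torch
    import torch.nn.functional as F

    visual = torch.rand(8, 512, 14, 14)       # conv feature map of the frame
    audio = torch.rand(8, 512)                # embedding of the accompanying sound

    v = F.normalize(visual.flatten(2), dim=1)            # (8, 512, 196)
    a = F.normalize(audio, dim=1).unsqueeze(1)            # (8, 1, 512)
    attn = torch.bmm(a, v).view(8, 14, 14)                # similarity per location
    heatmap = torch.sigmoid(attn)                         # localization response map

    weights = heatmap / heatmap.sum(dim=(1, 2), keepdim=True)   # normalize weights
    attended = torch.einsum("bchw,bhw->bc", visual, weights)    # sound-guided pooling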
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
We introduce a new approach for audio-visual speech separation. Given a
video, the goal is to extract the speech associated with a face in spite of
simultaneous background sounds and/or other human speakers. Whereas existing
methods focus on learning the alignment between the speaker's lip movements and
the sounds they generate, we propose to leverage the speaker's face appearance
as an additional prior to isolate the corresponding vocal qualities they are
likely to produce. Our approach jointly learns audio-visual speech separation
and cross-modal speaker embeddings from unlabeled video. It yields
state-of-the-art results on five benchmark datasets for audio-visual speech
separation and enhancement, and generalizes well to challenging real-world
videos of diverse scenarios. Our video results and code:
http://vision.cs.utexas.edu/projects/VisualVoice/.
Comment: In CVPR 2021. Project page: http://vision.cs.utexas.edu/projects/VisualVoice
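A minimal sketch of face-conditioned separation, with toy modules and random tensors in place of the paper's networks: the mixture features are concatenated with a face-appearance embedding before mask prediction, and a cross-modal consistency term encourages the separated speech's voice embedding to agree with the face embedding.
    # Face-conditioned masking sketch plus a cross-modal consistency term.
    # All modules and shapes are illustrative, not the paper's architecture.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    mask_net = nn.Sequential(nn.Linear(257 + 128, 512), nn.ReLU(),
                             nn.Linear(512, 257), nn.Sigmoid())
    voice_encoder = nn.Linear(257, 128)       # embeds the separated spectrum frames

    mix = torch.rand(8, 100, 257)             # mixture magnitude frames (B, T, F)
    face = torch.rand(8, 128)                 # face-appearance embedding of the target

    cond = face.unsqueeze(1).expand(-1, 100, -1)       # broadcast face to every frame
    mask = mask_net(torch.cat([mix, cond], dim=-1))    # per-frame soft mask
    separated = mask * mix

    voice = voice_encoder(separated).mean(dim=1)       # utterance-level voice embedding
    consistency = 1 - F.cosine_similarity(voice, face).mean()   # same-person agreement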