Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions
Visual crowd counting has recently been studied as a way to count people in
crowd scenes from images. Albeit successful, vision-based crowd counting
approaches can fail to capture informative features in extreme
conditions, e.g., night-time imaging and occlusion. In this work, we introduce a
novel task of audiovisual crowd counting, in which visual and auditory
information are integrated for counting purposes. We collect a large-scale
benchmark, named auDiovISual Crowd cOunting (DISCO) dataset, consisting of
1,935 images and the corresponding audio clips, and 170,270 annotated
instances. In order to fuse the two modalities, we make use of a linear
feature-wise fusion module that carries out an affine transformation on visual
and auditory features. Finally, we conduct extensive experiments using the
proposed dataset and approach. Experimental results show that introducing
auditory information can benefit crowd counting under different illumination,
noise, and occlusion conditions. The dataset and code have been made
available.
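
As a rough sketch of the linear feature-wise fusion module described above, the following PyTorch snippet applies an audio-conditioned affine transformation (a per-channel scale and shift) to a visual feature map. The module name, feature dimensions, and the choice to modulate visual features with audio-derived parameters are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class FeatureWiseAffineFusion(nn.Module):
    """Audio-conditioned affine modulation of a visual feature map (illustrative)."""
    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        # Predict a per-channel scale (gamma) and shift (beta) from the audio embedding.
        self.to_gamma = nn.Linear(audio_dim, visual_channels)
        self.to_beta = nn.Linear(audio_dim, visual_channels)

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W) map from a visual backbone
        # audio_feat:  (B, D) embedding of the audio clip
        gamma = self.to_gamma(audio_feat)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(audio_feat)[:, :, None, None]    # (B, C, 1, 1)
        return gamma * visual_feat + beta                    # feature-wise affine fusion

if __name__ == "__main__":
    fusion = FeatureWiseAffineFusion(audio_dim=128, visual_channels=256)
    fused = fusion(torch.randn(2, 256, 32, 32), torch.randn(2, 128))
    print(fused.shape)  # torch.Size([2, 256, 32, 32])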
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
A crucial ability of mobile intelligent agents is to integrate the evidence
from multiple sensory inputs in an environment and to take a sequence of
actions to reach their goals. In this paper, we attempt to approach the problem
of Audio-Visual Embodied Navigation, the task of planning the shortest path
from a random starting location in a scene to the sound source in an indoor
environment, given only raw egocentric visual and audio sensory data. To
accomplish this task, the agent must learn from multiple modalities,
i.e., relate the audio signal to the visual environment. Here we describe an
approach to audio-visual embodied navigation that takes advantage of both
visual and audio pieces of evidence. Our solution is based on three key ideas:
a visual perception mapper module that constructs its spatial memory of the
environment, a sound perception module that infers the relative location of the
sound source from the agent, and a dynamic path planner that plans a sequence
of actions based on the audio-visual observations and the spatial memory of the
environment to navigate toward the goal. Experimental results on a newly
collected Visual-Audio-Room dataset using the simulated multi-modal environment
demonstrate the effectiveness of our approach over several competitive
baselines.
Comment: Accepted by ICRA 2020. Project page: http://avn.csail.mit.ed
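
To make the three-module decomposition above concrete, the toy Python sketch below wires a visual mapper (spatial memory), a sound-based offset estimator, and a greedy planner together on a 2-D grid. The grid abstraction, the greedy action selection, and the pre-computed audio offset are illustrative simplifications rather than the authors' method.

import numpy as np

class VisualMapper:
    """Visual perception mapper: accumulates a spatial memory of visited cells."""
    def __init__(self):
        self.visited = set()

    def update(self, pos):
        self.visited.add(tuple(int(x) for x in pos))

class SoundLocalizer:
    """Sound perception module: estimates the source's offset relative to the agent.
    Here the 'audio observation' is already a (dx, dy) vector, for brevity."""
    def relative_offset(self, audio_obs):
        return np.asarray(audio_obs, dtype=float)

class DynamicPlanner:
    """Path planner: greedily picks the discrete action best aligned with the offset."""
    ACTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

    def next_action(self, offset):
        return max(self.ACTIONS, key=lambda a: float(np.dot(self.ACTIONS[a], offset)))

if __name__ == "__main__":
    mapper, ears, planner = VisualMapper(), SoundLocalizer(), DynamicPlanner()
    pos, goal = np.array([0, 0]), np.array([3, 5])
    for _ in range(20):                                   # episode budget
        mapper.update(pos)                                # grow spatial memory
        offset = ears.relative_offset(goal - pos)         # simulated audio cue
        step = planner.ACTIONS[planner.next_action(offset)]
        pos = pos + np.array(step)                        # execute one action
        if np.array_equal(pos, goal):
            break
    print("reached source:", np.array_equal(pos, goal), "| cells mapped:", len(mapper.visited))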
Foley Music: Learning to Generate Music from Videos
In this paper, we introduce Foley Music, a system that can synthesize
plausible music for a silent video clip about people playing musical
instruments. We first identify two key intermediate representations for a
successful video to music generator: body keypoints from videos and MIDI events
from audio recordings. We then formulate music generation from videos as a
motion-to-MIDI translation problem. We present a Graph-Transformer framework
that can accurately predict MIDI event sequences in accordance with the body
movements. The MIDI events can then be converted to realistic music using an
off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our
models on videos containing a variety of music performances. Experimental
results show that our model outperforms several existing systems in generating
music that is pleasant to listen to. More importantly, the MIDI representations
are fully interpretable and transparent, thus enabling us to perform music
editing flexibly. We encourage the readers to watch the demo video with audio
turned on to experience the results.
Comment: ECCV 2020. Project page: http://foley-music.csail.mit.ed
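
One way to read the motion-to-MIDI formulation above is as autoregressive sequence prediction conditioned on per-frame pose features. The PyTorch sketch below uses a plain nn.Transformer as a stand-in for the paper's Graph-Transformer; the keypoint count, MIDI vocabulary size, and the omission of positional encodings are simplifying assumptions.

import torch
import torch.nn as nn

class MotionToMIDI(nn.Module):
    """Autoregressive pose-to-MIDI-token model (illustrative stand-in)."""
    def __init__(self, num_keypoints=25, midi_vocab=256, d_model=128):
        super().__init__()
        # Per-frame pose encoder: flatten (x, y) keypoints into a d_model vector.
        self.pose_proj = nn.Linear(num_keypoints * 2, d_model)
        self.midi_embed = nn.Embedding(midi_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.to_logits = nn.Linear(d_model, midi_vocab)

    def forward(self, keypoints, midi_tokens):
        # keypoints:   (B, T_video, num_keypoints, 2) body joints per frame
        # midi_tokens: (B, T_midi) MIDI event tokens generated so far
        src = self.pose_proj(keypoints.flatten(2))            # (B, T_video, d_model)
        tgt = self.midi_embed(midi_tokens)                    # (B, T_midi, d_model)
        causal = self.transformer.generate_square_subsequent_mask(midi_tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)     # decoder cross-attends to pose
        return self.to_logits(out)                            # next-event logits

if __name__ == "__main__":
    model = MotionToMIDI()
    keypoints = torch.randn(2, 100, 25, 2)        # 100 video frames of 25 joints
    midi = torch.randint(0, 256, (2, 40))         # 40 MIDI event tokens so far
    print(model(keypoints, midi).shape)           # torch.Size([2, 40, 256])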