Foley Music: Learning to Generate Music from Videos
In this paper, we introduce Foley Music, a system that can synthesize
plausible music for a silent video clip of people playing musical
instruments. We first identify two key intermediate representations for a
successful video to music generator: body keypoints from videos and MIDI events
from audio recordings. We then formulate music generation from videos as a
motion-to-MIDI translation problem. We present a Graph-Transformer framework
that can accurately predict MIDI event sequences in accordance with the body
movements. The MIDI events can then be converted to realistic music using an
off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our
models on videos containing a variety of music performances. Experimental
results show that our model outperforms several existing systems in generating
music that is pleasant to listen to. More importantly, the MIDI representations
are fully interpretable and transparent, thus enabling us to perform music
editing flexibly. We encourage the readers to watch the demo video with audio
turned on to experience the results.
Comment: ECCV 2020. Project page: http://foley-music.csail.mit.edu
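To make the motion-to-MIDI formulation concrete, here is a minimal sketch (not the authors' code) of a Graph-Transformer-style model in PyTorch: a toy graph convolution pools per-frame body keypoints into motion features, and a transformer decoder autoregressively predicts MIDI event tokens conditioned on them. The skeleton adjacency, layer sizes, and the 388-token event vocabulary are illustrative assumptions.

import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Encodes per-frame body keypoints with a simple graph convolution."""
    def __init__(self, num_joints=25, in_dim=2, hid=128):
        super().__init__()
        # Hypothetical skeleton adjacency (identity here; a real model
        # would use the actual joint connectivity).
        self.register_buffer("adj", torch.eye(num_joints))
        self.proj = nn.Linear(in_dim, hid)
        self.gcn = nn.Linear(hid, hid)

    def forward(self, kpts):                 # kpts: (B, T, J, 2)
        h = self.proj(kpts)                  # (B, T, J, hid)
        h = torch.einsum("ij,btjh->btih", self.adj, h)  # aggregate neighbors
        h = torch.relu(self.gcn(h))
        return h.mean(dim=2)                 # pool over joints -> (B, T, hid)

class MotionToMIDI(nn.Module):
    """Transformer decoder predicts MIDI event tokens from motion features."""
    def __init__(self, vocab=388, hid=128, heads=4, layers=4):
        super().__init__()
        self.encoder = GraphEncoder(hid=hid)
        self.embed = nn.Embedding(vocab, hid)
        dec_layer = nn.TransformerDecoderLayer(d_model=hid, nhead=heads,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.head = nn.Linear(hid, vocab)

    def forward(self, kpts, midi_tokens):
        memory = self.encoder(kpts)                        # (B, T, hid)
        tgt = self.embed(midi_tokens)                      # (B, L, hid)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)     # causal decoding
        return self.head(out)                              # logits over events

# Toy usage: 8 frames of 25 keypoints, a 16-token MIDI prefix.
model = MotionToMIDI()
logits = model(torch.randn(1, 8, 25, 2), torch.randint(0, 388, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 388])

In a full pipeline the predicted event tokens would be decoded back into a MIDI file and rendered with any off-the-shelf synthesizer (e.g., FluidSynth), which is what makes the intermediate representation editable.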
Learning Neural Acoustic Fields
Our environment is filled with rich and dynamic acoustic information. When we
walk into a cathedral, the reverberations as much as appearance inform us of
the sanctuary's wide open space. Similarly, as an object moves around us, we
expect the sound emitted to also exhibit this movement. While recent advances
in learned implicit functions have led to increasingly higher quality
representations of the visual world, there have not been commensurate advances
in learning spatial auditory representations. To address this gap, we introduce
Neural Acoustic Fields (NAFs), an implicit representation that captures how
sounds propagate in a physical scene. By modeling acoustic propagation in a
scene as a linear time-invariant system, NAFs learn to continuously map all
emitter and listener location pairs to a neural impulse response function that
can then be applied to arbitrary sounds. We demonstrate that the continuous
nature of NAFs enables us to render spatial acoustics for a listener at an
arbitrary location and to predict sound propagation at novel locations. We
further show that the representation learned by NAFs can help improve visual
learning with sparse views. Finally, we show that a representation informative
of scene structure emerges during the learning of NAFs.
Comment: Project page: https://www.andrew.cmu.edu/user/afluo/Neural_Acoustic_Fields
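Because the scene is treated as a linear time-invariant system, once the field predicts an impulse response for an emitter/listener pair, spatializing any dry sound reduces to a convolution. The sketch below is not the authors' code (the paper operates on time-frequency representations of the impulse response, while this toy version regresses the waveform directly); all positions, sizes, and the hypothetical render_at helper are assumptions for illustration.

import torch
import torch.nn as nn

class NeuralAcousticField(nn.Module):
    """MLP: (emitter xyz, listener xyz, time t) -> impulse response sample."""
    def __init__(self, hid=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + 1, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU(),
            nn.Linear(hid, 1),
        )

    def forward(self, emitter, listener, t):
        # emitter, listener: (N, 3); t: (N, 1) normalized time indices.
        return self.mlp(torch.cat([emitter, listener, t], dim=-1))

def render_at(field, emitter, listener, ir_len=4096):
    """Query the field densely in time for one emitter/listener pair."""
    t = torch.linspace(0, 1, ir_len).unsqueeze(1)   # (ir_len, 1)
    e = emitter.expand(ir_len, 3)
    l = listener.expand(ir_len, 3)
    return field(e, l, t).squeeze(1)                # (ir_len,) impulse response

# LTI assumption: spatializing a dry signal is convolution with the IR.
field = NeuralAcousticField()
ir = render_at(field,
               torch.tensor([0.0, 1.0, 0.5]),       # made-up emitter position
               torch.tensor([2.0, 0.3, 0.5]))       # made-up listener position
dry = torch.randn(16000)                            # 1 s of dry audio at 16 kHz
# conv1d is cross-correlation in PyTorch, so flip the kernel for convolution.
wet = torch.nn.functional.conv1d(dry.view(1, 1, -1),
                                 ir.flip(0).view(1, 1, -1),
                                 padding=ir.numel() - 1).view(-1)
print(wet.shape)

Since the network is continuous in the emitter and listener coordinates, the same query mechanism yields impulse responses at positions never seen during training, which is what enables rendering for arbitrary listener locations.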