Learning sound representations using trainable COPE feature extractors
Sound analysis research has mainly focused on speech and music processing.
The methodologies deployed there are not suitable for analyzing sounds with
varying background noise, in many cases with a very low signal-to-noise
ratio (SNR). In this paper, we present a method for the detection of patterns
of interest in audio signals. We propose novel trainable feature extractors,
which we call COPE (Combination of Peaks of Energy). The structure of a COPE
feature extractor is determined using a single prototype sound pattern in an
automatic configuration process, which is a type of representation learning. We
construct a set of COPE feature extractors, configured on a number of training
patterns. Then we take their responses to build feature vectors that we use in
combination with a classifier to detect and classify patterns of interest in
audio signals. We carried out experiments on four public data sets: MIVIA audio
events, MIVIA road events, ESC-10 and TU Dortmund data sets. The results that
we achieved (recognition rate equal to 91.71% on the MIVIA audio events, 94% on
the MIVIA road events, 81.25% on the ESC-10 and 94.27% on the TU Dortmund)
demonstrate the effectiveness of the proposed method and exceed those
obtained by other existing approaches. The COPE feature extractors are highly
robust to variations in SNR. Real-time performance is achieved even when a
large number of features is computed.

Comment: Accepted for publication in Pattern Recognition
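The configure-then-respond idea described in the abstract can be sketched very loosely as follows. This is a simplified stand-in, not the paper's method: it uses plain local-maximum peak picking on an arbitrary time-frequency energy map instead of a Gammatonegram, anchors the constellation on the strongest peak, and scores matches with a Gaussian weighting rather than the paper's weighted combination of peak responses. All function names are hypothetical.

```python
import numpy as np

def extract_peaks(spectrogram, threshold=0.5):
    """Find local energy maxima along time in each frequency band.

    Returns (time, freq) positions whose energy exceeds `threshold`
    times the global maximum (toy peak picking on any time-frequency
    energy map)."""
    peaks = []
    ref = threshold * spectrogram.max()
    for f in range(spectrogram.shape[0]):
        band = spectrogram[f]
        for t in range(1, spectrogram.shape[1] - 1):
            if band[t] >= ref and band[t] > band[t - 1] and band[t] >= band[t + 1]:
                peaks.append((t, f))
    return peaks

def configure_cope(prototype):
    """Configure a COPE-like model from a single prototype pattern:
    store peak positions relative to the strongest peak."""
    peaks = extract_peaks(prototype)
    t0, _ = max(peaks, key=lambda p: prototype[p[1], p[0]])
    return [(t - t0, f) for t, f in peaks]

def cope_response(model, spectrogram, sigma=1.0):
    """Slide the configured peak constellation over time; each position
    is scored by a Gaussian-weighted match of the expected peaks."""
    peaks = extract_peaks(spectrogram)
    T = spectrogram.shape[1]
    scores = np.zeros(T)
    for t0 in range(T):
        s = 0.0
        for dt, f in model:
            t = t0 + dt
            if 0 <= t < T:
                # distance to the nearest observed peak in this band
                d = min((abs(t - tp) for tp, fp in peaks if fp == f),
                        default=np.inf)
                s += np.exp(-d ** 2 / (2 * sigma ** 2))
        scores[t0] = s / len(model)
    return scores
```

A tolerance to SNR variation plausibly comes from the fact that the score depends only on peak *positions*, not absolute energies, so moderate noise that does not create or displace peaks leaves the response largely unchanged.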
Foley Music: Learning to Generate Music from Videos
In this paper, we introduce Foley Music, a system that can synthesize
plausible music for a silent video clip about people playing musical
instruments. We first identify two key intermediate representations for a
successful video to music generator: body keypoints from videos and MIDI events
from audio recordings. We then formulate music generation from videos as a
motion-to-MIDI translation problem. We present a GraphTransformer framework
that can accurately predict MIDI event sequences in accordance with the body
movements. The MIDI event can then be converted to realistic music using an
off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our
models on videos containing a variety of music performances. Experimental
results show that our model outperforms several existing systems in generating
music that is pleasant to listen to. More importantly, the MIDI representations
are fully interpretable and transparent, thus enabling us to perform music
editing flexibly. We encourage readers to watch the demo video with the
audio turned on to experience the results.

Comment: ECCV 2020. Project page: http://foley-music.csail.mit.ed
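The MIDI event representation that makes this approach interpretable can be illustrated with a toy tokenization in the style commonly used by music sequence models (NOTE_ON / NOTE_OFF / TIME_SHIFT events). The actual Foley Music vocabulary may differ, and `notes_to_events` is a hypothetical helper, but a sequence like this is the kind of target a motion-to-MIDI translation model would predict from body keypoints:

```python
def notes_to_events(notes):
    """Convert (onset_sec, offset_sec, pitch) notes into a flat event
    sequence of NOTE_ON / NOTE_OFF / TIME_SHIFT tokens."""
    boundaries = []
    for onset, offset, pitch in notes:
        boundaries.append((onset, "NOTE_ON", pitch))
        boundaries.append((offset, "NOTE_OFF", pitch))
    # sort by time; at equal times, "NOTE_OFF" sorts before "NOTE_ON"
    boundaries.sort()
    events, clock = [], 0.0
    for time, kind, pitch in boundaries:
        if time > clock:
            # advance the running clock with an explicit time token
            events.append(("TIME_SHIFT", round(time - clock, 3)))
            clock = time
        events.append((kind, pitch))
    return events
```

Because each token maps directly to a note onset, offset, or pause, edits such as transposition or re-timing amount to simple token rewrites before synthesis, which is the flexibility the abstract points to.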