4 research outputs found
Weakly-supervised Visual Instrument-playing Action Detection in Videos
Instrument playing is among the most common scenes in music-related videos, which nowadays represent one of the largest categories of online video. To understand the instrument-playing scenes in such videos, it is important to know which instruments are played, when they are played, and where the playing actions occur in the scene. While audio-based instrument recognition has been widely studied, the visual aspect of musical instrument playing remains largely unaddressed in the literature. One of the main obstacles is the difficulty of collecting annotated action locations for training-based methods. To address this issue, we propose a weakly-supervised framework that finds when and where instruments are played in videos. We use two auxiliary models, a sound model and an object model, to provide supervision for training the instrument-playing action model: the sound model provides temporal supervision and the object model provides spatial supervision, so that together they supervise the action model in both time and space. The resulting model only needs to analyze the visual part of a music video to deduce which instruments are played, and when and where. We found that the proposed method significantly improves localization accuracy. We evaluate the proposed method temporally and spatially on a small dataset (5,400 frames in total) that we manually annotated.
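As a rough, hypothetical sketch of how such dual supervision could be wired up (not the authors' code; the tensor shapes, pooling choice, and loss weighting are all assumptions):

    # Sketch: dual weak supervision for a visual instrument-playing action model.
    # A pretrained sound model supplies per-frame instrument labels (temporal
    # supervision); a pretrained object model supplies instrument heatmaps
    # (spatial supervision).
    import torch
    import torch.nn.functional as F

    def weak_supervision_loss(action_heatmaps, sound_probs, object_heatmaps):
        # action_heatmaps: (T, C, H, W) raw scores from the visual action model
        # sound_probs:     (T, C)       per-frame instrument presence from the sound model
        # object_heatmaps: (T, C, H, W) instrument locations from the object model
        # Temporal supervision: spatially pooled scores should match the audio labels.
        temporal_pred = torch.sigmoid(action_heatmaps.amax(dim=(2, 3)))   # (T, C)
        temporal_loss = F.binary_cross_entropy(temporal_pred, sound_probs)
        # Spatial supervision: heatmaps should match where the instrument object is,
        # weighted by frames on which the sound model says the instrument is playing.
        frame_weight = sound_probs[..., None, None]                        # (T, C, 1, 1)
        spatial_loss = (frame_weight * F.binary_cross_entropy_with_logits(
            action_heatmaps, object_heatmaps, reduction="none")).mean()
        return temporal_loss + spatial_loss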
Multitask learning for frame-level instrument recognition
For many music analysis problems, we need to know the presence of instruments for each time frame in a multi-instrument musical piece. However, such a frame-level instrument recognition task remains difficult, mainly due to the lack of labeled datasets. To address this issue, we present in this paper a large-scale dataset that contains synthetic polyphonic music with frame-level pitch and instrument labels. Moreover, we propose a simple yet novel network architecture to jointly predict the pitch and instrument for each frame. With this multitask learning method, the pitch information can be leveraged to predict the instruments, and vice versa. In addition, by using the so-called pianoroll representation of music as the main target output of the model, our model also predicts the instruments that play each individual note event. We validate the effectiveness of the proposed method for frame-level instrument recognition by comparing it with its single-task ablated versions and three state-of-the-art methods. We also demonstrate the result of the proposed method for multi-pitch streaming with real-world music. For reproducibility, we will share the code to crawl the data and to implement the proposed model at: https://github.com/biboamy/instrument-streaming.
Comment: This is a pre-print version of an ICASSP 2019 paper.
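A minimal sketch of the kind of multitask, frame-level model described above (names, layer choices, and label dimensions are assumptions, not the paper's architecture):

    # Sketch: shared encoder with two heads that jointly predict per-frame
    # pitch (pianoroll) and instrument activity.
    import torch
    import torch.nn as nn

    class MultitaskFrameModel(nn.Module):
        def __init__(self, n_bins=229, n_pitches=88, n_instruments=7, hidden=256):
            super().__init__()
            # Shared recurrent encoder over the input spectrogram frames.
            self.encoder = nn.GRU(n_bins, hidden, batch_first=True, bidirectional=True)
            # One head per task; both read the same shared features.
            self.pitch_head = nn.Linear(2 * hidden, n_pitches)           # frame-level pianoroll
            self.instrument_head = nn.Linear(2 * hidden, n_instruments)  # frame-level instruments

        def forward(self, spectrogram):           # (batch, frames, n_bins)
            features, _ = self.encoder(spectrogram)
            return self.pitch_head(features), self.instrument_head(features)

Training such a model would typically sum binary cross-entropy losses over the two heads, so that the pitch and instrument labels regularize each other.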
Musical Composition Style Transfer via Disentangled Timbre Representations
Music creation involves not only composing the different parts (e.g., melody, chords) of a musical work but also arranging/selecting the instruments to play the different parts. While the former has received increasing attention, the latter has not been much investigated. This paper presents, to the best of our knowledge, the first deep learning models for rearranging music of arbitrary genres. Specifically, we build encoders and decoders that take a piece of polyphonic musical audio as input and predict its musical score as output. We investigate disentanglement techniques such as adversarial training to separate the latent factors that are related to the musical content (pitch) of the different parts of the piece from those that are related to the instrumentation (timbre) of the parts per short-time segment. By disentangling pitch and timbre, our models capture how each piece was composed and arranged. Moreover, the models can realize "composition style transfer" by rearranging a musical piece without substantially affecting its pitch content. We validate the effectiveness of the models with experiments on instrument activity detection and composition style transfer. To facilitate follow-up research, we open-source our code at https://github.com/biboamy/instrument-disentangle.
Comment: Accepted by the 28th International Joint Conference on Artificial Intelligence. arXiv admin note: text overlap with arXiv:1811.0327
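As an illustrative sketch of adversarial pitch/timbre disentanglement in general (a common gradient-reversal formulation; the module names and dimensions are hypothetical and not taken from the paper):

    # Sketch: an auxiliary classifier tries to recover instrument (timbre)
    # information from the pitch latent; gradient reversal trains the encoder
    # to defeat it, pushing timbre information out of the pitch latent.
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output   # flip the gradient flowing back into the encoder

    class PitchTimbreEncoder(nn.Module):
        def __init__(self, n_bins=229, pitch_dim=64, timbre_dim=64, n_instruments=5):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU())
            self.to_pitch = nn.Linear(256, pitch_dim)    # content (pitch) latent
            self.to_timbre = nn.Linear(256, timbre_dim)  # instrumentation (timbre) latent
            # Adversary: instrument classifier applied to the pitch latent only.
            self.adversary = nn.Linear(pitch_dim, n_instruments)

        def forward(self, frames):                       # (batch, n_bins)
            h = self.backbone(frames)
            z_pitch, z_timbre = self.to_pitch(h), self.to_timbre(h)
            adv_logits = self.adversary(GradReverse.apply(z_pitch))
            return z_pitch, z_timbre, adv_logits

Composition style transfer then amounts to decoding the pitch latent of one piece together with the timbre latent taken from another.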
Temporal Action Localization using Long Short-Term Dependency
Temporal action localization in untrimmed videos is an important but difficult task, and existing methods struggle to model the temporal structure of videos. In the present study, we developed a novel method, referred to as the Gemini Network, for effective modeling of temporal structure and high-performance temporal action localization. The significant improvements afforded by the proposed method are attributable to three major factors. First, the network uses two subnets for effective modeling of temporal structure. Second, three parallel feature extraction pipelines prevent interference between the extraction of features at different stages. Third, the method employs auxiliary supervision, with the auxiliary classifier losses providing additional constraints that improve the modeling capability of the network. As a demonstration of its effectiveness, the Gemini Network achieves state-of-the-art temporal action localization performance on two challenging datasets, namely THUMOS14 and ActivityNet.
Comment: 12 pages, Tran
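A minimal sketch of the auxiliary-supervision idea mentioned above (the function name, loss weighting, and use of plain cross-entropy are assumptions for illustration, not the paper's exact formulation):

    # Sketch: combine the main localization loss with auxiliary classifier
    # losses computed on intermediate stages, adding extra constraints.
    import torch
    import torch.nn.functional as F

    def auxiliary_supervision_loss(main_logits, aux_logits_list, labels, aux_weight=0.5):
        # main_logits:     (batch, n_classes) from the final localization head
        # aux_logits_list: list of (batch, n_classes) from auxiliary classifiers
        #                  attached to intermediate features
        main_loss = F.cross_entropy(main_logits, labels)
        aux_loss = sum(F.cross_entropy(a, labels) for a in aux_logits_list)
        return main_loss + aux_weight * aux_loss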