Musical Composition Style Transfer via Disentangled Timbre Representations
Music creation involves not only composing the different parts (e.g., melody,
chords) of a musical work but also arranging/selecting the instruments to play
the different parts. While the former has received increasing attention, the
latter has not been much investigated. This paper presents, to the best of our
knowledge, the first deep learning models for rearranging music of arbitrary
genres. Specifically, we build encoders and decoders that take a piece of
polyphonic musical audio as input and predict as output its musical score. We
investigate disentanglement techniques, such as adversarial training, to separate
the latent factors related to the musical content (pitch) of the different
parts of a piece from those related to the instrumentation (timbre) of the
parts in each short-time segment. By disentangling pitch and timbre, our models
have an idea of how each piece was composed and arranged. Moreover, the models
can realize "composition style transfer" by rearranging a musical piece without
greatly affecting its pitch content. We validate the effectiveness of the models
by experiments on instrument activity detection and composition style transfer.
To facilitate follow-up research, we open source our code at
https://github.com/biboamy/instrument-disentangle.
Comment: Accepted by the 28th International Joint Conference on Artificial Intelligence. arXiv admin note: text overlap with arXiv:1811.0327
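The core technique in this first paper is separating pitch and timbre information with an adversarial objective. As a rough illustration, the minimal PyTorch sketch below uses a gradient-reversal adversary that tries to recover the instrument label from the pitch latent, so that the encoder is pushed to strip instrument (timbre) cues out of that latent. The module names, dimensions, and the gradient-reversal formulation are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of pitch/timbre disentanglement via adversarial training.
# All names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class PitchTimbreEncoder(nn.Module):
    """Encodes an audio feature frame into separate pitch and timbre latents."""

    def __init__(self, in_dim=128, pitch_dim=64, timbre_dim=32, n_instruments=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.pitch_head = nn.Linear(256, pitch_dim)
        self.timbre_head = nn.Linear(256, timbre_dim)
        # Adversary: tries to recover the instrument label from the pitch latent.
        self.adversary = nn.Linear(pitch_dim, n_instruments)

    def forward(self, x, lambd=1.0):
        h = self.backbone(x)
        z_pitch = self.pitch_head(h)
        z_timbre = self.timbre_head(h)
        # Gradient reversal: the adversary minimizes instrument-classification
        # loss, while the encoder is pushed to remove instrument cues from z_pitch.
        inst_logits = self.adversary(GradReverse.apply(z_pitch, lambd))
        return z_pitch, z_timbre, inst_logits


# Toy usage: one adversarial training step on random data.
model = PitchTimbreEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 128)                   # batch of audio feature frames
inst_labels = torch.randint(0, 10, (8,))  # per-frame instrument labels
_, _, inst_logits = model(x)
loss = nn.functional.cross_entropy(inst_logits, inst_labels)
loss.backward()                           # encoder receives reversed gradients
opt.step()
```

In a full model this adversarial term would be added to the reconstruction and transcription losses so the pitch latent stays informative about the score while shedding timbre information.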
Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling
High-level musical qualities (such as emotion) are often abstract,
subjective, and hard to quantify. Given these difficulties, it is not easy to
learn good feature representations with supervised learning techniques, either
because labels are scarce or because human-annotated labels are subjective (and
hence exhibit large variance). In this paper, we present a framework that
can learn high-level feature representations with a limited amount of data, by
first modelling their corresponding quantifiable low-level attributes. We refer
to our proposed framework as Music FaderNets, which is inspired by the fact
that low-level attributes can be continuously manipulated by separate "sliding
faders" through feature disentanglement and latent regularization techniques.
High-level features are then inferred from the low-level representations
through semi-supervised clustering using Gaussian Mixture Variational
Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature, we
show that the "faders" of our model are disentangled and change linearly w.r.t.
the modelled low-level attributes of the generated output music. Furthermore,
we demonstrate that the model successfully learns the intrinsic relationship
between arousal and its corresponding low-level attributes (rhythm and note
density), with only 1% of the training set being labelled. Finally, using the
learnt high-level feature representations, we explore the application of our
framework in style transfer tasks across different arousal states. The
effectiveness of this approach is verified through a subjective listening test.
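As a rough illustration of the "sliding fader" idea, the sketch below shows one latent-regularization loss from this line of work: the pairwise ordering of a chosen latent dimension across a batch is matched to the ordering of a low-level attribute (e.g., note density), so that dimension can later be dialed like a fader. The function name, dimensions, and toy data are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of a "fader"-style latent regularizer: one latent dimension
# is encouraged to vary monotonically with a low-level attribute.
# Names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


def fader_regularization(z_dim_values, attribute_values, gamma=1.0):
    """Match the batch-wise ordering of one latent dimension to the
    ordering of a low-level attribute."""
    # Pairwise differences across the batch.
    d_z = z_dim_values.unsqueeze(1) - z_dim_values.unsqueeze(0)
    d_a = attribute_values.unsqueeze(1) - attribute_values.unsqueeze(0)
    # tanh of latent differences should match the sign of attribute differences.
    return nn.functional.l1_loss(torch.tanh(gamma * d_z), torch.sign(d_a))


# Toy usage: regularize latent dimension 0 to act as a "note density" fader.
batch = 16
z = torch.randn(batch, 32, requires_grad=True)  # latents from an encoder
note_density = torch.rand(batch)                # per-example low-level attribute
reg_loss = fader_regularization(z[:, 0], note_density)
reg_loss.backward()                             # added to the VAE loss in practice
```

In the full framework this per-attribute regularization is combined with a Gaussian Mixture VAE, whose mixture components provide the semi-supervised clustering that links the low-level faders to a high-level feature such as arousal.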