PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network
Music creation is typically composed of two parts: composing the musical
score, and then performing the score with instruments to make sounds. While
recent work has made much progress in automatic music generation in the
symbolic domain, few attempts have been made to build an AI model that can
render realistic music audio from musical scores. Directly synthesizing audio
with sound sample libraries often leads to mechanical and deadpan results,
since musical scores do not contain performance-level information, such as
subtle changes in timing and dynamics. Moreover, while the task may sound like
a text-to-speech synthesis problem, there are fundamental differences since
music audio has rich polyphonic sounds. To build such an AI performer, we
propose in this paper a deep convolutional model that learns in an end-to-end
manner the score-to-audio mapping between a symbolic representation of music
called the piano roll and an audio representation of music called the
spectrogram. The model consists of two subnets: the ContourNet, which uses a
U-Net structure to learn the correspondence between piano rolls and
spectrograms and to give an initial result; and the TextureNet, which further
uses a multi-band residual network to refine the result by adding the spectral
texture of overtones and timbre. We train the model to generate music clips of
the violin, cello, and flute, with a dataset of moderate size. We also present
the result of a user study that shows our model achieves a higher mean opinion
score (MOS) in naturalness and emotional expressivity than a WaveNet-based
model and two commercial sound libraries. We open our source code at
https://github.com/bwang514/PerformanceNet
Comment: 8 pages, 6 figures, AAAI 2019 camera-ready version
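The piano-roll input mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the note format `(midi_pitch, onset_seconds, offset_seconds)`, the 128-pitch range, the frame rate, and the function name `notes_to_piano_roll` are all assumptions made for illustration.

```python
# Illustrative only: build a binary piano roll (pitch x time) from a note list.
# The 128-pitch MIDI range and the note tuple format are assumptions,
# not details taken from the abstract.
N_PITCHES = 128


def notes_to_piano_roll(notes, n_frames, frames_per_second):
    """Return a list of N_PITCHES rows, each with n_frames entries of 0 or 1.

    notes: iterable of (midi_pitch, onset_seconds, offset_seconds).
    """
    roll = [[0] * n_frames for _ in range(N_PITCHES)]
    for pitch, onset, offset in notes:
        # Convert times to frame indices and clip them to the roll length.
        start = max(0, int(round(onset * frames_per_second)))
        end = min(n_frames, int(round(offset * frames_per_second)))
        for t in range(start, end):
            roll[pitch][t] = 1  # the note is sounding in frame t
    return roll


# Two hypothetical notes: C4 for the first half second, then E4.
notes = [(60, 0.0, 0.5), (64, 0.5, 1.0)]
roll = notes_to_piano_roll(notes, n_frames=10, frames_per_second=10)
```

A binary roll like this records only which pitches sound when. It carries no performance-level timing or dynamics, which is the gap the abstract says the model is meant to fill.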
Pop Music Highlighter: Marking the Emotion Keypoints
The goal of music highlight extraction is to get a short consecutive segment
of a piece of music that provides an effective representation of the whole
piece. In previous work, we introduced an attention-based convolutional
recurrent neural network that uses music emotion classification as a surrogate
task for music highlight extraction in pop songs. The rationale behind that
approach is that the highlight of a song is usually the most emotional part.
This paper extends our previous work in the following two aspects. First,
methodology-wise we experiment with a new architecture that does not need any
recurrent layers, making the training process faster. Moreover, we compare a
late-fusion variant and an early-fusion variant to study which one better
exploits the attention mechanism. Second, we conduct and report an extensive
set of experiments comparing the proposed attention-based methods against a
heuristic energy-based method, a structural repetition-based method, and a few
other simple feature-based methods for this task. Due to the lack of
public-domain labeled data for highlight extraction, following our previous
work we use the RWC POP 100-song data set to evaluate how the detected
highlights overlap with any chorus sections of the songs. The experiments
demonstrate the effectiveness of our methods over competing methods. For
reproducibility, we open source the code and pre-trained model at
https://github.com/remyhuang/pop-music-highlighter/.
Comment: Transactions of the ISMIR vol. 1, no.
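The evaluation idea in the abstract above, measuring how much a detected highlight overlaps with chorus sections, can be sketched as follows. The exact metric used in the paper is not given here, so the ratio below (the fraction of the highlight covered by chorus) is an assumption, as are the function names and the example timings.

```python
# Illustrative only: fraction of a detected highlight that falls inside
# annotated chorus sections. Intervals are (start_seconds, end_seconds).
# Assumes the chorus sections do not overlap one another.


def overlap_seconds(a, b):
    """Length in seconds of the intersection of two intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def chorus_overlap_ratio(highlight, chorus_sections):
    """Share of the highlight's duration covered by any chorus section."""
    length = highlight[1] - highlight[0]
    if length <= 0:
        return 0.0
    covered = sum(overlap_seconds(highlight, c) for c in chorus_sections)
    return covered / length


# Hypothetical 30-second highlight and two hypothetical chorus annotations.
highlight = (60.0, 90.0)
chorus_sections = [(50.0, 70.0), (85.0, 120.0)]
ratio = chorus_overlap_ratio(highlight, chorus_sections)
```

In this made-up example the highlight shares 10 seconds with the first chorus and 5 seconds with the second, so half of it lies inside chorus sections.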
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective
  - What musical content is to be generated? Examples are: melody,
    polyphony, accompaniment or counterpoint.
  - For what destination and for what use? To be performed by a human(s)
    (in the case of a musical score), or by a machine (in the case of an
    audio file).
Representation
  - What are the concepts to be manipulated? Examples are: waveform,
    spectrogram, note, chord, meter and beat.
  - What format is to be used? Examples are: MIDI, piano roll or text.
  - How will the representation be encoded? Examples are: scalar, one-hot
    or many-hot.
Architecture
  - What type(s) of deep neural network is (are) to be used? Examples are:
    feedforward network, recurrent network, autoencoder or generative
    adversarial networks.
Challenge
  - What are the limitations and open challenges? Examples are:
    variability, interactivity and creativity.
Strategy
  - How do we model and control the process of generation? Examples are:
    single-step feedforward, iterative feedforward, sampling or input
    manipulation.
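The one-hot and many-hot encodings named under the Representation dimension can be illustrated with a short sketch. The 12-entry pitch-class vocabulary and the helper names `one_hot` and `many_hot` are assumptions chosen for illustration, not something the survey prescribes.

```python
# Illustrative only: one-hot encoding of a single note and many-hot encoding
# of a chord over 12 pitch classes (C=0, C#=1, ..., B=11).
PITCH_CLASSES = 12


def one_hot(index, size):
    """Vector with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def many_hot(indices, size):
    """Vector with a 1 at every position listed in `indices`."""
    vec = [0] * size
    for i in indices:
        vec[i] = 1
    return vec


note_c = one_hot(0, PITCH_CLASSES)              # the single note C
c_major = many_hot([0, 4, 7], PITCH_CLASSES)    # the chord C-E-G
```

One-hot suits monophonic content, where exactly one symbol is active per step. Many-hot allows several simultaneous notes, as in a chord or one column of a piano roll.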
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose a tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.
Comment: 209 pages. This paper is a simplified version of the book: J.-P.
Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music
Generation, Computational Synthesis and Creative Systems, Springer, 201