Symbolic music generation aims to create musical notes that can help users
compose music, for example by generating target instrument tracks based on
provided source tracks. In practical scenarios with a predefined ensemble of
tracks and varied composition needs, an efficient and effective generative
model that can generate any target tracks from the remaining tracks becomes
crucial. However, previous efforts have fallen short of this need due to
limitations in their music representations and models. In this
paper, we introduce a framework known as GETMusic, with ``GET'' standing for
``GEnerate music Tracks.'' This framework encompasses a novel music
representation ``GETScore'' and a diffusion model ``GETDiff.'' GETScore
represents musical notes as tokens and organizes them in a 2D structure, with
tracks stacked vertically and time progressing horizontally.
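To make the layout concrete, here is a minimal sketch of such a track-by-time
token grid; the token ids, grid resolution, and pad symbol are illustrative
assumptions, not the paper's exact specification:

```python
# Minimal sketch of a GETScore-like grid: rows are tracks, columns are time.
# Token ids, resolution, and the PAD symbol are assumptions for illustration.
import numpy as np

NUM_TRACKS = 4   # hypothetical ensemble: e.g., melody, bass, drums, piano
NUM_STEPS = 16   # discrete time steps along the horizontal axis
PAD = 0          # reserved id meaning "no note at this position"

score = np.full((NUM_TRACKS, NUM_STEPS), PAD, dtype=np.int64)
score[0, 0] = 60   # a note token for track 0 at time step 0
score[2, 4] = 38   # a note token for track 2 at time step 4
```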
At each training step, every track of a music piece is randomly assigned as
either target or source. Training involves two processes: in the forward
process, target tracks are corrupted by masking their tokens, while source
tracks remain as the ground truth; in the denoising process, GETDiff is
trained to predict the masked target tokens conditioned on the source tracks.
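The sketch below illustrates one such training step under simplifying
assumptions: the forward process is collapsed into a single masking step
(whereas GETDiff, as a diffusion model, corrupts and denoises over multiple
timesteps), and the MASK id, model interface, and cross-entropy loss are
hypothetical stand-ins:

```python
# Illustrative single training step; not GETDiff's actual implementation.
import torch
import torch.nn.functional as F

MASK = 1  # hypothetical token id marking corrupted positions

def training_step(model, score, optimizer):
    # score: (num_tracks, num_steps) LongTensor of GETScore-like tokens
    is_target = torch.rand(score.size(0)) < 0.5   # random target/source split
    if not is_target.any():
        is_target[0] = True                       # ensure at least one target
    corrupted = score.clone()
    corrupted[is_target] = MASK                   # forward process: mask targets
    logits = model(corrupted)                     # (num_tracks, num_steps, vocab)
    # denoising objective: recover masked tokens conditioned on source tracks
    loss = F.cross_entropy(
        logits[is_target].reshape(-1, logits.size(-1)),
        score[is_target].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```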
Our proposed representation, coupled with the non-autoregressive generative
model, empowers GETMusic to generate music with arbitrary source-target track
combinations.
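At inference, this flexibility amounts to masking whichever tracks should be
generated and letting the model fill them in; the sketch below uses a single
greedy decoding pass in place of GETDiff's iterative denoising, and all names
are illustrative:

```python
# Illustrative inference: any subset of tracks can be the target. A single
# greedy argmax pass stands in for GETDiff's iterative denoising loop.
def generate(model, score, target_tracks):
    corrupted = score.clone()
    corrupted[target_tracks] = MASK               # mask the tracks to generate
    logits = model(corrupted)
    predicted = logits.argmax(dim=-1)             # (num_tracks, num_steps)
    corrupted[target_tracks] = predicted[target_tracks]
    return corrupted

# e.g., generate tracks 1 and 3 conditioned on tracks 0 and 2:
# completed = generate(model, score_tensor, torch.tensor([1, 3]))
```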
Our experiments demonstrate that the versatile GETMusic outperforms prior
works designed for specific composition tasks.