Synthetic creation of drum sounds (e.g., in drum machines) is commonly
performed using analog or digital synthesis, allowing a musician to sculpt the
desired timbre by modifying various parameters. Typically, such parameters control
low-level features of the sound and often have no musical meaning or perceptual
correspondence. With the rise of Deep Learning, data-driven processing of audio
emerges as an alternative to traditional signal processing. This new paradigm
allows controlling the synthesis process through learned high-level features or
by conditioning a model on musically relevant information. In this paper, we
apply a Generative Adversarial Network to the task of audio synthesis of drum
sounds. By conditioning the model on perceptual features computed with a
publicly available feature extractor, we gain intuitive control over the
generation process. The experiments are carried out on a large collection of
kick, snare, and cymbal sounds. We show that, compared to a specific prior work
based on a U-Net architecture, our approach considerably improves the quality
of the generated drum samples, and that the conditional input indeed shapes the
perceptual characteristics of the sounds. Also, we provide audio examples and
release the code used in our experiments.

Comment: 8 pages, 1 figure, 3 tables, accepted in Proc. of the 21st
International Society for Music Information Retrieval Conference (ISMIR 2020)
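To make the conditioning idea concrete: a conditional generator receives the perceptual feature vector alongside the latent noise vector, typically by concatenation at the input. The following is a minimal toy sketch of that mechanism, not the paper's architecture; the dimensions, the single linear layer, and all values are illustrative assumptions (a real GAN generator is a deep network trained adversarially).

```python
import math
import random

random.seed(0)

# Assumed toy dimensions (a real model would be far larger).
LATENT_DIM = 8    # random noise input
FEATURE_DIM = 3   # perceptual conditioning features (e.g. brightness)
OUT_DIM = 16      # "audio" samples produced

# Toy single-layer generator weights; stands in for a trained network.
W = [[random.gauss(0, 0.1) for _ in range(OUT_DIM)]
     for _ in range(LATENT_DIM + FEATURE_DIM)]

def generate(z, features):
    """Concatenate noise and conditioning features, map to a waveform-like
    output squashed into [-1, 1] with tanh."""
    x = z + features  # conditioning enters the model here
    return [math.tanh(sum(xi * W[i][j] for i, xi in enumerate(x)))
            for j in range(OUT_DIM)]

z = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
features = [0.8, 0.1, 0.5]  # made-up perceptual feature values
audio = generate(z, features)
print(len(audio))  # 16
```

Changing `features` while holding `z` fixed alters the output deterministically, which is the sense in which the conditional input "shapes" the generated sound.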