37 research outputs found
Vocal imitation for query by vocalisation
PhD Thesis. The human voice presents a rich and powerful medium for expressing sonic ideas such as musical sounds. This capability extends beyond the sounds used in speech, evidenced for example in the art form of beatboxing, and recent studies highlighting the utility of vocal imitation for communicating sonic concepts. Meanwhile, the advance of digital audio has resulted in huge libraries of sounds at the disposal of music producers and sound designers. This presents a compelling search problem: with larger search spaces, the task of navigating sound libraries has become increasingly difficult. The versatility and expressive nature of the voice provides a seemingly ideal medium for querying sound libraries, raising the question of how well humans are able to vocally imitate
musical sounds, and how we might use the voice as a tool for search. In this thesis we address these questions by investigating the ability of musicians to
vocalise synthesised and percussive sounds, and evaluate the suitability of different audio features for predicting the perceptual similarity between vocal
imitations and imitated sounds.
In the first experiment, musicians were tasked with imitating synthesised sounds with one or two time-varying feature envelopes applied. The results
show that participants were able to imitate pitch, loudness, and spectral centroid features accurately, and that imitation accuracy was generally preserved
when the imitated stimuli combined two, not necessarily congruent, features. This demonstrates the viability of using the voice as a natural means of
expressing time series of two features simultaneously. The second experiment consisted of two parts. In a vocal production task,
musicians were asked to imitate drum sounds. Listeners were then asked to rate the similarity between the imitations and sounds from the same category
(e.g. kick, snare etc.). The results show that drum sounds received the highest similarity ratings when rated against their imitations (as opposed to imitations of another sound), and overall more than half the imitated sounds were correctly identified with above chance accuracy from the imitations, although
this varied considerably between drum categories.
The findings from the vocal imitation experiments highlight the capacity of musicians to vocally imitate musical sounds, and some limitations of non-verbal
vocal expression. Finally, we investigated the performance of different audio features as predictors of perceptual similarity between the imitations and
imitated sounds from the second experiment. We show that features learned using convolutional auto-encoders outperform a number of popular heuristic
features for this task, and that preservation of temporal information is more important than spectral resolution for differentiating between the vocal imitations and same-category drum sounds.
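As an illustration of one of the heuristic features named in this abstract, the sketch below computes a spectral centroid as the magnitude-weighted mean frequency of a single spectrum frame. It is a minimal sketch under stated assumptions, not the thesis's feature extraction code: the function name and the plain-list inputs are hypothetical, and the frame's bin frequencies and magnitudes are assumed to be already available.

```python
def spectral_centroid(frequencies_hz, magnitudes):
    """Magnitude-weighted mean frequency of one spectrum frame.

    Both arguments are equal-length sequences: the centre frequency of
    each bin (Hz) and the non-negative magnitude measured in that bin.
    """
    total = sum(magnitudes)
    if total == 0:
        # Silent frame: the centroid is undefined, so return 0.0 by convention.
        return 0.0
    return sum(f * m for f, m in zip(frequencies_hz, magnitudes)) / total


# A flat spectrum is centred on the middle bin; energy in the top bin
# alone pulls the centroid up to that bin's frequency.
flat = spectral_centroid([100.0, 200.0, 300.0], [1.0, 1.0, 1.0])
bright = spectral_centroid([100.0, 200.0, 300.0], [0.0, 0.0, 1.0])
```

Tracking this value frame by frame gives the kind of time-varying envelope that participants were asked to imitate.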
Data-Driven Query by Vocal Percussion
The imitation of percussive sounds via the human voice is a natural and effective tool for communicating rhythmic ideas on the fly. Query by Vocal Percussion (QVP) is a subfield in Music Information Retrieval (MIR) that explores techniques to query percussive sounds using vocal imitations as input, usually plosive consonant sounds. In this way, fully automated QVP systems can help artists prototype drum patterns in a comfortable and quick way, smoothing the creative workflow as a result. This project explores the potential usefulness of recent data-driven neural network models in two of the most important tasks in QVP. Algorithms for Vocal Percussion Transcription (VPT) detect and classify vocal percussion sound events in a beatbox-like performance so as to trigger individual drum samples. Algorithms for Drum Sample Retrieval by Vocalisation (DSRV) use input vocal imitations to pick appropriate drum samples from a sound library via timbral similarity. Our experiments with several kinds of data-driven deep neural networks suggest that these achieve better results in both VPT and DSRV compared to traditional data-informed approaches based on heuristic audio features. We also find that these networks, when paired with strong regularisation techniques, can still outperform data-informed approaches when data is scarce. Finally, we gather several insights into people’s approach to vocal percussion and how user-based algorithms are essential to better model individual differences in vocalisation styles.
Spectral and Temporal Timbral Cues of Vocal Imitations of Drum Sounds
The imitation of non-vocal sounds using the human voice is a resource we sometimes rely on when
communicating sound concepts to other people. Query by Vocal Percussion (QVP) is a subfield in Music
Information..
SampleMatch: Drum Sample Retrieval by Musical Context
Modern digital music production typically involves combining numerous
acoustic elements to compile a piece of music. Important types of such elements
are drum samples, which determine the characteristics of the percussive
components of the piece. Artists must use their aesthetic judgement to assess
whether a given drum sample fits the current musical context. However,
selecting drum samples from a potentially large library is tedious and may
interrupt the creative flow. In this work, we explore automatic drum sample
retrieval based on aesthetic principles learned from data. As a result, artists
can rank the samples in their library by fit to some musical context at
different stages of the production process (i.e., by fit to incomplete song
mixtures). To this end, we use contrastive learning to maximize the score of
drum samples originating from the same song as the mixture. We conduct a
listening test to determine whether the human ratings match the automatic
scoring function. We also perform objective quantitative analyses to evaluate
the efficacy of our approach.
Comment: 8 pages, 3 figures, 1 table; accepted at the ISMIR conference,
Bengaluru, India, 202
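The contrastive objective described in this abstract can be sketched as follows. This is a hedged illustration, not the SampleMatch implementation: it assumes the mixture and each drum sample have already been mapped to fixed-length embedding vectors, scores a pair by dot product, and uses an InfoNCE-style softmax loss. The function names, the dot-product score, and the `temperature` value are all assumptions made for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (the assumed scoring function)."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(mixture_emb, sample_embs, positive_index, temperature=0.1):
    """InfoNCE-style loss: small when the positive sample scores highest.

    positive_index marks the sample taken from the same song as the mixture;
    every other entry in sample_embs acts as a negative.
    """
    scores = [dot(mixture_emb, s) / temperature for s in sample_embs]
    # Numerically stable log-sum-exp over all candidate samples.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return -(scores[positive_index] - log_norm)


def rank_samples(mixture_emb, sample_embs):
    """Return sample indices ordered from best to worst fit to the mixture."""
    return sorted(
        range(len(sample_embs)),
        key=lambda i: dot(mixture_emb, sample_embs[i]),
        reverse=True,
    )


# Toy two-dimensional embeddings: sample 0 aligns with the mixture,
# sample 1 is orthogonal to it, sample 2 points the opposite way.
mixture = [1.0, 0.0]
samples = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ranking = rank_samples(mixture, samples)
loss_if_aligned_is_positive = contrastive_loss(mixture, samples, 0)
loss_if_opposed_is_positive = contrastive_loss(mixture, samples, 2)
```

Minimising such a loss over many (mixture, sample) pairs pushes same-song samples towards the top of the ranking, which is the behaviour a library-ranking tool needs.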
A New Dataset for Amateur Vocal Percussion Analysis
The imitation of percussive instruments via the human voice is a natural way
for us to communicate rhythmic ideas and, for this reason, it attracts the
interest of music makers. Specifically, the automatic mapping of these vocal
imitations to their emulated instruments would allow creators to realistically
prototype rhythms in a faster way. The contribution of this study is two-fold.
Firstly, a new Amateur Vocal Percussion (AVP) dataset is introduced to
investigate how people with little or no experience in beatboxing approach the
task of vocal percussion. The end-goal of this analysis is that of helping
mapping algorithms to better generalise between subjects and achieve higher
performances. The dataset comprises a total of 9780 utterances recorded by 28
participants with fully annotated onsets and labels (kick drum, snare drum,
closed hi-hat and opened hi-hat). Secondly, we conducted baseline experiments on
audio onset detection with the recorded dataset, comparing the performance of
four state-of-the-art algorithms in a vocal percussion context.
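Comparing onset detectors against annotated onsets is commonly done with precision, recall, and F-measure under a small timing tolerance. The sketch below shows one way to compute these. It is an illustration only: the abstract does not state the evaluation protocol, so the greedy matching strategy, the function name, and the 50 ms default tolerance are assumptions.

```python
def onset_f_measure(reference, detected, tolerance=0.05):
    """Score detected onsets against reference onsets (times in seconds).

    Each reference onset may be matched by at most one detection that
    lies within +/- tolerance seconds. Returns (f_measure, precision, recall).
    """
    reference = sorted(reference)
    detected = sorted(detected)
    matched = 0
    i = j = 0
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # detection precedes every remaining reference: false positive
        else:
            i += 1  # no detection close enough to this reference: missed onset
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    return 2 * precision * recall / (precision + recall), precision, recall


# Three annotated onsets; the detector finds two of them and adds one
# spurious onset at 1.2 s, so two of three detections count as hits.
f_measure, precision, recall = onset_f_measure([0.5, 1.0, 1.5], [0.52, 1.2, 1.49])
```

Running the same function over the output of each candidate algorithm gives directly comparable scores.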
Vocal imitation for query by vocalisation
PhD. The human voice presents a rich and powerful medium for expressing sonic
ideas such as musical sounds. This capability extends beyond the sounds used
in speech, evidenced for example in the art form of beatboxing, and recent
studies highlighting the utility of vocal imitation for communicating sonic concepts.
Meanwhile, the advance of digital audio has resulted in huge libraries of
sounds at the disposal of music producers and sound designers. This presents
a compelling search problem: with larger search spaces, the task of navigating
sound libraries has become increasingly difficult. The versatility and expressive
nature of the voice provides a seemingly ideal medium for querying sound
libraries, raising the question of how well humans are able to vocally imitate
musical sounds, and how we might use the voice as a tool for search. In this
thesis we address these questions by investigating the ability of musicians to
vocalise synthesised and percussive sounds, and evaluate the suitability of different
audio features for predicting the perceptual similarity between vocal
imitations and imitated sounds.
In the first experiment, musicians were tasked with imitating synthesised
sounds with one or two time-varying feature envelopes applied. The results
show that participants were able to imitate pitch, loudness, and spectral centroid
features accurately, and that imitation accuracy was generally preserved
when the imitated stimuli combined two, not necessarily congruent, features.
This demonstrates the viability of using the voice as a natural means of
expressing time series of two features simultaneously.
The second experiment consisted of two parts. In a vocal production task,
musicians were asked to imitate drum sounds. Listeners were then asked to
rate the similarity between the imitations and sounds from the same category
(e.g. kick, snare etc.). The results show that drum sounds received the highest
similarity ratings when rated against their imitations (as opposed to imitations
of another sound), and overall more than half the imitated sounds were
correctly identified with above chance accuracy from the imitations, although
this varied considerably between drum categories.
The findings from the vocal imitation experiments highlight the capacity
of musicians to vocally imitate musical sounds, and some limitations of
non-verbal vocal expression. Finally, we investigated the performance of different
audio features as predictors of perceptual similarity between the imitations and
imitated sounds from the second experiment. We show that features learned
using convolutional auto-encoders outperform a number of popular heuristic
features for this task, and that preservation of temporal information is more
important than spectral resolution for differentiating between the vocal imitations
and same-category drum sounds.
Engineering and Physical Sciences Research Council (EP/G03723X/1)
Automatic characterization and generation of music loops and instrument samples for electronic music production
Repurposing audio material to create new music - also known as sampling - was a foundation of electronic music and is a fundamental component of this practice. Currently, large-scale databases of audio offer vast collections of audio material for users to work with. The navigation on these databases is heavily focused on hierarchical tree directories. Consequently, sound retrieval is tiresome and often identified as an undesired interruption in the creative process.
We address two fundamental methods for navigating sounds: characterization and generation. Characterizing loops and one-shots in terms of instruments or instrumentation allows for organizing unstructured collections and a faster retrieval for music-making. The generation of loops and one-shot sounds enables the creation of new sounds not present in an audio collection through interpolation or modification of the existing material. To achieve this, we employ deep-learning-based data-driven methodologies for classification and generation.
Musical source separation with deep learning and large-scale datasets
Throughout this thesis we will explore automatic music source separation by utilizing modern (at the time of writing) techniques and tools from machine learning and big data processing. The bulk of this work was carried out between 2016 and 2019.
In Chapter 2 we conduct a review of source separation literature. We start by outlining a subset of applications of source separation in some depth. We describe some of the early, pioneering work in automatic source separation: Auditory Scene Analysis, and its digital counterpart, Computational Auditory Scene Analysis.
We then introduce matrix decomposition-based methods such as Independent Component Analysis and Non-Negative Matrix Factorization, and pitch-informed methods where the separation algorithm is guided by pitch information that is known a priori. We briefly discuss user-guided methods, before conducting a thorough review of Deep Learning based source separation, including recurrent, convolutional, deep clustering-based, and Generative Adversarial Networks.
We then proceed to describe common evaluation metrics and training datasets. Finally, we list a number of current challenges and drawbacks of current systems.
Chapter 3 focuses on datasets for musical source separation. First we show the growth of dataset sizes for both machine learning in general and music information retrieval specifically. We give several examples of the complexities and idiosyncrasies that are intrinsic to music datasets. We then proceed to present a method for extracting ground truth data for source separation from large unstructured musical catalogs.
In Chapter 4 we design a novel deep learning-based source separation algorithm. Motivation is provided by means of a musicological study that showed the high importance of vocals relative to other musical factors, in the minds of listeners. At the core of the vocal separation algorithm is the U-Net, a deep learning architecture that uses skip connections to preserve fine-grained detail. It was originally developed in the biomedical imaging domain, and later adapted to image-to-image translation. We adapt it to the source separation domain by treating spectrograms as images, and we use the dataset mining methods from Chapter 3 to generate sufficiently large training data. We evaluate our model objectively using standard evaluation metrics, and subjectively using "crowdsourced" human subjects. To the best of our knowledge, this is the first use of U-Nets for source separation.
In the introduction above we proposed joint learning to optimize source separation and other objectives. In Chapter 5 we investigate one such instance: multi-task learning of vocal removal and vocal pitch tracking. We combine the vocal separation model from Chapter 4 with a state-of-the-art pitch salience estimation model, exploring several ways of combining the two models. We find that vocal pitch estimation benefits from joint learning when the two tasks are trained in sequence, with the source separation model preceding the pitch estimation model. We also report benefits from fine-tuning by iteratively applying the model.
Chapter 6 extends the U-Net model to multiple instruments. In order to minimize the phase artifacts that were a common issue in Chapter 4, we modify the model to operate in the complex domain. We run experiments with several loss functions: Time-domain loss, magnitude-only frequency domain loss, and joint time and frequency-domain loss. Our experiments are evaluated both objectively and subjectively, and we carry out extensive qualitative analysis to investigate the effects of complex masking.
Finally, we conclude the thesis in Chapter 7 by summarizing this work and highlighting several future directions of research.
DMRN+16: Digital Music Research Network One-day Workshop 2021
DMRN+16: Digital Music Research Network One-day Workshop 2021, Queen Mary University of London, Tuesday 21st December 2021.
Keynote speakers
Keynote 1: Prof. Sophie Scott, Director, Institute of Cognitive Neuroscience, UCL. Title: "Sound on the brain - insights from functional neuroimaging and neuroanatomy". Abstract: In this talk I will use functional imaging and models of primate neuroanatomy to explore how sound is processed in the human brain. I will demonstrate that sound is represented cortically in different parallel streams. I will expand this to show how this can impact on the concept of auditory perception, which arguably incorporates multiple kinds of distinct perceptual processes. I will address the roles that subcortical processes play in this, and also the contributions from hemispheric asymmetries.
Keynote 2: Prof. Gus Xia, Assistant Professor at NYU Shanghai. Title: "Learning interpretable music representations: from human stupidity to artificial intelligence". Abstract: Gus has been leading the Music X Lab in developing intelligent systems that help people better compose and learn music. In this talk, he will show us the importance of music representation for both humans and machines, and how to learn better music representations via the design of inductive bias. Once we have interpretable music representations, the potential applications are limitless.
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective - What musical content is to be generated? Examples are: melody,
polyphony, accompaniment or counterpoint. - For what destination and for what
use? To be performed by a human(s) (in the case of a musical score), or by a
machine (in the case of an audio file).
Representation - What are the concepts to be manipulated? Examples are:
waveform, spectrogram, note, chord, meter and beat. - What format is to be
used? Examples are: MIDI, piano roll or text. - How will the representation be
encoded? Examples are: scalar, one-hot or many-hot.
Architecture - What type(s) of deep neural network is (are) to be used?
Examples are: feedforward network, recurrent network, autoencoder or generative
adversarial networks.
Challenge - What are the limitations and open challenges? Examples are:
variability, interactivity and creativity.
Strategy - How do we model and control the process of generation? Examples
are: single-step feedforward, iterative feedforward, sampling or input
manipulation.
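The encoding options named under Representation (one-hot and many-hot) can be illustrated with a short sketch. The example below is an assumption-laden toy, not code from the survey: it encodes pitch classes over a 12-element vector, and the helper names and the choice of pitch classes as the vocabulary are hypothetical.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index (one active item)."""
    vec = [0] * size
    vec[index] = 1
    return vec


def many_hot(indices, size):
    """Vector of zeros with a 1 at every given index (several active items)."""
    vec = [0] * size
    for i in indices:
        vec[i] = 1
    return vec


PITCH_CLASSES = 12  # C, C#, D, ..., B

# A single note (C) fits a one-hot encoding.
note_c = one_hot(0, PITCH_CLASSES)

# A chord sounds several notes at once, so it needs a many-hot encoding:
# C major = C (0), E (4), G (7).
c_major = many_hot([0, 4, 7], PITCH_CLASSES)
```

One-hot suits monophonic material such as a melody, while many-hot allows simultaneous notes, as in polyphony or a piano roll column.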
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose some tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.
Comment: 209 pages. This paper is a simplified version of the book: J.-P.
Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music
Generation, Computational Synthesis and Creative Systems, Springer, 201