Non-Negative Group Sparsity with Subspace Note Modelling for Polyphonic Transcription
This work was supported by EPSRC Platform Grant EPSRC EP/K009559/1, EPSRC Grant EP/L027119/1, and EPSRC Grant EP/J010375/1
Transcribing Multi-Instrument Polyphonic Music With Hierarchical Eigeninstruments
This paper presents a general probabilistic model for transcribing single-channel music recordings containing multiple polyphonic instrument sources. The system requires no prior knowledge of the instruments present in the mixture (other than the number), although it can benefit from information about instrument type if available. In contrast to many existing polyphonic transcription systems, our approach explicitly models the individual instruments and is thereby able to assign detected notes to their respective sources. We use training instruments to learn a set of linear manifolds in model parameter space which are then used during transcription to constrain the properties of models fit to the target mixture. This leads to a hierarchical mixture-of-subspaces design which makes it possible to supply the system with prior knowledge at different levels of abstraction. The proposed technique is evaluated on both recorded and synthesized mixtures containing two, three, four, and five instruments each. We compare our approach in terms of transcription with (i.e., detected pitches must be associated with the correct instrument) and without source-assignment to another multi-instrument transcription system as well as a baseline non-negative matrix factorization (NMF) algorithm. For two-instrument mixtures evaluated with source-assignment, we obtain average frame-level F-measures of up to 0.52 in the completely blind transcription setting (i.e., no prior knowledge of the instruments in the mixture) and up to 0.67 if we assume knowledge of the basic instrument types. For transcription without source assignment, these numbers rise to 0.76 and 0.83, respectively
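The frame-level F-measures quoted above can be illustrated with a small sketch. It assumes the common definition in which true positives, false positives and false negatives are counted per frame over the sets of active pitches; the exact evaluation protocol of the paper is not given in the abstract, and the function name `frame_f_measure` is hypothetical.

```python
def frame_f_measure(reference, estimate):
    """Frame-level precision, recall and F-measure.

    Both arguments are lists of sets; element t holds the labels
    (for example MIDI pitch numbers) active in frame t.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same number of frames")
    tp = fp = fn = 0
    for ref, est in zip(reference, estimate):
        tp += len(ref & est)   # pitches detected and present
        fp += len(est - ref)   # pitches detected but absent
        fn += len(ref - est)   # pitches present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example (made-up data): three frames of MIDI pitches.
reference = [{60, 64}, {60, 64}, {62}]
estimate = [{60}, {60, 64, 67}, {62}]
precision, recall, f_measure = frame_f_measure(reference, estimate)
```

To score transcription with source assignment under the same assumption, the set elements could be `(instrument, pitch)` tuples instead of bare pitches, so that a correct pitch attributed to the wrong instrument counts as an error.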
Automatic Music Transcription using Structure and Sparsity
PhD thesis. Automatic Music Transcription seeks a machine understanding of a musical signal in terms of
pitch-time activations. One popular approach to this problem is the use of spectrogram decompositions,
whereby a signal matrix is decomposed over a dictionary of spectral templates, each
representing a note. Typically the decomposition is performed using gradient-descent-based
methods, implemented as multiplicative updates derived from Non-negative Matrix Factorisation
(NMF). The final representation may be expected to be sparse, as the musical signal itself is considered
to consist of few active notes. In this thesis some concepts that are familiar in the sparse
representations literature are introduced to the AMT problem. Structured sparsity assumes that
certain atoms tend to be active together. In the context of AMT this affords the use of subspace
modelling of notes, and non-negative group sparse algorithms are proposed in order to exploit
the greater modelling capability introduced. Stepwise methods are often used for decomposing
sparse signals and their use for AMT has previously been limited. Some new approaches to
AMT are proposed by incorporation of stepwise optimal approaches with promising results seen.
Dictionary coherence is used to provide recovery conditions for sparse algorithms. While such
guarantees are not possible in the context of AMT, it is found that coherence is a useful parameter
to consider, affording improved performance in spectrogram decompositions.
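As a minimal sketch of the spectrogram decomposition described above, the following code estimates non-negative activations H for a fixed dictionary W using the standard multiplicative update for the Euclidean cost. The choice of cost, the toy matrices and all function names are assumptions made for illustration; the thesis may use other divergences, and its group-sparse and stepwise algorithms are not reproduced here. Plain Python lists are used to keep the example dependency-free.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(V, W, H):
    """Squared Euclidean distance between V and its reconstruction W H."""
    R = matmul(W, H)
    return sum((v - r) ** 2 for v_row, r_row in zip(V, R) for v, r in zip(v_row, r_row))


def update_activations(V, W, H, n_iter=200, eps=1e-12):
    """Multiplicative updates H <- H * (W^T V) / (W^T W H) with W held fixed."""
    Wt = transpose(W)
    WtV = matmul(Wt, V)
    WtW = matmul(Wt, W)
    for _ in range(n_iter):
        WtWH = matmul(WtW, H)
        H = [
            [h * num / (den + eps) for h, num, den in zip(h_row, n_row, d_row)]
            for h_row, n_row, d_row in zip(H, WtV, WtWH)
        ]
    return H


# Toy data: 3 frequency bins, 2 note templates, 2 time frames.
W = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # columns are spectral templates
V = [[2.0, 0.0], [2.0, 3.0], [0.0, 3.0]]  # note 1 in frame 1, note 2 in frame 2
H0 = [[1.0, 1.0], [1.0, 1.0]]             # positive initial activations
H = update_activations(V, W, H0)
```

Because the updates only multiply by non-negative factors, activations that start positive stay non-negative, and entries for inactive notes shrink towards zero without reaching it exactly. This is one reason explicit sparsity constraints or thresholding are commonly applied afterwards.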
Automatic transcription of polyphonic music exploiting temporal evolution
PhD thesis. Automatic music transcription is the process of converting an audio recording
into a symbolic representation using musical notation. It has numerous applications
in music information retrieval, computational musicology, and the
creation of interactive systems. Even for expert musicians, transcribing polyphonic
pieces of music is not a trivial task, and while the problem of automatic
pitch estimation for monophonic signals is considered to be solved, the creation
of an automated system able to transcribe polyphonic music without setting
restrictions on the degree of polyphony and the instrument type still remains
open.
In this thesis, research on automatic transcription is performed by explicitly
incorporating information on the temporal evolution of sounds. First efforts address
the problem by focusing on signal processing techniques and by proposing
audio features utilising temporal characteristics. Techniques for note onset and
offset detection are also utilised for improving transcription performance. Subsequent
approaches propose transcription models based on shift-invariant probabilistic
latent component analysis (SI-PLCA), modeling the temporal evolution
of notes in a multiple-instrument case and supporting frequency modulations in
produced notes. Datasets and annotations for transcription research have also
been created during this work. Proposed systems have been privately as well as
publicly evaluated within the Music Information Retrieval Evaluation eXchange
(MIREX) framework. Proposed systems have been shown to outperform several
state-of-the-art transcription approaches.
Developed techniques have also been employed for other tasks related to music
technology, such as for key modulation detection, temperament estimation,
and automatic piano tutoring. Finally, proposed music transcription models
have also been utilized in a wider context, namely for modeling acoustic scenes.
A User-assisted Approach to Multiple Instrument Music Transcription
PhD thesis. The task of automatic music transcription has been studied for several decades
and is regarded as an enabling technology for a multitude of applications such
as music retrieval and discovery, intelligent music processing and large-scale
musicological analyses. It refers to the process of identifying the musical content
of a performance and representing it in a symbolic format. Despite its long
research history, fully automatic music transcription systems are still error prone
and often fail when more complex polyphonic music is analysed. This gives
rise to the question in what ways human knowledge can be incorporated in the
transcription process.
This thesis investigates ways to involve a human user in the transcription
process. More specifically, it is investigated how user input can be employed
to derive timbre models for the instruments in a music recording, which are
employed to obtain instrument-specific (parts-based) transcriptions.
A first investigation studies different types of user input in order to derive
instrument models by means of a non-negative matrix factorisation framework.
The transcription accuracy of the different models is evaluated and a method is
proposed that refines the models by allowing each pitch of each instrument to
be represented by multiple basis functions.
A second study aims at limiting the amount of user input to make the
method more applicable in practice. Different methods are considered to estimate
missing non-negative basis functions when only a subset of basis functions can
be extracted based on the user information.
A method is proposed to track the pitches of individual instruments over time
by means of a Viterbi framework in which the states at each time frame contain
several candidate instrument-pitch combinations. A transition probability is
employed that combines three different criteria: the frame-wise reconstruction
error of each combination, a pitch continuity measure that favours similar pitches
in consecutive frames, and an explicit activity model for each instrument. The
method is shown to outperform other state-of-the-art multi-instrument tracking
methods.
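The Viterbi-based tracking idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis method: each state is a single candidate pitch rather than an instrument-pitch combination, costs are additive rather than probabilities, only the reconstruction error and a pitch continuity penalty are modelled, and the explicit activity model is omitted. The function name `viterbi_track` and the parameter `jump_weight` are hypothetical.

```python
def viterbi_track(candidates, errors, jump_weight=1.0):
    """Minimum-cost pitch track through per-frame candidate lists.

    candidates[t] lists the candidate pitches in frame t.
    errors[t][i] is the reconstruction error of candidate i in frame t.
    The transition cost is jump_weight * |pitch difference|.
    """
    n_frames = len(candidates)
    cost = [list(errors[0])]                  # best cost of ending in each state
    back = [[None] * len(candidates[0])]      # back-pointers for path recovery
    for t in range(1, n_frames):
        frame_cost, frame_back = [], []
        for j, pitch in enumerate(candidates[t]):
            # Cheapest way to reach candidate j from any previous candidate.
            totals = [
                cost[t - 1][i] + jump_weight * abs(pitch - prev)
                for i, prev in enumerate(candidates[t - 1])
            ]
            best_i = min(range(len(totals)), key=totals.__getitem__)
            frame_cost.append(totals[best_i] + errors[t][j])
            frame_back.append(best_i)
        cost.append(frame_cost)
        back.append(frame_back)
    # Backtrack from the cheapest final state.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    path.reverse()
    return path


# Toy example (made-up numbers): frame 2 locally prefers pitch 72,
# but the continuity penalty keeps the track on pitch 60.
candidates = [[60, 72], [60, 72], [61, 72]]
errors = [[0.1, 0.5], [0.6, 0.5], [0.2, 0.4]]
track = viterbi_track(candidates, errors, jump_weight=0.1)  # [60, 60, 61]
```

The example shows the design trade-off: a larger `jump_weight` favours smooth tracks over frame-wise reconstruction accuracy, while a value of zero reduces the method to picking the lowest-error candidate in each frame independently.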
Finally, the extraction of instrument models that include phase information
is investigated as a step towards complex matrix decomposition. The phase
relations between the partials of harmonic sounds are explored as a time-invariant
property that can be employed to form complex-valued basis functions. The
application of the model for a user-assisted transcription task is illustrated with a saxophone example.
Automatic Transcription of Polyphonic Vocal Music
This paper presents a method for automatic music transcription applied to audio recordings of a cappella performances with multiple singers. We propose a system for multi-pitch detection and voice assignment that integrates an acoustic and a music language model. The acoustic model performs spectrogram decomposition, extending probabilistic latent component analysis (PLCA) using a six-dimensional dictionary with pre-extracted log-spectral templates. The music language model performs voice separation and assignment using hidden Markov models that apply musicological assumptions. By integrating the two models, the system is able to detect multiple concurrent pitches in polyphonic vocal music and assign each detected pitch to a specific voice type such as soprano, alto, tenor or bass (SATB). We compare our system against multiple baselines, achieving state-of-the-art results for both multi-pitch detection and voice assignment on a dataset of Bach chorales and another of barbershop quartets. We also present an additional evaluation of our system using varied pitch tolerance levels to investigate its performance at 20-cent pitch resolution
Exploiting Piano Acoustics in Automatic Transcription
This work was supported by a joint Queen Mary/China Scholarship Council Scholarship. In this thesis we exploit piano acoustics to automatically transcribe piano recordings into a symbolic representation: the pitch and timing of each detected note. To do so we use approaches based on non-negative matrix factorisation (NMF). To motivate the main contributions of this thesis, we provide two preparatory studies: a study of using a deterministic annealing EM algorithm in a matrix factorisation-based system, and a study of decay patterns of partials in real-world piano tones. Based on these studies, we propose two generative NMF-based models which explicitly model different piano acoustical features. The first is an attack/decay model that takes into account the time-varying timbre and decaying energy of piano sounds. The system divides a piano note into percussive attack and harmonic decay stages, and separately models the two parts using two sets of templates and amplitude envelopes. The two parts are coupled by the note activations. We simplify the decay envelope by an exponentially decaying function. The proposed method improves the performance of supervised piano transcription. The second model aims at using the spectral width of partials as an independent indicator of the duration of piano notes. Each partial is represented by a Gaussian function, with the spectral width indicated by the standard deviation. The spectral width is large in the attack part, but gradually decreases to a stable value and remains constant in the decay part. The model provides a new aspect to understand the time-varying timbre of piano notes, but further investigation is needed to use it effectively to improve piano transcription.
We demonstrate the utility of the proposed systems in piano music transcription and analysis. Results show that explicitly modelling piano acoustical features, especially temporal features, can improve the transcription performance.
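The simplification of the decay envelope to an exponentially decaying function can be illustrated with a short sketch. It assumes an envelope of the form a * exp(-r * t) over frame index t and recovers a and r by a least-squares line fit to the log-amplitude. The fitting procedure and the function names are assumptions for illustration only; the abstract does not say how the thesis estimates its decay parameters.

```python
import math


def decay_envelope(n_frames, rate, amplitude=1.0):
    """Exponential decay envelope: amplitude * exp(-rate * t) for each frame t."""
    return [amplitude * math.exp(-rate * t) for t in range(n_frames)]


def fit_decay(envelope):
    """Fit amplitude and decay rate by a straight-line fit to log(envelope).

    Requires at least two frames and strictly positive envelope values.
    Returns (amplitude, rate).
    """
    n = len(envelope)
    times = list(range(n))
    logs = [math.log(a) for a in envelope]
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs)) / sum(
        (t - t_mean) ** 2 for t in times
    )
    intercept = y_mean - slope * t_mean
    return math.exp(intercept), -slope


# Toy example (made-up values): a noise-free envelope is recovered
# up to floating-point error.
envelope = decay_envelope(20, rate=0.3, amplitude=2.0)
amplitude, rate = fit_decay(envelope)
```

On real piano tones the log-amplitude is rarely a perfectly straight line, so a fit like this would only approximate the decay of each partial.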
An End-to-End Neural Network for Polyphonic Music Transcription
We present a neural network model for polyphonic music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony or the number or type of instruments. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We investigate various neural network architectures for the acoustic models and compare their performance to two popular state-of-the-art acoustic models. We also present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications. We evaluate the model's performance on the MAPS dataset and show that the proposed model outperforms state-of-the-art transcription systems.
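The combination of acoustic and language model scores by beam search can be sketched as follows. This is a generic beam search under simplifying assumptions: the recurrent language model is replaced by a fixed, hypothetical `transition_logp(prev, cur)` function, frame labels are plain strings rather than pitch combinations, and the efficient variant mentioned in the abstract is not reproduced.

```python
import math


def beam_search(acoustic, transition_logp, beam_width=2):
    """Approximate search for the best-scoring label sequence.

    acoustic[t] maps each candidate label in frame t to its log-probability.
    transition_logp(prev, cur) scores moving from label prev to label cur.
    Returns (sequence, score) for the best hypothesis kept in the beam.
    """
    beams = [((), 0.0)]  # (partial sequence, accumulated log-score)
    for frame in acoustic:
        expanded = []
        for seq, score in beams:
            for label, logp in frame.items():
                lm = transition_logp(seq[-1], label) if seq else 0.0
                expanded.append((seq + (label,), score + logp + lm))
        # Keep only the beam_width highest-scoring hypotheses.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


def sticky(prev, cur):
    """Stand-in language model: staying on a label is more likely than changing."""
    return math.log(0.8) if prev == cur else math.log(0.2)


# Toy example (made-up probabilities): frame 2 slightly prefers "E"
# acoustically, but the language model keeps the sequence on "C".
acoustic = [
    {"C": math.log(0.6), "E": math.log(0.4)},
    {"C": math.log(0.45), "E": math.log(0.55)},
    {"C": math.log(0.7), "E": math.log(0.3)},
]
sequence, score = beam_search(acoustic, sticky, beam_width=2)  # ("C", "C", "C")
```

A wider beam explores more hypotheses at higher cost; with `beam_width=1` the search becomes greedy and can commit to an early label that a later frame would have overturned.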