27 research outputs found
Automatic music transcription: challenges and future directions
Automatic music transcription is considered by many to be a key enabling technology in music signal processing. However, the performance of transcription systems is still significantly below that of a human expert, and accuracies reported in recent years seem to have reached a limit, although the field is still very active. In this paper we analyse the limitations of current methods and identify promising directions for future research. Current transcription methods use general-purpose models which are unable to capture the rich diversity found in music signals. One way to overcome the limited performance of transcription systems is to tailor algorithms to specific use-cases. Semi-automatic approaches are another way of achieving a more reliable transcription. Also, the wealth of musical scores and corresponding audio data now available is a rich potential source of training data, via forced alignment of audio to scores, but large-scale utilisation of such data has yet to be attempted. Other promising approaches include the integration of information from multiple algorithms and different musical aspects.
Automatic transcription of polyphonic music exploiting temporal evolution
Automatic music transcription is the process of converting an audio recording into a symbolic representation using musical notation. It has numerous applications in music information retrieval, computational musicology, and the creation of interactive systems. Even for expert musicians, transcribing polyphonic pieces of music is not a trivial task, and while the problem of automatic pitch estimation for monophonic signals is considered to be solved, the creation of an automated system able to transcribe polyphonic music without restrictions on the degree of polyphony or the instrument type remains open.
In this thesis, research on automatic transcription is performed by explicitly incorporating information on the temporal evolution of sounds. First efforts address the problem by focusing on signal processing techniques and by proposing audio features that utilise temporal characteristics. Techniques for note onset and offset detection are also employed to improve transcription performance. Subsequent approaches propose transcription models based on shift-invariant probabilistic latent component analysis (SI-PLCA), modelling the temporal evolution of notes in a multiple-instrument setting and supporting frequency modulations in produced notes. Datasets and annotations for transcription research have also been created during this work. The proposed systems have been evaluated both privately and publicly within the Music Information Retrieval Evaluation eXchange (MIREX) framework, and have been shown to outperform several state-of-the-art transcription approaches.
The developed techniques have also been employed in other music technology tasks, such as key modulation detection, temperament estimation, and automatic piano tutoring. Finally, the proposed transcription models have been utilised in a wider context, namely for modelling acoustic scenes.
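The probabilistic core of the PLCA family of models used in the thesis can be sketched in a few lines of NumPy. This baseline deliberately omits the shift-invariance (convolutive templates across log-frequency) that defines SI-PLCA proper, and all shapes and data are illustrative:

```python
import numpy as np

def plca(V, k, n_iter=500, seed=0):
    """Minimal PLCA via EM: factor a normalized spectrogram as
    P(f, t) ~= sum_z P(f|z) P(z|t) P(t). Only the probabilistic core is
    shown; the SI-PLCA models in the thesis additionally convolve
    shift-invariant templates across log-frequency."""
    rng = np.random.default_rng(seed)
    P = V / V.sum()                                       # joint P(f, t)
    Pt = P.sum(axis=0)                                    # marginal P(t)
    W = rng.random((V.shape[0], k)); W /= W.sum(axis=0)   # P(f|z)
    H = rng.random((k, V.shape[1])); H /= H.sum(axis=0)   # P(z|t)
    for _ in range(n_iter):
        Q = P / (W @ (H * Pt) + 1e-12)    # E-step: data / model ratio
        W_new = W * (Q @ (H * Pt).T)      # M-step: reweight P(f|z)
        H_new = H * (W.T @ Q)             # M-step: reweight P(z|t)
        W = W_new / (W_new.sum(axis=0) + 1e-12)
        H = H_new / (H_new.sum(axis=0) + 1e-12)
    return W, H, Pt

# Fit a toy low-rank nonnegative "spectrogram".
rng = np.random.default_rng(1)
V = np.abs(rng.standard_normal((16, 2))) @ np.abs(rng.standard_normal((2, 24)))
W, H, Pt = plca(V, 2)
```

Because every factor is a normalized distribution, priors on the templates or activations slot in naturally, which is what makes the framework attractive for modelling temporal evolution.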
Multi-Pitch Estimation Exploiting Block Sparsity
We study the problem of estimating the fundamental frequencies of a signal containing multiple harmonically related sinusoidal components using a novel block sparse signal representation. An efficient algorithm for solving the resulting optimization problem is devised exploiting a novel variable step-size alternating direction method of multipliers (ADMM). The resulting algorithm has guaranteed convergence and shows notable robustness to the f0 versus f0/2 ambiguity problem. The superiority of the proposed method, as compared to earlier presented estimation techniques, is demonstrated using both simulated and measured audio signals, clearly indicating the preferable performance of the proposed technique.
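The block-sparse penalty in such representations is typically handled inside each ADMM iteration by the proximal operator of a group l2-norm: every candidate fundamental keeps or loses its whole block of harmonic amplitudes at once, which is what suppresses spurious sub-octave solutions. A minimal NumPy sketch, in which the block layout and penalty weight are illustrative choices rather than the paper's:

```python
import numpy as np

def block_soft_threshold(z, lam):
    """Proximal operator of the group l2 (block sparsity) penalty: the
    whole coefficient block is shrunk jointly, and is zeroed entirely
    when its l2 norm falls below lam."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

# Each block holds the harmonic amplitudes of one f0 candidate; a weak
# (spurious) candidate is removed as a whole block.
strong = block_soft_threshold(np.array([0.9, 0.5, 0.2]), lam=0.3)
weak = block_soft_threshold(np.array([0.05, 0.04, 0.02]), lam=0.3)
```

The variable step-size machinery of the paper's ADMM scheme sits around this step and is not reproduced here.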
Audio source separation for music in low-latency and high-latency scenarios
This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
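The appeal of Tikhonov regularization in the low-latency setting is that decomposing each spectral frame has a closed-form solution, avoiding the iterative updates of NMF-style methods. A minimal sketch, with an illustrative random basis standing in for learned spectral templates; note that, unlike NMF, the recovered gains are not constrained to be nonnegative:

```python
import numpy as np

def tikhonov_decompose(B, x, lam=0.1):
    """Decompose one spectrum frame x onto the basis B by solving
    argmin_g ||x - B g||^2 + lam ||g||^2, which has the closed form
    g = (B^T B + lam I)^{-1} B^T x. One linear solve per frame makes
    this suitable for low-latency use."""
    k = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(k), B.T @ x)

# Toy check: with a tiny lam, an exactly representable frame is recovered.
rng = np.random.default_rng(0)
B = np.abs(rng.standard_normal((64, 8)))   # illustrative spectral basis
g_true = np.abs(rng.standard_normal(8))
x = B @ g_true
g = tikhonov_decompose(B, x, lam=1e-8)
```

In practice B^T B + lam I can be factorized once offline, leaving only cheap back-substitutions at run time.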
BaNa: a noise resilient fundamental frequency detection algorithm for speech and music
Fundamental frequency (F0) is one of the essential features in many acoustic related applications. Although numerous F0 detection algorithms have been developed, the detection accuracy in noisy environments still needs improvement. We present a hybrid noise resilient F0 detection algorithm named BaNa that combines the approaches of harmonic ratios and Cepstrum analysis. A Viterbi algorithm with a cost function is used to identify the F0 value among several F0 candidates. Speech and music databases with eight different types of additive noise are used to evaluate the performance of the BaNa algorithm and several classic and state-of-the-art F0 detection algorithms. Results show that for almost all types of noise and signal-to-noise ratio (SNR) values investigated, BaNa achieves the lowest Gross Pitch Error (GPE) rate among all the algorithms. Moreover, for the 0 dB SNR scenarios, the BaNa algorithm is shown to achieve 20% to 35% GPE rate for speech and 12% to 39% GPE rate for music. We also describe implementation issues that must be addressed to run the BaNa algorithm as a real-time application on a smartphone platform.
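The Viterbi stage described above can be illustrated with a small dynamic program over per-frame F0 candidates. The log2 transition penalty and the candidate saliences below are illustrative choices, not BaNa's actual cost function:

```python
import numpy as np

def viterbi_f0_track(frames, jump_cost=1.0):
    """Choose one F0 per frame from scored candidates with dynamic
    programming, penalizing large pitch jumps between frames.
    frames: list of lists of (f0_hz, salience) candidate pairs."""
    cost = np.array([-s for _, s in frames[0]])
    back = []
    for t in range(1, len(frames)):
        prev_f = np.array([f for f, _ in frames[t - 1]], dtype=float)
        new_cost, pointers = [], []
        for f, s in frames[t]:
            trans = cost + jump_cost * np.abs(np.log2(f / prev_f))
            j = int(np.argmin(trans))
            new_cost.append(trans[j] - s)
            pointers.append(j)
        cost = np.array(new_cost)
        back.append(pointers)
    path = [int(np.argmin(cost))]          # backtrack the cheapest path
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [frames[t][i][0] for t, i in enumerate(path)]

# A transient octave error (440 Hz scoring higher in the middle frame)
# is overridden by temporal continuity.
frames = [[(220, 1.0), (440, 0.2)],
          [(220, 0.5), (440, 0.6)],
          [(220, 1.0), (440, 0.2)]]
track = viterbi_f0_track(frames)  # -> [220, 220, 220]
```

This is why hybrid candidate generation (harmonic ratios plus Cepstrum) pays off: the tracker only needs the true F0 to appear among the candidates, not to win every frame.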
Constrained Nonnegative Matrix Factorization with Applications to Music Transcription
In this work we explore using nonnegative matrix factorization (NMF) for music transcription, as well as several other applications. NMF is an unsupervised learning method capable of finding a parts-based additive model of data. Since music has an additive property (each time point in a musical piece is composed of a sum of notes) NMF is a natural fit for analysis. NMF is able to exploit this additivity in order to factorize out both the individual notes and the transcription from an audio sample.
In order to improve the performance of NMF, we apply different constraints to the model. We consider sparsity as well as piecewise smoothness with aligned breakpoints. We demonstrate our method on real music data and show promising results which exceed the current state of the art. Other applications are also considered, such as instrument and speaker separation and handwritten character analysis.
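The unconstrained baseline that the sparsity and smoothness constraints extend is standard multiplicative-update NMF (Lee and Seung, Euclidean cost). A minimal sketch, with random low-rank data standing in for a magnitude spectrogram:

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0):
    """Plain multiplicative-update NMF with Euclidean cost: V ~= W H with
    W, H >= 0. The constraints discussed above (sparsity, piecewise
    smoothness with aligned breakpoints) would add extra terms to these
    updates; this is the unconstrained baseline."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

# For transcription, columns of W play the role of note spectra and rows
# of H their activations over time.
rng = np.random.default_rng(1)
V = np.abs(rng.standard_normal((20, 3))) @ np.abs(rng.standard_normal((3, 30)))
W, H = nmf(V, 3)
```

The multiplicative form preserves nonnegativity automatically, which is the property that makes the factors interpretable as note spectra and activations.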
Real-time detection of overlapping sound events with non-negative matrix factorization
In this paper, we investigate the problem of real-time detection of overlapping sound events by employing non-negative matrix factorization techniques. We consider a setup where audio streams arrive in real-time to the system and are decomposed onto a dictionary of event templates learned off-line prior to the decomposition. An important drawback of existing approaches in this context is the lack of controls on the decomposition. We propose and compare two provably convergent algorithms that address this issue, by controlling respectively the sparsity of the decomposition and the trade-off of the decomposition between the different frequency components. Sparsity regularization is considered in the framework of convex quadratic programming, while frequency compromise is introduced by employing the beta-divergence as a cost function. The two algorithms are evaluated on the multi-source detection tasks of polyphonic music transcription, drum transcription and environmental sound recognition. The obtained results show how the proposed approaches can improve detection in such applications, while maintaining low computational costs suitable for real-time applications.
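In the fixed-dictionary setting described above, only the activations of each incoming frame need to be estimated at run time. The sketch below uses simple multiplicative updates with an optional l1 sparsity term purely for illustration; the paper's two algorithms (convex quadratic programming and beta-divergence updates) are not reproduced here:

```python
import numpy as np

def decompose_frame(W, x, n_iter=50, sparsity=0.0):
    """Estimate activations h >= 0 of a single incoming frame x over a
    fixed template dictionary W (learned offline), via multiplicative
    updates for the Euclidean cost, with an optional l1 term encouraging
    sparse activations. Illustrative stand-in for the paper's methods."""
    h = np.ones(W.shape[1])
    for _ in range(n_iter):
        h *= (W.T @ x) / (W.T @ (W @ h) + sparsity + 1e-12)
    return h

# With an orthonormal (identity) dictionary the activations recover the
# frame directly; real dictionaries would hold learned event templates.
x = np.array([0.0, 2.0, 0.0, 1.0])
h = decompose_frame(np.eye(4), x)
```

Since W is fixed, each frame costs only a handful of small matrix-vector products, which is what keeps per-frame latency low.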
Joint DOA and Multi-Pitch Estimation Using Block Sparsity
In this paper, we propose a novel method to estimate the fundamental frequencies and directions-of-arrival (DOA) of multi-pitch signals impinging on a sensor array. Formulating the estimation as a group sparse convex optimization problem, we use the alternating direction method of multipliers (ADMM) to estimate both temporal and spatial correlation of the array signal. By first jointly estimating both fundamental frequencies and times-of-arrival (TOAs) for each sensor and sound source, we then form a non-linear least squares estimate to obtain the DOAs. Numerical simulations indicate the preferable performance of the proposed estimator as compared to current state-of-the-art methods.
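The final step, going from per-sensor delays to a direction of arrival, can be illustrated for a far-field source on a linear array, where each TOA is an affine function of sin(theta). This least-squares sketch is a simplification of the paper's nonlinear estimator, with illustrative geometry:

```python
import numpy as np

def doa_from_toas(toas, positions, c=343.0):
    """Far-field DOA from per-sensor times of arrival on a linear array:
    tau_i = (x_i / c) * sin(theta) + t0, so (sin(theta), t0) follows from
    a linear least-squares fit; theta is returned in degrees."""
    A = np.column_stack([positions / c, np.ones_like(positions)])
    s, _ = np.linalg.lstsq(A, toas, rcond=None)[0]
    return np.degrees(np.arcsin(np.clip(s, -1.0, 1.0)))

# A source at 30 degrees with an unknown common onset time t0.
pos = np.array([0.0, 0.05, 0.10, 0.15])   # sensor positions (m)
toas = pos / 343.0 * np.sin(np.radians(30.0)) + 1e-3
theta = doa_from_toas(toas, pos)
```

Fitting the common onset t0 jointly with sin(theta) is what makes absolute source timing irrelevant; only the inter-sensor delay pattern matters.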