41 research outputs found

    Audio source separation for music in low-latency and high-latency scenarios

    This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, which are crucial steps in many separation methods. We then use and evaluate the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
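
    The abstract above names Tikhonov regularization as a spectrum decomposition method suited to low latency. The sketch below is one way such a decomposition could look, not the thesis's implementation: it assumes a fixed matrix of spectral templates (`basis`, hypothetical random data here) and a regularization weight `lam`, and solves min_g ||x - B g||^2 + lam ||g||^2 in closed form. Because the solution is linear in the frame x, the solver matrix can be precomputed once, so each incoming frame costs a single matrix-vector product.

```python
import numpy as np

def tikhonov_decomposer(basis, lam):
    """Precompute the linear map solving min_g ||x - basis @ g||^2 + lam * ||g||^2."""
    n_components = basis.shape[1]
    gram = basis.T @ basis + lam * np.eye(n_components)
    # Solve (B^T B + lam I) D = B^T for D, so that gains = D @ frame.
    return np.linalg.solve(gram, basis.T)

rng = np.random.default_rng(0)
basis = np.abs(rng.normal(size=(64, 8)))   # hypothetical spectral templates (bins x components)
true_gains = np.zeros(8)
true_gains[[1, 5]] = [1.0, 0.5]            # two active components
frame = basis @ true_gains                 # one synthetic magnitude spectrum

decomposer = tikhonov_decomposer(basis, lam=1e-6)  # done once, offline
gains = decomposer @ frame                         # done per frame
```

    Unlike iterative non-negative decompositions, this closed form does not enforce non-negative gains; that trade-off is an assumption of the sketch.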

    Singing Voice Separation from Monaural Recordings using Archetypal Analysis

    Singing voice separation aims to separate the singing voice signal from the background music signal in music recordings. This task is a cornerstone for numerous Music Information Retrieval (MIR) tasks, including automatic lyric recognition, singer identification, melody extraction and audio remixing. In this thesis, we investigate singing voice separation from monaural recordings by exploiting unsupervised machine learning methods. The motivation behind the employed methods is that the music accompaniment lies in a low-rank subspace because of its repeating patterns, whereas the singing voice is sparse within the song. To this end, we decompose audio spectrograms as a superposition of low-rank and sparse components, capturing the spectrograms of the background music and the singing voice respectively, using the Robust Principal Component Analysis algorithm. Furthermore, by considering the non-negative nature of the magnitude of audio spectrograms, we develop a variant of Archetypal Analysis with sparsity constraints, aiming to improve the separation. Both methods are evaluated on the MIR-1K dataset, which is designed especially for singing voice separation. Experimental evaluation confirms that both methods perform singing voice separation successfully and achieve a GNSDR above 3.0 dB.
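
    The low-rank plus sparse decomposition described above can be sketched with Robust PCA solved by an inexact augmented Lagrangian scheme. This is a minimal illustration on synthetic data, not the thesis's code: the function name `rpca`, the step-size schedule, and the default weight lam = 1/sqrt(max(m, n)) are assumptions taken from common practice. Applied to a magnitude spectrogram, `L` would play the role of the accompaniment and `S` the singing voice.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink each entry of x toward zero by tau (proximal operator of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca(M, lam=None, n_iter=500, tol=1e-7):
    """Split M into low-rank L and sparse S: min ||L||_* + lam * ||S||_1 s.t. L + S = M."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))
    norm_M = np.linalg.norm(M)
    mu = 1.25 / np.linalg.norm(M, 2)   # initial penalty, scaled by the spectral norm
    mu_max = mu * 1e7
    S = np.zeros_like(M)
    Y = np.zeros_like(M)               # Lagrange multipliers
    for _ in range(n_iter):
        # Low-rank update: singular value thresholding.
        U, sigma, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(sigma, 1.0 / mu)) @ Vt
        # Sparse update: entrywise soft thresholding.
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        mu = min(mu * 1.5, mu_max)
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 60))   # rank-2 "accompaniment"
sparse = np.zeros((60, 60))
outliers = rng.random((60, 60)) < 0.05                            # about 5% sparse "voice" entries
sparse[outliers] = 10.0 * rng.choice([-1.0, 1.0], size=outliers.sum())

L, S = rpca(low_rank + sparse)
rel_error = np.linalg.norm(L - low_rank) / np.linalg.norm(low_rank)
```

    In a separation pipeline, the two estimates are typically turned into a time-frequency mask and applied to the mixture before inverting the spectrogram; that step is omitted here.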

    Vocal Separation in Music Using Intrinsic Characteristics (고유 특성을 활용한 음악에서의 보컬 분리)

    Doctoral thesis, Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies, February 2018. 이교구.
    Singing voice separation (SVS) refers to the task, or the method, of decomposing a music signal into the singing voice and its accompanying instruments. It has various uses, from a preprocessing step that extracts the musical features carried by the target source to applications that use the separated source itself, such as vocal training. This thesis aims to discover the inherent properties of singing voice and accompaniment, and to apply them to advance state-of-the-art SVS algorithms. In particular, the thesis concentrates on the following separation setting, named `characteristics-based'. First, the music signal is assumed to be monaural, that is, a single-channel recording. This is a more difficult condition than multi-channel recording, since spatial information cannot be used in the separation procedure. Second, the thesis focuses on an unsupervised approach that does not use machine learning techniques to estimate the source models from training data. The models are instead derived from low-level characteristics and incorporated into the objective function. Finally, no external information such as lyrics, scores, or user guidance is provided. Unlike in blind source separation problems, however, the classes of the target sources, singing voice and accompaniment, are known in the SVS problem, which makes it possible to analyze their respective properties.
    Three characteristics are primarily discussed in this thesis. Continuity, in the spectral or temporal dimension, refers to the smoothness of the source along that dimension. Spectral continuity is related to timbre, while temporal continuity represents the stability of sounds. Low-rankness refers to how well structured the signal is and whether it can be represented as low-rank data, and sparsity represents how rarely the sounds in the signal occur in time and frequency. This thesis discusses two SVS approaches using these characteristics. The first is based on continuity and sparsity, and extends harmonic-percussive sound separation (HPSS). While the conventional algorithm separates the singing voice using a two-stage HPSS, the proposed one uses a single-stage procedure with an additional sparse residual term in the objective function. The second SVS approach is based on low-rankness and sparsity. Assuming that the accompaniment can be represented by a low-rank model whereas the singing voice has a sparse distribution, the conventional algorithm decomposes the sources using robust principal component analysis (RPCA). This thesis discusses generalizations and extensions of RPCA aimed specifically at SVS, including replacing the trace norm and l1-norm with the Schatten p-norm and lp-norm, scale compression, and the use of spectral distribution. The presented algorithms were evaluated on various datasets and challenges, and achieved better or comparable results compared with state-of-the-art algorithms.
    Contents:
    Chapter 1 Introduction: 1.1 Motivation; 1.2 Applications; 1.3 Definitions and keywords; 1.4 Evaluation criteria; 1.5 Topics of interest; 1.6 Outline of the thesis
    Chapter 2 Background: 2.1 Spectrogram-domain separation framework; 2.2 Approaches for singing voice separation (2.2.1 Characteristics-based approach; 2.2.2 Spatial approach; 2.2.3 Machine learning-based approach; 2.2.4 Informed approach); 2.3 Datasets and challenges (2.3.1 Datasets; 2.3.2 Challenges)
    Chapter 3 Characteristics of music sources: 3.1 Introduction; 3.2 Spectral/temporal continuity (3.2.1 Continuity of a spectrogram; 3.2.2 Continuity of musical sources); 3.3 Low-rankness (3.3.1 Low-rankness of a spectrogram; 3.3.2 Low-rankness of musical sources); 3.4 Sparsity (3.4.1 Sparsity of a spectrogram; 3.4.2 Sparsity of musical sources); 3.5 Experiments; 3.6 Summary
    Chapter 4 Singing voice separation using continuity and sparsity: 4.1 Introduction; 4.2 SVS using two-stage HPSS (4.2.1 Harmonic-percussive sound separation; 4.2.2 SVS using two-stage HPSS); 4.3 Proposed algorithm; 4.4 Experimental evaluation (4.4.1 MIR-1K dataset; 4.4.2 Beach Boys dataset; 4.4.3 iKala dataset in MIREX 2014); 4.5 Conclusion
    Chapter 5 Singing voice separation using low-rankness and sparsity: 5.1 Introduction; 5.2 SVS using robust principal component analysis (5.2.1 Robust principal component analysis; 5.2.2 Optimization for RPCA using augmented Lagrangian multiplier method; 5.2.3 SVS using RPCA); 5.3 SVS using generalized RPCA (5.3.1 Generalized RPCA using Schatten p- and lp-norm; 5.3.2 Comparison of pRPCA with robust matrix completion; 5.3.3 Optimization method of pRPCA; 5.3.4 Discussion of the normalization factor for λ; 5.3.5 Generalized RPCA using scale compression; 5.3.6 Experimental results); 5.4 SVS using RPCA and spectral distribution (5.4.1 RPCA with weighted l1-norm; 5.4.2 Proposed method: SVS using wRPCA; 5.4.3 Experimental results using DSD100 dataset; 5.4.4 Comparison with state-of-the-arts in SiSEC 2016; 5.4.5 Discussion); 5.5 Summary
    Chapter 6 Conclusion and Future Work: 6.1 Conclusion; 6.2 Contributions; 6.3 Future work (6.3.1 Discovering various characteristics for SVS; 6.3.2 Expanding to other SVS approaches; 6.3.3 Applying the characteristics for deep learning models)
    Bibliography
    Abstract in Korean (초록)

    Score-Informed Source Separation for Music Signals

    In recent years, the processing of audio recordings by exploiting additional musical knowledge has turned out to be a promising research direction. In particular, additional note information as specified by a musical score or a MIDI file has been employed to support various audio processing tasks such as source separation, audio parameterization, performance analysis, or instrument equalization. In this contribution, we provide an overview of approaches for score-informed source separation and illustrate their potential by discussing innovative applications and interfaces. Additionally, to illustrate some basic principles behind these approaches, we demonstrate how score information can be integrated into the well-known non-negative matrix factorization (NMF) framework. Finally, we compare this approach to advanced methods based on parametric models.
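
    One common way to integrate score information into NMF, sketched here under stated assumptions rather than as the authors' exact procedure, is to initialize the activation matrix with zeros wherever the score says a pitch is silent. Multiplicative updates can only rescale entries, so those zeros persist and each template is tied to the note events of one pitch. The function name `score_informed_nmf`, the Euclidean cost, and the synthetic piano-roll mask below are all illustrative choices.

```python
import numpy as np

def score_informed_nmf(V, W, score_mask, n_iter=200, eps=1e-12):
    """Factorize V ~ W @ H with H constrained to the support given by score_mask."""
    W = W.copy()
    H = score_mask.astype(float)       # zeros outside the score stay zero
    for _ in range(n_iter):
        # Multiplicative updates for the Euclidean cost ||V - W H||^2.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

rng = np.random.default_rng(0)
n_bins, n_frames, n_pitches = 40, 30, 3

# Hypothetical piano roll: which pitch is allowed to sound in which frame.
score_mask = np.zeros((n_pitches, n_frames), dtype=bool)
score_mask[0, 0:15] = True
score_mask[1, 10:25] = True
score_mask[2, 20:30] = True

# Synthetic spectrogram that respects the score.
true_templates = rng.random((n_bins, n_pitches))
true_activations = score_mask * rng.uniform(0.5, 1.5, size=score_mask.shape)
V = true_templates @ true_activations

W0 = rng.random((n_bins, n_pitches)) + 0.1
initial_error = np.linalg.norm(V - W0 @ score_mask.astype(float))
W, H = score_informed_nmf(V, W0, score_mask)
final_error = np.linalg.norm(V - W @ H)
```

    With real recordings the score is rarely perfectly aligned, so the mask would usually be widened by a tolerance around each note; that refinement is left out of this sketch.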

    Music Information Retrieval: An Inspirational Guide to Transfer from Related Disciplines

    The emerging field of Music Information Retrieval (MIR) has been influenced by neighboring domains in signal processing and machine learning, including automatic speech recognition, image processing and text information retrieval. In this contribution, we start with concrete examples of methodology transfer between speech and music processing, organized around the building blocks of pattern recognition: preprocessing, feature extraction, and classification/decoding. We then adopt a higher-level viewpoint when describing sources of mutual inspiration derived from text and image information retrieval. We conclude that dealing with the peculiarities of music in MIR research has contributed to advancing the state of the art in other fields, and that many future challenges in MIR are strikingly similar to those that other research areas have been facing.

    Principled methods for mixtures processing

    This document is my thesis for the habilitation à diriger des recherches, the French diploma required to fully supervise Ph.D. students. It summarizes the research I did over the last 15 years and also presents the short-term research directions and applications I want to investigate. Regarding my past research, I first describe my work on probabilistic audio modeling, including the separation of Gaussian and α-stable stochastic processes. Then, I mention my work on deep learning applied to audio, which rapidly turned into a large effort for community service. Finally, I present my contributions in machine learning, with some works on hardware compressed sensing and probabilistic generative models. My research programme involves a theoretical part that revolves around probabilistic machine learning, and an applied part that concerns the processing of time series arising in both audio and life sciences.

    Singing Voice Recognition for Music Information Retrieval

    This thesis proposes signal processing methods for the analysis of singing voice audio signals, with the objective of obtaining information about the identity of the singer and the lyrics content of the singing. Two main topics are presented: singer identification in monophonic and polyphonic music, and lyrics transcription and alignment. The information automatically extracted from the singing voice is meant to be used for applications such as music classification, sorting and organizing music databases, and music information retrieval. For singer identification, the thesis introduces methods from general audio classification and specific methods for dealing with the presence of accompaniment. The emphasis is on singer identification in polyphonic audio, where the singing voice is present along with musical accompaniment. The presence of instruments is detrimental to voice identification performance, and eliminating the effect of instrumental accompaniment is an important aspect of the problem. The study of singer identification is centered on the degradation of classification performance in the presence of instruments, and on separation of the vocal line to improve performance. For the study, monophonic singing was mixed with instrumental accompaniment at different signal-to-noise (singing-to-accompaniment) ratios, and classification was performed both on the polyphonic mixture and on the vocal line separated from the polyphonic mixture. Classification that includes the vocal separation step performs significantly better than classification of the polyphonic mixtures, but does not come close to the performance obtained on the monophonic singing itself. Nevertheless, the results show that classification of singing voices can be done robustly in polyphonic music when using source separation.
For the problem of lyrics transcription, the thesis introduces the general speech recognition framework and various adjustments that can be made before applying the methods to singing voice. The variability of phonation in singing poses a significant challenge to the speech recognition approach. The thesis proposes using phoneme models trained on speech data and adapted to singing voice characteristics for the recognition of phonemes and words from a singing voice signal. Language models and adaptation techniques are an important aspect of the recognition process. There are two different ways of recognizing the phonemes in the audio: one is alignment, where the true transcription is known and the phonemes have to be located; the other is recognition, where both the transcription and the location of the phonemes have to be found. Alignment is thus a simplified form of the recognition task. Alignment of textual lyrics to music audio is performed by aligning the phonetic transcription of the lyrics with the vocal line separated from the polyphonic mixture, using a collection of commercial songs. Word recognition is tested for transcription of lyrics from monophonic singing. The performance of the proposed system for automatic alignment of lyrics and audio is sufficient for facilitating applications such as automatic karaoke annotation or song browsing. The word recognition accuracy of lyrics transcription from singing is quite low, but it is shown to be useful in a query-by-singing application, for performing a textual search based on the words recognized from the query. When some key words in the query are recognized, the song can be reliably identified.

    Variational Bayesian Inference for Source Separation and Robust Feature Extraction

    We consider the task of separating and classifying individual sound sources mixed together. The main challenge is to achieve robust classification despite residual distortion of the separated source signals. A promising paradigm is to estimate the uncertainty about the separated source signals and to propagate it through the subsequent feature extraction and classification stages. We argue that variational Bayesian (VB) inference offers a mathematically rigorous way of deriving uncertainty estimators, which contrasts with state-of-the-art estimators based on heuristics or on maximum likelihood (ML) estimation. We propose a general VB source separation algorithm, which makes it possible to jointly exploit spatial and spectral models of the sources. This algorithm achieves 6% and 5% relative error reduction compared to ML uncertainty estimation on the CHiME noise-robust speaker identification and speech recognition benchmarks, respectively, and it opens the way for more complex VB approximations of uncertainty.
    In this article, we consider the problem of extracting the features of each source in a multi-source audio recording using a general source separation algorithm. The difficulty lies in estimating the uncertainty about the sources and propagating it to the features, so as to estimate them robustly despite separation errors. State-of-the-art methods estimate the uncertainty heuristically, whereas we propose to integrate over the parameters of the source separation algorithm. To this end, we describe a variational Bayesian inference method for estimating the posterior distribution of the sources, and we then compute the expectation of the features by propagating the uncertainty using moment matching. We evaluate the accuracy of the features in terms of mean squared error and conduct speaker recognition experiments to observe the resulting performance on a real problem. In both cases, the proposed method gives the best results.

    Signal Processing Methods for Music Synchronization, Audio Matching, and Source Separation

    The field of music information retrieval (MIR) aims at developing techniques and tools for organizing, understanding, and searching multimodal information in large music collections in a robust, efficient and intelligent manner. In this context, this thesis presents novel, content-based methods for music synchronization, audio matching, and source separation. In general, music synchronization denotes a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation. Here, the thesis presents three complementary synchronization approaches, which improve upon previous methods in terms of robustness, reliability, and accuracy. The first approach employs a late-fusion strategy based on multiple, conceptually different alignment techniques to identify those music passages that allow for reliable alignment results. The second approach is based on the idea of employing musical structure analysis methods in the context of synchronization to derive reliable synchronization results even in the presence of structural differences between the versions to be aligned. Finally, the third approach employs several complementary strategies for increasing the accuracy and time resolution of synchronization results. Given a short query audio clip, the goal of audio matching is to automatically retrieve all musically similar excerpts in different versions and arrangements of the same underlying piece of music. In this context, chroma-based audio features are a well-established tool as they possess a high degree of invariance to variations in timbre. This thesis describes a novel procedure for making chroma features even more robust to changes in timbre while keeping their discriminative power. Here, the idea is to identify and discard timbre-related information using techniques inspired by the well-known MFCC features, which are usually employed in speech processing. 
Given a monaural music recording, the goal of source separation is to extract musically meaningful sound sources corresponding, for example, to a melody, an instrument, or a drum track from the recording. To facilitate this complex task, one can exploit additional information provided by a musical score. Based on this idea, this thesis presents two novel, conceptually different approaches to source separation. Using score information provided by a given MIDI file, the first approach employs a parametric model to describe a given audio recording of a piece of music. The resulting model is then used to extract sound sources as specified by the score. As a computationally less demanding and easier-to-implement alternative, the second approach employs the additional score information to guide a decomposition based on non-negative matrix factorization (NMF).