11 research outputs found

    Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders

    Get PDF
    Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multi-resolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multi-resolution features for separating the singing-voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing

    The 2015 Signal Separation Evaluation Campaign

    Get PDF
    International audienceIn this paper, we report the 2015 community-based Signal Separation Evaluation Campaign (SiSEC 2015). This SiSEC consists of four speech and music datasets including two new datasets: " Professionally produced music recordings " and " Asynchronous recordings of speech mixtures ". Focusing on them, we overview the campaign specifications such as the tasks, datasets and evaluation criteria. We also summarize the performance of the submitted systems

    Multichannel Music Separation with Deep Neural Networks

    Get PDF
    International audienceThis article addresses the problem of multichannel music separation. We propose a framework where the source spectra are estimated using deep neural networks and combined with spatial covariance matrices to encode the source spatial characteristics. The parameters are estimated in an iterative expectation-maximization fashion and used to derive a multichannel Wiener filter. We evaluate the proposed framework for the task of music separation on a large dataset. Experimental results show that the method we describe performs consistently well in separating singing voice and other instruments from realistic musical mixtures

    Applying source separation to music

    Get PDF
    International audienceSeparation of existing audio into remixable elements is very useful to repurpose music audio. Applications include upmixing video soundtracks to surround sound (e.g. home theater 5.1 systems), facilitating music transcriptions, allowing better mashups and remixes for disk jockeys, and rebalancing sound levels on multiple instruments or voices recorded simultaneously to a single track. In this chapter, we provide an overview of the algorithms and approaches designed to address the challenges and opportunities in music. Where applicable, we also introduce commonalities and links to source separation for video soundtracks, since many musical scenarios involve video soundtracks (e.g. YouTube recordings of live concerts, movie sound tracks). While space prohibits describing every method in detail, we include detail on representative music‐specific algorithms and approaches not covered in other chapters. The intent is to give the reader a high‐level understanding of the workings of key exemplars of the source separation approaches applied in this domain

    Separación de voz de muestras multifuente

    Get PDF
    En los últimos años, el número de aplicaciones informáticas que procesa la voz humana, con diversos fines, ha aumentado sensiblemente. Cualquiera de esas aplicaciones, para su correcto funcionamiento, necesita disponer de la voz lo más intacta posible. En este proyecto se desarrolla un sistema utilizando redes neuronales, con el propósito de extraer la voz humana de cualquier audio, eliminado en el proceso los demás sonidos encontrados.In recent years, the number of computer applications which process the human voice has increased sensibly. Any of these applications need the cleanest voice possible for their proper functioning. In this project, a system is developed using neural networks with the purpose of extracting the human voice from any audio, erasing in the process any other sound found.Grado en Ingeniería Informátic

    고유 특성을 활용한 음악에서의 보컬 분리

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 융합과학기술대학원 융합과학부, 2018. 2. 이교구.보컬 분리란 음악 신호를 보컬 성분과 반주 성분으로 분리하는 일 또는 그 방법을 의미한다. 이러한 기술은 음악의 특정한 성분에 담겨 있는 정보를 추출하기 위한 전처리 과정에서부터, 보컬 연습과 같이 분리 음원 자체를 활용하는 등의 다양한 목적으로 사용될 수 있다. 본 논문의 목적은 보컬과 반주가 가지고 있는 고유한 특성에 대해 논의하고 그것을 활용하여 보컬 분리 알고리즘들을 개발하는 것이며, 특히 `특징 기반' 이라고 불리는 다음과 같은 상황에 대해 중점적으로 논의한다. 우선 분리 대상이 되는 음악 신호는 단채널로 제공된다고 가정하며, 이 경우 신호의 공간적 정보를 활용할 수 있는 다채널 환경에 비해 더욱 어려운 환경이라고 볼 수 있다. 또한 기계 학습 방법으로 데이터로부터 각 음원의 모델을 추정하는 방법을 배제하며, 대신 저차원의 특성들로부터 모델을 유도하여 이를 목표 함수에 반영하는 방법을 시도한다. 마지막으로, 가사, 악보, 사용자의 안내 등과 같은 외부의 정보 역시 제공되지 않는다고 가정한다. 그러나 보컬 분리의 경우 암묵 음원 분리 문제와는 달리 분리하고자 하는 음원이 각각 보컬과 반주에 해당한다는 최소한의 정보는 제공되므로 각각의 성질들에 대한 분석은 가능하다. 크게 세 종류의 특성이 본 논문에서 중점적으로 논의된다. 우선 연속성의 경우 주파수 또는 시간 측면으로 각각 논의될 수 있는데, 주파수축 연속성의 경우 소리의 음색적 특성을, 시간축 연속성은 소리가 안정적으로 지속되는 정도를 각각 나타낸다고 볼 수 있다. 또한, 저행렬계수 특성은 신호의 구조적 성질을 반영하며 해당 신호가 낮은 행렬계수를 가지는 형태로 표현될 수 있는지를 나타내며, 성김 특성은 신호의 분포 형태가 얼마나 성기거나 조밀한지를 나타낸다. 본 논문에서는 크게 두 가지의 보컬 분리 방법에 대해 논의한다. 첫 번째 방법은 연속성과 성김 특성에 기반을 두고 화성 악기-타악기 분리 방법 (harmonic-percussive sound separation, HPSS) 을 확장하는 방법이다. 기존의 방법이 두 번의 HPSS 과정을 통해 보컬을 분리하는 것에 비해 제안하는 방법은 성긴 잔여 성분을 추가해 한 번의 보컬 분리 과정만을 사용한다. 논의되는 다른 방법은 저행렬계수 특성과 성김 특성을 활용하는 것으로, 반주가 저행렬계수 모델로 표현될 수 있는 반면 보컬은 성긴 분포를 가진다는 가정에 기반을 둔다. 이러한 성분들을 분리하기 위해 강인한 주성분 분석 (robust principal component analysis, RPCA) 을 이용하는 방법이 대표적이다. 본 논문에서는 보컬 분리 성능에 초점을 두고 RPCA 알고리즘을 일반화하거나 확장하는 방식에 대해 논의하며, 트레이스 노름과 l1 노름을 각각 샤텐 p 노름과 lp 노름으로 대체하는 방법, 스케일 압축 방법, 주파수 분포 특성을 반영하는 방법 등을 포함한다. 제안하는 알고리즘들은 다양한 데이터셋과 대회에서 평가되었으며 최신의 보컬 분리 알고리즘들보다 더 우수하거나 비슷한 결과를 보였다.Singing voice separation (SVS) refers to the task or the method of decomposing music signal into singing voice and its accompanying instruments. It has various uses, from the preprocessing step, to extract the musical features implied in the target source, to applications for itself such as vocal training. This thesis aims to discover the common properties of singing voice and accompaniment, and apply it to advance the state-of-the-art SVS algorithms. In particular, the separation approach as follows, which is named `characteristics-based,' is concentrated in this thesis. First, the music signal is assumed to be provided in monaural, or as a single-channel recording. It is more difficult condition compared to multiple-channel recording since spatial information cannot be applied in the separation procedure. This thesis also focuses on unsupervised approach, that does not use machine learning technique to estimate the source model from the training data. The models are instead derived based on the low-level characteristics and applied to the objective function. Finally, no external information such as lyrics, score, or user guide is provided. Unlike blind source separation problems, however, the classes of the target sources, singing voice and accompaniment, are known in SVS problem, and it allows to estimate those respective properties. Three different characteristics are primarily discussed in this thesis. Continuity, in the spectral or temporal dimension, refers the smoothness of the source in the particular aspect. The spectral continuity is related with the timbre, while the temporal continuity represents the stability of sounds. On the other hand, the low-rankness refers how the signal is well-structured and can be represented as a low-rank data, and the sparsity represents how rarely the sounds in signals occur in time and frequency. This thesis discusses two SVS approaches using above characteristics. First one is based on the continuity and sparsity, which extends the harmonic-percussive sound separation (HPSS). While the conventional algorithm separates singing voice by using a two-stage HPSS, the proposed one has a single stage procedure but with an additional sparse residual term in the objective function. Another SVS approach is based on the low-rankness and sparsity. Assuming that accompaniment can be represented as a low-rank model, whereas singing voice has a sparse distribution, conventional algorithm decomposes the sources by using robust principal component analysis (RPCA). In this thesis, generalization or extension of RPCA especially for SVS is discussed, including the use of Schatten p-/lp-norm, scale compression, and spectral distribution. The presented algorithms are evaluated using various datasets and challenges and achieved the better comparable results compared to the state-of-the-art algorithms.Chapter 1 Introduction 1 1.1 Motivation 4 1.2 Applications 5 1.3 Definitions and keywords 6 1.4 Evaluation criteria 7 1.5 Topics of interest 11 1.6 Outline of the thesis 13 Chapter 2 Background 15 2.1 Spectrogram-domain separation framework 15 2.2 Approaches for singing voice separation 19 2.2.1 Characteristics-based approach 20 2.2.2 Spatial approach 21 2.2.3 Machine learning-based approach 22 2.2.4 informed approach 23 2.3 Datasets and challenges 25 2.3.1 Datasets 25 2.3.2 Challenges 26 Chapter 3 Characteristics of music sources 28 3.1 Introduction 28 3.2 Spectral/temporal continuity 29 3.2.1 Continuity of a spectrogram 29 3.2.2 Continuity of musical sources 30 3.3 Low-rankness 31 3.3.1 Low-rankness of a spectrogram 31 3.3.2 Low-rankness of musical sources 33 3.4 Sparsity 34 3.4.1 Sparsity of a spectrogram 34 3.4.2 Sparsity of musical sources 36 3.5 Experiments 38 3.6 Summary 39 Chapter 4 Singing voice separation using continuity and sparsity 43 4.1 Introduction 43 4.2 SVS using two-stage HPSS 45 4.2.1 Harmonic-percussive sound separation 45 4.2.2 SVS using two-stage HPSS 46 4.3 Proposed algorithm 48 4.4 Experimental evaluation 52 4.4.1 MIR-1k Dataset 52 4.4.2 Beach boys Dataset 55 4.4.3 iKala dataset in MIREX 2014 56 4.5 Conclusion 58 Chapter 5 Singing voice separation using low-rankness and sparsity 61 5.1 Introduction 61 5.2 SVS using robust principal component analysis 63 5.2.1 Robust principal component analysis 63 5.2.2 Optimization for RPCA using augmented Lagrangian multiplier method 63 5.2.3 SVS using RPCA 65 5.3 SVS using generalized RPCA 67 5.3.1 Generalized RPCA using Schatten p- and lp-norm 67 5.3.2 Comparison of pRPCA with robust matrix completion 68 5.3.3 Optimization method of pRPCA 69 5.3.4 Discussion of the normalization factor for λ 69 5.3.5 Generalized RPCA using scale compression 71 5.3.6 Experimental results 72 5.4 SVS using RPCA and spectral distribution 73 5.4.1 RPCA with weighted l1-norm 73 5.4.2 Proposed method: SVS using wRPCA 74 5.4.3 Experimental results using DSD100 dataset 78 5.4.4 Comparison with state-of-the-arts in SiSEC 2016 79 5.4.5 Discussion 85 5.5 Summary 86 Chapter 6 Conclusion and Future Work 88 6.1 Conclusion 88 6.2 Contributions 89 6.3 Future work 91 6.3.1 Discovering various characteristics for SVS 91 6.3.2 Expanding to other SVS approaches 92 6.3.3 Applying the characteristics for deep learning models 92 Bibliography 94 초 록 110Docto

    Pitch-Informed Solo and Accompaniment Separation

    Get PDF
    Das Thema dieser Dissertation ist die Entwicklung eines Systems zur Tonhöhen-informierten Quellentrennung von Musiksignalen in Soloinstrument und Begleitung. Dieses ist geeignet, die dominanten Instrumente aus einem Musikstück zu isolieren, unabhängig von der Art des Instruments, der Begleitung und Stilrichtung. Dabei werden nur einstimmige Melodieinstrumente in Betracht gezogen. Die Musikaufnahmen liegen monaural vor, es kann also keine zusätzliche Information aus der Verteilung der Instrumente im Stereo-Panorama gewonnen werden. Die entwickelte Methode nutzt Tonhöhen-Information als Basis für eine sinusoidale Modellierung der spektralen Eigenschaften des Soloinstruments aus dem Musikmischsignal. Anstatt die spektralen Informationen pro Frame zu bestimmen, werden in der vorgeschlagenen Methode Tonobjekte für die Separation genutzt. Tonobjekt-basierte Verarbeitung ermöglicht es, zusätzlich die Notenanfänge zu verfeinern, transiente Artefakte zu reduzieren, gemeinsame Amplitudenmodulation (Common Amplitude Modulation CAM) einzubeziehen und besser nichtharmonische Elemente der Töne abzuschätzen. Der vorgestellte Algorithmus zur Quellentrennung von Soloinstrument und Begleitung ermöglicht eine Echtzeitverarbeitung und ist somit relevant für den praktischen Einsatz. Ein Experiment zur besseren Modellierung der Zusammenhänge zwischen Magnitude, Phase und Feinfrequenz von isolierten Instrumententönen wurde durchgeführt. Als Ergebnis konnte die Kontinuität der zeitlichen Einhüllenden, die Inharmonizität bestimmter Musikinstrumente und die Auswertung des Phasenfortschritts für die vorgestellte Methode ausgenutzt werden. Zusätzlich wurde ein Algorithmus für die Quellentrennung in perkussive und harmonische Signalanteile auf Basis des Phasenfortschritts entwickelt. Dieser erreicht ein verbesserte perzeptuelle Qualität der harmonischen und perkussiven Signale gegenüber vergleichbaren Methoden nach dem Stand der Technik. Die vorgestellte Methode zur Klangquellentrennung in Soloinstrument und Begleitung wurde zu den Evaluationskampagnen SiSEC 2011 und SiSEC 2013 eingereicht. Dort konnten vergleichbare Ergebnisse im Hinblick auf perzeptuelle Bewertungsmaße erzielt werden. Die Qualität eines Referenzalgorithmus im Hinblick auf den in dieser Dissertation beschriebenen Instrumentaldatensatz übertroffen werden. Als ein Anwendungsszenario für die Klangquellentrennung in Solo und Begleitung wurde ein Hörtest durchgeführt, der die Qualitätsanforderungen an Quellentrennung im Kontext von Musiklernsoftware bewerten sollte. Die Ergebnisse dieses Hörtests zeigen, dass die Solo- und Begleitspur gemäß unterschiedlicher Qualitätskriterien getrennt werden sollten. Die Musiklernsoftware Songs2See integriert die vorgestellte Klangquellentrennung bereits in einer kommerziell erhältlichen Anwendung.This thesis addresses the development of a system for pitch-informed solo and accompaniment separation capable of separating main instruments from music accompaniment regardless of the musical genre of the track, or type of music accompaniment. For the solo instrument, only pitched monophonic instruments were considered in a single-channel scenario where no panning or spatial location information is available. In the proposed method, pitch information is used as an initial stage of a sinusoidal modeling approach that attempts to estimate the spectral information of the solo instrument from a given audio mixture. Instead of estimating the solo instrument on a frame by frame basis, the proposed method gathers information of tone objects to perform separation. Tone-based processing allowed the inclusion of novel processing stages for attack refinement, transient interference reduction, common amplitude modulation (CAM) of tone objects, and for better estimation of non-harmonic elements that can occur in musical instrument tones. The proposed solo and accompaniment algorithm is an efficient method suitable for real-world applications. A study was conducted to better model magnitude, frequency, and phase of isolated musical instrument tones. As a result of this study, temporal envelope smoothness, inharmonicty of musical instruments, and phase expectation were exploited in the proposed separation method. Additionally, an algorithm for harmonic/percussive separation based on phase expectation was proposed. The algorithm shows improved perceptual quality with respect to state-of-the-art methods for harmonic/percussive separation. The proposed solo and accompaniment method obtained perceptual quality scores comparable to other state-of-the-art algorithms under the SiSEC 2011 and SiSEC 2013 campaigns, and outperformed the comparison algorithm on the instrumental dataset described in this thesis.As a use-case of solo and accompaniment separation, a listening test procedure was conducted to assess separation quality requirements in the context of music education. Results from the listening test showed that solo and accompaniment tracks should be optimized differently to suit quality requirements of music education. The Songs2See application was presented as commercial music learning software which includes the proposed solo and accompaniment separation method
    corecore