
    ๊ณ ์œ  ํŠน์„ฑ์„ ํ™œ์šฉํ•œ ์Œ์•…์—์„œ์˜ ๋ณด์ปฌ ๋ถ„๋ฆฌ

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€, 2018. 2. ์ด๊ต๊ตฌ.๋ณด์ปฌ ๋ถ„๋ฆฌ๋ž€ ์Œ์•… ์‹ ํ˜ธ๋ฅผ ๋ณด์ปฌ ์„ฑ๋ถ„๊ณผ ๋ฐ˜์ฃผ ์„ฑ๋ถ„์œผ๋กœ ๋ถ„๋ฆฌํ•˜๋Š” ์ผ ๋˜๋Š” ๊ทธ ๋ฐฉ๋ฒ•์„ ์˜๋ฏธํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์€ ์Œ์•…์˜ ํŠน์ •ํ•œ ์„ฑ๋ถ„์— ๋‹ด๊ฒจ ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•œ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์—์„œ๋ถ€ํ„ฐ, ๋ณด์ปฌ ์—ฐ์Šต๊ณผ ๊ฐ™์ด ๋ถ„๋ฆฌ ์Œ์› ์ž์ฒด๋ฅผ ํ™œ์šฉํ•˜๋Š” ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ๋ชฉ์ ์œผ๋กœ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์˜ ๋ชฉ์ ์€ ๋ณด์ปฌ๊ณผ ๋ฐ˜์ฃผ๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ณ ์œ ํ•œ ํŠน์„ฑ์— ๋Œ€ํ•ด ๋…ผ์˜ํ•˜๊ณ  ๊ทธ๊ฒƒ์„ ํ™œ์šฉํ•˜์—ฌ ๋ณด์ปฌ ๋ถ„๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์„ ๊ฐœ๋ฐœํ•˜๋Š” ๊ฒƒ์ด๋ฉฐ, ํŠนํžˆ `ํŠน์ง• ๊ธฐ๋ฐ˜' ์ด๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ƒํ™ฉ์— ๋Œ€ํ•ด ์ค‘์ ์ ์œผ๋กœ ๋…ผ์˜ํ•œ๋‹ค. ์šฐ์„  ๋ถ„๋ฆฌ ๋Œ€์ƒ์ด ๋˜๋Š” ์Œ์•… ์‹ ํ˜ธ๋Š” ๋‹จ์ฑ„๋„๋กœ ์ œ๊ณต๋œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉฐ, ์ด ๊ฒฝ์šฐ ์‹ ํ˜ธ์˜ ๊ณต๊ฐ„์  ์ •๋ณด๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์ฑ„๋„ ํ™˜๊ฒฝ์— ๋น„ํ•ด ๋”์šฑ ์–ด๋ ค์šด ํ™˜๊ฒฝ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ๊ธฐ๊ณ„ ํ•™์Šต ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ๊ฐ ์Œ์›์˜ ๋ชจ๋ธ์„ ์ถ”์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์ œํ•˜๋ฉฐ, ๋Œ€์‹  ์ €์ฐจ์›์˜ ํŠน์„ฑ๋“ค๋กœ๋ถ€ํ„ฐ ๋ชจ๋ธ์„ ์œ ๋„ํ•˜์—ฌ ์ด๋ฅผ ๋ชฉํ‘œ ํ•จ์ˆ˜์— ๋ฐ˜์˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‹œ๋„ํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ๊ฐ€์‚ฌ, ์•…๋ณด, ์‚ฌ์šฉ์ž์˜ ์•ˆ๋‚ด ๋“ฑ๊ณผ ๊ฐ™์€ ์™ธ๋ถ€์˜ ์ •๋ณด ์—ญ์‹œ ์ œ๊ณต๋˜์ง€ ์•Š๋Š”๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ณด์ปฌ ๋ถ„๋ฆฌ์˜ ๊ฒฝ์šฐ ์•”๋ฌต ์Œ์› ๋ถ„๋ฆฌ ๋ฌธ์ œ์™€๋Š” ๋‹ฌ๋ฆฌ ๋ถ„๋ฆฌํ•˜๊ณ ์ž ํ•˜๋Š” ์Œ์›์ด ๊ฐ๊ฐ ๋ณด์ปฌ๊ณผ ๋ฐ˜์ฃผ์— ํ•ด๋‹นํ•œ๋‹ค๋Š” ์ตœ์†Œํ•œ์˜ ์ •๋ณด๋Š” ์ œ๊ณต๋˜๋ฏ€๋กœ ๊ฐ๊ฐ์˜ ์„ฑ์งˆ๋“ค์— ๋Œ€ํ•œ ๋ถ„์„์€ ๊ฐ€๋Šฅํ•˜๋‹ค. ํฌ๊ฒŒ ์„ธ ์ข…๋ฅ˜์˜ ํŠน์„ฑ์ด ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ค‘์ ์ ์œผ๋กœ ๋…ผ์˜๋œ๋‹ค. ์šฐ์„  ์—ฐ์†์„ฑ์˜ ๊ฒฝ์šฐ ์ฃผํŒŒ์ˆ˜ ๋˜๋Š” ์‹œ๊ฐ„ ์ธก๋ฉด์œผ๋กœ ๊ฐ๊ฐ ๋…ผ์˜๋  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ฃผํŒŒ์ˆ˜์ถ• ์—ฐ์†์„ฑ์˜ ๊ฒฝ์šฐ ์†Œ๋ฆฌ์˜ ์Œ์ƒ‰์  ํŠน์„ฑ์„, ์‹œ๊ฐ„์ถ• ์—ฐ์†์„ฑ์€ ์†Œ๋ฆฌ๊ฐ€ ์•ˆ์ •์ ์œผ๋กœ ์ง€์†๋˜๋Š” ์ •๋„๋ฅผ ๊ฐ๊ฐ ๋‚˜ํƒ€๋‚ธ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ, ์ €ํ–‰๋ ฌ๊ณ„์ˆ˜ ํŠน์„ฑ์€ ์‹ ํ˜ธ์˜ ๊ตฌ์กฐ์  ์„ฑ์งˆ์„ ๋ฐ˜์˜ํ•˜๋ฉฐ ํ•ด๋‹น ์‹ ํ˜ธ๊ฐ€ ๋‚ฎ์€ ํ–‰๋ ฌ๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง€๋Š” ํ˜•ํƒœ๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ์„ฑ๊น€ ํŠน์„ฑ์€ ์‹ ํ˜ธ์˜ ๋ถ„ํฌ ํ˜•ํƒœ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์„ฑ๊ธฐ๊ฑฐ๋‚˜ ์กฐ๋ฐ€ํ•œ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€์˜ ๋ณด์ปฌ ๋ถ„๋ฆฌ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ๋…ผ์˜ํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ ์—ฐ์†์„ฑ๊ณผ ์„ฑ๊น€ ํŠน์„ฑ์— ๊ธฐ๋ฐ˜์„ ๋‘๊ณ  ํ™”์„ฑ ์•…๊ธฐ-ํƒ€์•…๊ธฐ ๋ถ„๋ฆฌ ๋ฐฉ๋ฒ• (harmonic-percussive sound separation, HPSS) ์„ ํ™•์žฅํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•์ด ๋‘ ๋ฒˆ์˜ HPSS ๊ณผ์ •์„ ํ†ตํ•ด ๋ณด์ปฌ์„ ๋ถ„๋ฆฌํ•˜๋Š” ๊ฒƒ์— ๋น„ํ•ด ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์„ฑ๊ธด ์ž”์—ฌ ์„ฑ๋ถ„์„ ์ถ”๊ฐ€ํ•ด ํ•œ ๋ฒˆ์˜ ๋ณด์ปฌ ๋ถ„๋ฆฌ ๊ณผ์ •๋งŒ์„ ์‚ฌ์šฉํ•œ๋‹ค. ๋…ผ์˜๋˜๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์€ ์ €ํ–‰๋ ฌ๊ณ„์ˆ˜ ํŠน์„ฑ๊ณผ ์„ฑ๊น€ ํŠน์„ฑ์„ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์œผ๋กœ, ๋ฐ˜์ฃผ๊ฐ€ ์ €ํ–‰๋ ฌ๊ณ„์ˆ˜ ๋ชจ๋ธ๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๋Š” ๋ฐ˜๋ฉด ๋ณด์ปฌ์€ ์„ฑ๊ธด ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง„๋‹ค๋Š” ๊ฐ€์ •์— ๊ธฐ๋ฐ˜์„ ๋‘”๋‹ค. ์ด๋Ÿฌํ•œ ์„ฑ๋ถ„๋“ค์„ ๋ถ„๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ•์ธํ•œ ์ฃผ์„ฑ๋ถ„ ๋ถ„์„ (robust principal component analysis, RPCA) ์„ ์ด์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๋Œ€ํ‘œ์ ์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ณด์ปฌ ๋ถ„๋ฆฌ ์„ฑ๋Šฅ์— ์ดˆ์ ์„ ๋‘๊ณ  RPCA ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ผ๋ฐ˜ํ™”ํ•˜๊ฑฐ๋‚˜ ํ™•์žฅํ•˜๋Š” ๋ฐฉ์‹์— ๋Œ€ํ•ด ๋…ผ์˜ํ•˜๋ฉฐ, ํŠธ๋ ˆ์ด์Šค ๋…ธ๋ฆ„๊ณผ l1 ๋…ธ๋ฆ„์„ ๊ฐ๊ฐ ์ƒคํ… p ๋…ธ๋ฆ„๊ณผ lp ๋…ธ๋ฆ„์œผ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋ฐฉ๋ฒ•, ์Šค์ผ€์ผ ์••์ถ• ๋ฐฉ๋ฒ•, ์ฃผํŒŒ์ˆ˜ ๋ถ„ํฌ ํŠน์„ฑ์„ ๋ฐ˜์˜ํ•˜๋Š” ๋ฐฉ๋ฒ• ๋“ฑ์„ ํฌํ•จํ•œ๋‹ค. 
์ œ์•ˆํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์€ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹๊ณผ ๋Œ€ํšŒ์—์„œ ํ‰๊ฐ€๋˜์—ˆ์œผ๋ฉฐ ์ตœ์‹ ์˜ ๋ณด์ปฌ ๋ถ„๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค๋ณด๋‹ค ๋” ์šฐ์ˆ˜ํ•˜๊ฑฐ๋‚˜ ๋น„์Šทํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค.Singing voice separation (SVS) refers to the task or the method of decomposing music signal into singing voice and its accompanying instruments. It has various uses, from the preprocessing step, to extract the musical features implied in the target source, to applications for itself such as vocal training. This thesis aims to discover the common properties of singing voice and accompaniment, and apply it to advance the state-of-the-art SVS algorithms. In particular, the separation approach as follows, which is named `characteristics-based,' is concentrated in this thesis. First, the music signal is assumed to be provided in monaural, or as a single-channel recording. It is more difficult condition compared to multiple-channel recording since spatial information cannot be applied in the separation procedure. This thesis also focuses on unsupervised approach, that does not use machine learning technique to estimate the source model from the training data. The models are instead derived based on the low-level characteristics and applied to the objective function. Finally, no external information such as lyrics, score, or user guide is provided. Unlike blind source separation problems, however, the classes of the target sources, singing voice and accompaniment, are known in SVS problem, and it allows to estimate those respective properties. Three different characteristics are primarily discussed in this thesis. Continuity, in the spectral or temporal dimension, refers the smoothness of the source in the particular aspect. The spectral continuity is related with the timbre, while the temporal continuity represents the stability of sounds. On the other hand, the low-rankness refers how the signal is well-structured and can be represented as a low-rank data, and the sparsity represents how rarely the sounds in signals occur in time and frequency. This thesis discusses two SVS approaches using above characteristics. First one is based on the continuity and sparsity, which extends the harmonic-percussive sound separation (HPSS). While the conventional algorithm separates singing voice by using a two-stage HPSS, the proposed one has a single stage procedure but with an additional sparse residual term in the objective function. Another SVS approach is based on the low-rankness and sparsity. Assuming that accompaniment can be represented as a low-rank model, whereas singing voice has a sparse distribution, conventional algorithm decomposes the sources by using robust principal component analysis (RPCA). In this thesis, generalization or extension of RPCA especially for SVS is discussed, including the use of Schatten p-/lp-norm, scale compression, and spectral distribution. 
The second SVS approach is based on low-rankness and sparsity. Assuming that the accompaniment can be represented as a low-rank model whereas the singing voice has a sparse distribution, the conventional algorithm decomposes the sources using robust principal component analysis (RPCA). This thesis discusses generalizations and extensions of RPCA specifically for SVS, including the use of the Schatten p-norm and lp-norm, scale compression, and the spectral distribution. The presented algorithms are evaluated on various datasets and challenges, and achieve results better than or comparable to state-of-the-art algorithms.

Table of contents

Chapter 1 Introduction
  1.1 Motivation
  1.2 Applications
  1.3 Definitions and keywords
  1.4 Evaluation criteria
  1.5 Topics of interest
  1.6 Outline of the thesis
Chapter 2 Background
  2.1 Spectrogram-domain separation framework
  2.2 Approaches for singing voice separation
    2.2.1 Characteristics-based approach
    2.2.2 Spatial approach
    2.2.3 Machine learning-based approach
    2.2.4 Informed approach
  2.3 Datasets and challenges
    2.3.1 Datasets
    2.3.2 Challenges
Chapter 3 Characteristics of music sources
  3.1 Introduction
  3.2 Spectral/temporal continuity
    3.2.1 Continuity of a spectrogram
    3.2.2 Continuity of musical sources
  3.3 Low-rankness
    3.3.1 Low-rankness of a spectrogram
    3.3.2 Low-rankness of musical sources
  3.4 Sparsity
    3.4.1 Sparsity of a spectrogram
    3.4.2 Sparsity of musical sources
  3.5 Experiments
  3.6 Summary
Chapter 4 Singing voice separation using continuity and sparsity
  4.1 Introduction
  4.2 SVS using two-stage HPSS
    4.2.1 Harmonic-percussive sound separation
    4.2.2 SVS using two-stage HPSS
  4.3 Proposed algorithm
  4.4 Experimental evaluation
    4.4.1 MIR-1K dataset
    4.4.2 Beach Boys dataset
    4.4.3 iKala dataset in MIREX 2014
  4.5 Conclusion
Chapter 5 Singing voice separation using low-rankness and sparsity
  5.1 Introduction
  5.2 SVS using robust principal component analysis
    5.2.1 Robust principal component analysis
    5.2.2 Optimization for RPCA using the augmented Lagrangian multiplier method
    5.2.3 SVS using RPCA
  5.3 SVS using generalized RPCA
    5.3.1 Generalized RPCA using Schatten p- and lp-norms
    5.3.2 Comparison of pRPCA with robust matrix completion
    5.3.3 Optimization method of pRPCA
    5.3.4 Discussion of the normalization factor for λ
    5.3.5 Generalized RPCA using scale compression
    5.3.6 Experimental results
  5.4 SVS using RPCA and spectral distribution
    5.4.1 RPCA with weighted l1-norm
    5.4.2 Proposed method: SVS using wRPCA
    5.4.3 Experimental results using the DSD100 dataset
    5.4.4 Comparison with the state of the art in SiSEC 2016
    5.4.5 Discussion
  5.5 Summary
Chapter 6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Contributions
  6.3 Future work
    6.3.1 Discovering various characteristics for SVS
    6.3.2 Expanding to other SVS approaches
    6.3.3 Applying the characteristics to deep learning models
Bibliography
Abstract (in Korean)
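For reference, the conventional RPCA-based separation of Section 5.2 decomposes the magnitude spectrogram M into a low-rank accompaniment term L and a sparse vocal term S. The first objective below is the standard RPCA formulation with the usual choice of λ; the two variants after it are plausible forms of the generalizations named in the abstract (Schatten p-/lp-norm, and a weighted l1-norm reflecting the spectral distribution) and may differ in detail from the thesis's exact formulations:

    \min_{L,S}\ \|L\|_{*} + \lambda \|S\|_{1}
        \quad \text{s.t.}\quad M = L + S,
        \qquad \lambda = 1/\sqrt{\max(m,n)}

    \min_{L,S}\ \|L\|_{S_p}^{p} + \lambda \|S\|_{p}^{p}
        \quad \text{s.t.}\quad M = L + S
        \qquad \text{(Schatten $p$-norm / $\ell_p$-norm, cf.\ pRPCA)}

    \min_{L,S}\ \|L\|_{*} + \lambda \|W \odot S\|_{1}
        \quad \text{s.t.}\quad M = L + S
        \qquad \text{(weighted $\ell_1$-norm, cf.\ wRPCA)}

A minimal sketch of the conventional pipeline (RPCA solved with the inexact augmented Lagrange multiplier method of Section 5.2.2, followed by time-frequency masking) could look like the following. A soft ratio mask is used here for simplicity, whereas binary masking is also common, and the file names and STFT parameters are illustrative assumptions:

    # Illustrative sketch of conventional RPCA-based singing voice separation.
    import numpy as np
    import librosa
    import soundfile as sf

    def rpca_ialm(M, lam=None, tol=1e-7, max_iter=500):
        """Inexact augmented Lagrange multiplier method for RPCA: M ~ L (low-rank) + S (sparse)."""
        m, n = M.shape
        if lam is None:
            lam = 1.0 / np.sqrt(max(m, n))
        norm_two = np.linalg.norm(M, 2)
        Y = M / max(norm_two, np.abs(M).max() / lam)   # dual variable initialization
        L = np.zeros_like(M)
        S = np.zeros_like(M)
        mu = 1.25 / norm_two
        mu_bar = mu * 1e7
        rho = 1.5
        for _ in range(max_iter):
            # Low-rank update: singular-value thresholding
            U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
            sig = np.maximum(sig - 1.0 / mu, 0.0)
            L = (U * sig) @ Vt
            # Sparse update: elementwise soft thresholding
            T = M - L + Y / mu
            S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
            R = M - L - S
            Y = Y + mu * R
            mu = min(mu * rho, mu_bar)
            if np.linalg.norm(R, "fro") / np.linalg.norm(M, "fro") < tol:
                break
        return L, S

    y, sr = librosa.load("mixture.wav", sr=None, mono=True)   # hypothetical input file
    D = librosa.stft(y, n_fft=1024, hop_length=256)
    L, S = rpca_ialm(np.abs(D))

    # Time-frequency masking: sparse part -> vocals, low-rank part -> accompaniment.
    mask = np.abs(S) / (np.abs(S) + np.abs(L) + 1e-12)
    vocals = librosa.istft(mask * D, hop_length=256, length=len(y))
    accomp = librosa.istft((1.0 - mask) * D, hop_length=256, length=len(y))
    sf.write("vocals_rpca.wav", vocals, sr)
    sf.write("accompaniment_rpca.wav", accomp, sr)

The generalized and weighted variants studied in the thesis would modify the two shrinkage steps above (the singular-value and elementwise thresholding) according to the chosen norms and weights.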