7 research outputs found

    A Hybrid Approach with Multi-channel I-Vectors and Convolutional Neural Networks for Acoustic Scene Classification

    In Acoustic Scene Classification (ASC), two major approaches have been followed. One utilizes engineered features such as mel-frequency cepstral coefficients (MFCCs), while the other uses learned features that are the outcome of an optimization algorithm. I-vectors are the result of a modeling technique that usually takes engineered features as input. It has been shown that standard MFCCs extracted from monaural audio signals lead to i-vectors that exhibit poor performance, especially on indoor acoustic scenes. At the same time, Convolutional Neural Networks (CNNs) are well known for their ability to learn features by optimizing their filters. They have been applied to ASC and have shown promising results. In this paper, we first propose a novel multi-channel i-vector extraction and scoring scheme for ASC, improving performance on both indoor and outdoor scenes. Second, we propose a CNN architecture that achieves promising ASC results. Further, we show that i-vectors and CNNs capture complementary information from acoustic scenes. Finally, we propose a hybrid system for ASC that combines multi-channel i-vectors and CNNs through score fusion. Using our method, we participated in the ASC task of the DCASE-2016 challenge. Our hybrid approach achieved 1st rank among 49 submissions, substantially improving on the previous state of the art.
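The score-fusion step can be sketched as a simple late fusion of per-class scores from the two subsystems. The z-normalization and equal weighting below are illustrative assumptions, not the exact scheme used in the paper:

```python
import numpy as np

def fuse_scores(ivector_scores, cnn_scores, alpha=0.5):
    """Late fusion of per-class scores from two ASC subsystems.

    alpha weights the i-vector subsystem against the CNN; the
    z-normalization makes the two score scales comparable.
    """
    def z_norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-8)

    fused = alpha * z_norm(ivector_scores) + (1 - alpha) * z_norm(cnn_scores)
    return int(np.argmax(fused))  # index of the predicted scene class

# Hypothetical per-class scores for a 3-class scene problem:
print(fuse_scores([0.2, 0.5, 0.3], [0.1, 0.2, 0.7]))  # -> 2
```

Because the two systems capture complementary information, the fused scores can rank the correct class first even when each subsystem alone is uncertain.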

    Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

    Singing voice detection is the task of identifying whether each frame of a recording contains singing voice. It has been one of the main components in music information retrieval (MIR) and is applicable to melody extraction, artist recognition, and music discovery in popular music. Although several methods have been proposed, a more robust and more complete system is desired to improve detection performance. In this paper, our motivation is to provide an extensive comparison of the different stages of singing voice detection. Based on this analysis, a novel method is proposed to build a more efficient singing voice detection system. The proposed system has three main parts. The first is a singing voice separation pre-process that extracts the vocal from the music; several singing voice separation methods were compared to decide which one to integrate into the detection system. The second is a deep neural network based classifier that labels the given frames; different deep models for classification were also compared. The last is a post-process that filters out anomalous frames in the classifier's predictions, for which a median filter and a Hidden Markov Model (HMM) based filter were compared. Through this step-by-step module extension, the different methods were compared and analyzed. Finally, classification performance on two public datasets indicates that the proposed approach, based on the Long-term Recurrent Convolutional Network (LRCN) model, is a promising alternative.
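The median-filter post-processing stage can be sketched in a few lines: it replaces each frame label with the median of its neighbourhood, so isolated anomalous frames are flipped to match their context. The kernel size is an illustrative choice, not a value from the paper:

```python
import numpy as np

def median_smooth(frame_preds, kernel=5):
    """Median-filter a binary vocal/non-vocal frame sequence, as in the
    median-filter post-processing stage of a singing voice detector."""
    preds = np.asarray(frame_preds)
    half = kernel // 2
    # Edge-pad so the sliding window is defined at the boundaries.
    padded = np.pad(preds, half, mode="edge")
    return np.array([int(np.median(padded[i:i + kernel]))
                     for i in range(len(preds))])

# An isolated non-vocal frame inside a vocal run is smoothed away:
print(median_smooth([1, 1, 0, 1, 1, 1, 1, 0, 0, 0]).tolist())
# -> [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

An HMM-based filter plays the same role but additionally models transition probabilities between the vocal and non-vocal states.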

    Vocal Separation in Music Using Inherent Characteristics

    Ph.D. dissertation, Seoul National University, Graduate School of Convergence Science and Technology, February 2018. Advisor: Kyogu Lee.
    Singing voice separation (SVS) refers to the task, or the method, of decomposing a music signal into the singing voice and its accompanying instruments. It has various uses, from preprocessing steps that extract the musical features of a target source to direct applications such as vocal training.
This thesis aims to identify the common properties of singing voice and accompaniment and to apply them to advance state-of-the-art SVS algorithms. In particular, it concentrates on the following separation setting, termed `characteristics-based.' First, the music signal is assumed to be provided as a monaural, single-channel recording. This is a more difficult condition than multi-channel recording, since spatial information cannot be exploited in the separation procedure. The thesis also focuses on an unsupervised approach that does not use machine learning to estimate source models from training data; the models are instead derived from low-level characteristics and incorporated into the objective function. Finally, no external information such as lyrics, scores, or user guidance is provided. Unlike blind source separation problems, however, the classes of the target sources, singing voice and accompaniment, are known in the SVS problem, which allows their respective properties to be analyzed. Three characteristics are primarily discussed in this thesis. Continuity, in the spectral or temporal dimension, refers to the smoothness of a source along that axis: spectral continuity is related to timbre, while temporal continuity represents the stability of a sound over time. Low-rankness refers to how well-structured a signal is, i.e., whether it can be represented as low-rank data, and sparsity represents how rarely the sounds in a signal occur in time and frequency. This thesis discusses two SVS approaches using the above characteristics. The first is based on continuity and sparsity and extends harmonic-percussive sound separation (HPSS). While the conventional algorithm separates singing voice using a two-stage HPSS, the proposed one uses a single-stage procedure with an additional sparse residual term in the objective function.
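The HPSS building block exploits exactly the continuity properties described above. A minimal median-filtering sketch (in the style of Fitzgerald's method, not the thesis's objective-function formulation with a sparse residual term) looks like:

```python
import numpy as np

def median_time(S, k):
    # Sliding median of width k along the time axis (per frequency bin).
    half = k // 2
    Sp = np.pad(S, ((0, 0), (half, half)), mode="edge")
    return np.median(
        np.stack([Sp[:, i:i + S.shape[1]] for i in range(k)]), axis=0)

def hpss_masks(S, kernel=9):
    """Harmonic-percussive masks via median filtering.

    S: magnitude spectrogram, shape (freq_bins, time_frames).
    Harmonic sounds are smooth along time, percussive sounds along
    frequency, so each median filter enhances one component.
    """
    H = median_time(S, kernel)        # harmonic-enhanced
    P = median_time(S.T, kernel).T    # percussive-enhanced (freq axis)
    harmonic_mask = H >= P            # assign each bin to the stronger side
    return harmonic_mask, ~harmonic_mask

# Synthetic check: a sustained tone (one frequency row) is harmonic,
# a broadband click (one time column) is percussive.
S = np.zeros((32, 32))
S[10, :] = 1.0
S[:, 20] = 1.0
hm, pm = hpss_masks(S)
print(bool(hm[10, 5]), bool(pm[15, 20]))  # -> True True
```

In the two-stage scheme this decomposition is applied twice at different resolutions; the thesis's single-stage method instead adds a sparse residual component to capture the vocals directly.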
The other SVS approach is based on low-rankness and sparsity. Assuming that the accompaniment can be represented by a low-rank model whereas the singing voice has a sparse distribution, the conventional algorithm decomposes the sources using robust principal component analysis (RPCA). In this thesis, generalizations and extensions of RPCA tailored to SVS are discussed, including the use of the Schatten p-norm and lp-norm, scale compression, and spectral distribution weighting. The presented algorithms were evaluated on various datasets and challenges and achieved results better than or comparable to state-of-the-art algorithms.
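The baseline RPCA decomposition can be sketched with the standard inexact augmented Lagrangian multiplier (ALM) scheme. This illustrates plain RPCA with the common default lam = 1/sqrt(max(m, n)), not the thesis's generalized Schatten-p / lp-norm or weighted variants:

```python
import numpy as np

def rpca(M, lam=None, tol=1e-6, max_iter=200):
    """Inexact-ALM sketch of RPCA: M ~ L (low-rank accompaniment)
    + S (sparse vocals)."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    fro = np.linalg.norm(M)               # Frobenius norm, for the stop test
    spec = np.linalg.norm(M, 2)           # spectral norm
    Y = M / max(spec, np.abs(M).max() / lam)  # dual variable init
    mu, rho = 1.25 / spec, 1.5
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # Singular-value thresholding -> low-rank update
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # Elementwise soft thresholding -> sparse update
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        Z = M - L - S                     # residual
        Y, mu = Y + mu * Z, mu * rho
        if np.linalg.norm(Z) / fro < tol:
            break
    return L, S
```

In the SVS setting M is a magnitude spectrogram; the vocal estimate is then resynthesized by masking, e.g. keeping the time-frequency bins where |S| exceeds |L|.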

    A Critical Look at the Music Classification Experiment Pipeline: Using Interventions to Detect and Account for Confounding Effects

    PhD thesis. This dissertation focuses on the problem of confounding in the design and analysis of music classification experiments. Classification experiments dominate the evaluation of music content analysis systems and methods, but achieving high performance in such experiments does not guarantee that systems properly address the intended problem. The research presented here proposes and illustrates modifications to the conventional experimental pipeline, which aim at improving the understanding of the evaluated systems and methods and facilitating valid conclusions on their suitability for the target problem. Firstly, multiple analyses are conducted to determine which cues scattering-based systems use to predict the annotations of the GTZAN music genre collection. In-depth system analysis informs empirical approaches that alter the experimental pipeline. In particular, deflation manipulations and targeted interventions on the partitioning strategy, the learning algorithm and the frequency content of the data reveal that systems using scattering-based features exploit faults in GTZAN and previously unknown information at inaudible frequencies. Secondly, the use of interventions on the experimental pipeline is extended and systematised into a procedure for characterising the effects of confounding information in the results of classification experiments. Regulated bootstrap, a novel resampling strategy, is proposed to address challenges associated with interventions dealing with partitioning. The procedure is demonstrated on GTZAN, analysing the effect of artist replication and infrasonic information on performance measurements using a wide range of system construction methods. Finally, mathematical models relating measurements from classification experiments to potentially contributing factors are proposed and discussed.
Such models enable decomposing measurements into contributions of interest, which may differ depending on the goals of the study, including those from pipeline interventions. The adequacy for classification experiments of some conventional assumptions underlying such models is also examined. The reported research highlights the need for evaluation procedures that go beyond performance maximisation. Accounting for the effects of confounding information using procedures grounded in the principles of experimental design promises to facilitate the development of systems that generalise beyond restricted experimental settings.
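One of the partitioning interventions discussed above, the artist filter, can be sketched as a group-disjoint split: no artist's tracks appear on both sides, removing the artist-replication confound. This is a generic illustration; the function name and (track_id, artist) data layout are hypothetical:

```python
import random
from collections import defaultdict

def artist_filtered_split(tracks, test_fraction=0.3, seed=0):
    """Partition tracks so that no artist appears in both train and test.

    tracks: list of (track_id, artist) pairs.
    """
    by_artist = defaultdict(list)
    for track_id, artist in tracks:
        by_artist[artist].append(track_id)
    artists = sorted(by_artist)
    random.Random(seed).shuffle(artists)       # reproducible assignment
    n_test = max(1, int(len(artists) * test_fraction))
    train = [t for a in artists[n_test:] for t in by_artist[a]]
    test = [t for a in artists[:n_test] for t in by_artist[a]]
    return train, test

tracks = [("t1", "A"), ("t2", "A"), ("t3", "B"), ("t4", "C"), ("t5", "C")]
train, test = artist_filtered_split(tracks)
# Every artist's tracks end up entirely on one side of the split.
```

Comparing performance under this split against a random track-level split is one way to measure how much of the reported accuracy is attributable to artist replication rather than the target concept.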