7 research outputs found
A Hybrid Approach with Multi-channel I-Vectors and Convolutional Neural Networks for Acoustic Scene Classification
In Acoustic Scene Classification (ASC), two major approaches have been
followed. While one utilizes engineered features such as
mel-frequency cepstral coefficients (MFCCs), the other uses learned features
that are the outcome of an optimization algorithm. I-vectors are the result of
a modeling technique that usually takes engineered features as input. It has
been shown that standard MFCCs extracted from monaural audio signals lead to
i-vectors that exhibit poor performance, especially on indoor acoustic scenes.
At the same time, Convolutional Neural Networks (CNNs) are well known for their
ability to learn features by optimizing their filters. They have been applied
to ASC and have shown promising results. In this paper, we first propose a
novel multi-channel i-vector extraction and scoring scheme for ASC, improving
i-vector performance on both indoor and outdoor scenes. Second, we propose a CNN
architecture that achieves promising ASC results. Further, we show that
i-vectors and CNNs capture complementary information from acoustic scenes.
Finally, we propose a hybrid system for ASC that combines multi-channel i-vectors and
CNNs through a score fusion technique. Using our method, we participated
in the ASC task of the DCASE-2016 challenge. Our hybrid approach achieved first
rank among 49 submissions, substantially improving on the previous state of the
art.
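The hybrid step can be illustrated with a late score-fusion sketch. This is not the authors' exact scheme; the weight `w`, the min-max normalisation, and the per-class scores below are hypothetical assumptions for illustration:

```python
def minmax(scores):
    # Rescale a list of per-class scores to [0, 1] so that the two
    # classifiers' outputs become comparable before fusion.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(ivector_scores, cnn_scores, w=0.5):
    # Weighted sum of normalised scores; w balances the two systems.
    a, b = minmax(ivector_scores), minmax(cnn_scores)
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def predict(fused, labels):
    # The predicted scene is the class with the highest fused score.
    return max(zip(fused, labels))[1]

# Hypothetical per-class scores for three acoustic scenes.
labels = ["home", "park", "tram"]
fused = fuse([1.2, 0.4, 0.9], [0.1, 0.7, 0.5], w=0.6)
print(predict(fused, labels))  # → tram
```

The point of the fusion is visible in the toy numbers: each system alone favours a different class, but the class scoring consistently well under both wins after fusion.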
Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation
Singing voice detection is the task of identifying which frames contain the
singer's vocal and which do not. It has been one of the main components in music
information retrieval (MIR), with applications to melody extraction,
artist recognition, and music discovery in popular music. Although several
methods have been proposed, a more robust and more complete
system is desired to improve detection performance. In this paper, our
motivation is to provide an extensive comparison of the different stages of singing
voice detection. Based on this analysis, a novel method is proposed to build a
more efficient singing voice detection system. The proposed system has
three main parts. The first is a singing voice separation pre-process to
extract the vocal from the music; the improvements offered by several singing voice
separation methods were compared to select the best one, which is integrated into the
singing voice detection system. The second is a deep neural network based
classifier to identify the given frames; different deep models for
classification were also compared. The last is a post-process to filter out
anomalous frames in the classifier's predictions; a median filter
and a Hidden Markov Model (HMM) based filter were compared as the post-process.
Through this step-by-step module extension, the different methods were compared
and analyzed. Finally, classification performance on two public datasets
indicates that the proposed approach, which is based on the Long-term Recurrent
Convolutional Network (LRCN) model, is a promising alternative.
Comment: 15 pages
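The median-filter post-process described above can be sketched as follows; the window length and the binary frame decisions are illustrative, not values from the paper:

```python
def median_smooth(frames, win=5):
    # Replace each frame decision by the median of a window centred on
    # it, removing isolated spurious voiced/unvoiced frames.
    half = win // 2
    out = []
    for i in range(len(frames)):
        # Clamp the window at the sequence boundaries.
        window = frames[max(0, i - half):i + half + 1]
        out.append(sorted(window)[len(window) // 2])
    return out

# A lone 0 inside a voiced run and a lone 1 inside silence are removed.
print(median_smooth([1, 1, 0, 1, 1, 0, 0, 1, 0, 0], win=3))
# → [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

An HMM-based filter would instead decode the most likely voiced/unvoiced state sequence under learned transition probabilities, which penalises rapid state switching more globally than a fixed window.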
Vocal Separation in Music Using Inherent Characteristics
Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, February 2018. Advisor: Kyogu Lee.
Singing voice separation (SVS) refers to the task, or the method, of decomposing a music signal into the singing voice and its accompanying instruments. It has various uses, from a preprocessing step to extract the musical features implied in the target source, to applications in their own right, such as vocal training.
This thesis aims to discover the common properties of singing voice and accompaniment, and to apply them to advance state-of-the-art SVS algorithms. In particular, the separation approach described below, named the 'characteristics-based' approach, is the focus of this thesis. First, the music signal is assumed to be provided in monaural form, i.e., as a single-channel recording. This is a more difficult condition than multiple-channel recording, since spatial information cannot be exploited in the separation procedure. This thesis also focuses on the unsupervised approach, which does not use machine learning techniques to estimate the source models from training data. The models are instead derived from low-level characteristics and applied to the objective function. Finally, no external information such as lyrics, score, or user guidance is provided. Unlike blind source separation problems, however, the classes of the target sources, singing voice and accompaniment, are known in the SVS problem, which allows their respective properties to be estimated.
Three different characteristics are primarily discussed in this thesis. Continuity, in the spectral or temporal dimension, refers to the smoothness of the source in that particular aspect: spectral continuity is related to timbre, while temporal continuity represents the stability of sounds over time. Low-rankness refers to how well-structured the signal is, so that it can be represented as low-rank data, and sparsity represents how rarely the sounds in a signal occur in time and frequency.
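The low-rankness and sparsity characteristics can be exploited jointly: a magnitude spectrogram M is decomposed as M ≈ L + S with L low-rank (accompaniment) and S sparse (voice). A minimal RPCA sketch in the inexact augmented-Lagrangian style follows; the parameters (mu, rho, iteration count) are illustrative defaults, not the generalized Schatten-norm variants developed in the thesis:

```python
import numpy as np

def shrink(X, tau):
    # Elementwise soft-thresholding: the proximal operator of the l1-norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    # Singular value thresholding: the proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, lam=None, mu=1.0, rho=1.5, n_iter=50):
    # Decompose M into low-rank L and sparse S by minimising
    # ||L||_* + lam * ||S||_1 subject to L + S = M.
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))  # common default weight
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multipliers for the constraint
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
        mu *= rho  # grow the penalty so the constraint residual vanishes
    return L, S
```

In the SVS setting, L and S would be used as soft masks on the mixture spectrogram before inverting back to the time domain.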
This thesis discusses two SVS approaches using the above characteristics. The first is based on continuity and sparsity, and extends harmonic-percussive sound separation (HPSS). While the conventional algorithm separates the singing voice using a two-stage HPSS, the proposed one uses a single-stage procedure with an additional sparse residual term in the objective function. The other SVS approach is based on low-rankness and sparsity. Assuming that the accompaniment can be represented by a low-rank model, whereas the singing voice has a sparse distribution, the conventional algorithm decomposes the sources using robust principal component analysis (RPCA). In this thesis, generalizations and extensions of RPCA specifically for SVS are discussed, including the use of the Schatten p-norm and lp-norm, scale compression, and the spectral distribution. The presented algorithms were evaluated on various datasets and challenges, and achieved results better than or comparable to the state-of-the-art algorithms.
Chapter 1 Introduction 1
1.1 Motivation 4
1.2 Applications 5
1.3 Definitions and keywords 6
1.4 Evaluation criteria 7
1.5 Topics of interest 11
1.6 Outline of the thesis 13
Chapter 2 Background 15
2.1 Spectrogram-domain separation framework 15
2.2 Approaches for singing voice separation 19
2.2.1 Characteristics-based approach 20
2.2.2 Spatial approach 21
2.2.3 Machine learning-based approach 22
2.2.4 Informed approach 23
2.3 Datasets and challenges 25
2.3.1 Datasets 25
2.3.2 Challenges 26
Chapter 3 Characteristics of music sources 28
3.1 Introduction 28
3.2 Spectral/temporal continuity 29
3.2.1 Continuity of a spectrogram 29
3.2.2 Continuity of musical sources 30
3.3 Low-rankness 31
3.3.1 Low-rankness of a spectrogram 31
3.3.2 Low-rankness of musical sources 33
3.4 Sparsity 34
3.4.1 Sparsity of a spectrogram 34
3.4.2 Sparsity of musical sources 36
3.5 Experiments 38
3.6 Summary 39
Chapter 4 Singing voice separation using continuity and sparsity 43
4.1 Introduction 43
4.2 SVS using two-stage HPSS 45
4.2.1 Harmonic-percussive sound separation 45
4.2.2 SVS using two-stage HPSS 46
4.3 Proposed algorithm 48
4.4 Experimental evaluation 52
4.4.1 MIR-1K dataset 52
4.4.2 Beach Boys dataset 55
4.4.3 iKala dataset in MIREX 2014 56
4.5 Conclusion 58
Chapter 5 Singing voice separation using low-rankness and sparsity 61
5.1 Introduction 61
5.2 SVS using robust principal component analysis 63
5.2.1 Robust principal component analysis 63
5.2.2 Optimization for RPCA using augmented Lagrangian multiplier method 63
5.2.3 SVS using RPCA 65
5.3 SVS using generalized RPCA 67
5.3.1 Generalized RPCA using Schatten p- and lp-norm 67
5.3.2 Comparison of pRPCA with robust matrix completion 68
5.3.3 Optimization method of pRPCA 69
5.3.4 Discussion of the normalization factor for ฮป 69
5.3.5 Generalized RPCA using scale compression 71
5.3.6 Experimental results 72
5.4 SVS using RPCA and spectral distribution 73
5.4.1 RPCA with weighted l1-norm 73
5.4.2 Proposed method: SVS using wRPCA 74
5.4.3 Experimental results using DSD100 dataset 78
5.4.4 Comparison with state-of-the-arts in SiSEC 2016 79
5.4.5 Discussion 85
5.5 Summary 86
Chapter 6 Conclusion and Future Work 88
6.1 Conclusion 88
6.2 Contributions 89
6.3 Future work 91
6.3.1 Discovering various characteristics for SVS 91
6.3.2 Expanding to other SVS approaches 92
6.3.3 Applying the characteristics for deep learning models 92
Bibliography 94
Abstract (in Korean) 110
A Critical Look at the Music Classification Experiment Pipeline: Using Interventions to Detect and Account for Confounding Effects
PhD Thesis
This dissertation focuses on the problem of confounding in the design and analysis of music
classification experiments. Classification experiments dominate the evaluation of music
content analysis systems and methods, but achieving high performance in such experiments
does not guarantee that systems properly address the intended problem. The research
presented here proposes and illustrates modifications to the conventional experimental
pipeline which aim at improving the understanding of the evaluated systems and methods,
facilitating valid conclusions on their suitability for the target problem.
Firstly, multiple analyses are conducted to determine which cues scattering-based systems
use to predict the annotations of the GTZAN music genre collection. In-depth system
analysis informs empirical approaches that alter the experimental pipeline. In particular,
deflation manipulations and targeted interventions on the partitioning strategy,
the learning algorithm, and the frequency content of the data reveal that systems using
scattering-based features exploit faults in GTZAN and previously unknown information
at inaudible frequencies.
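One such intervention on the frequency content can be sketched as follows: remove the inaudible low-frequency band from each recording and re-measure performance; a drop suggests the system was exploiting that band. The cutoff, sample rate, and FFT brick-wall filter here are illustrative assumptions, not the dissertation's exact procedure:

```python
import numpy as np

def remove_band(x, sr, f_lo=0.0, f_hi=20.0):
    # Zero out spectral content in [f_lo, f_hi) Hz via an FFT mask,
    # leaving the rest of the band untouched (brick-wall filter).
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    X[(freqs >= f_lo) & (freqs < f_hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

# A 5 Hz (infrasonic) component plus a 440 Hz tone, 1 s at 1 kHz.
sr = 1000
t = np.arange(1000) / 1000.0
x = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 440 * t)
y = remove_band(x, sr, 0.0, 20.0)
# y now contains (numerically) only the 440 Hz tone; the intervention
# would re-evaluate the classifier on such filtered data.
```

If a genre classifier's accuracy collapses on `y`-style data while human-audible content is unchanged, the measured performance was partly driven by confounding infrasonic information.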
Secondly, the use of interventions on the experimental pipeline is extended and systematised
into a procedure for characterising the effects of confounding information in the
results of classification experiments. Regulated bootstrap, a novel resampling strategy,
is proposed to address challenges associated with interventions dealing with partitioning.
The procedure is demonstrated on GTZAN, analysing the effect of artist replication
and infrasonic information on performance measurements using a wide range of system
construction methods.
Finally, mathematical models relating measurements from classification experiments
to potentially contributing factors are proposed and discussed. Such models enable decomposing
measurements into contributions of interest, which may differ depending on
the goals of the study, including those from pipeline interventions. The adequacy for classification
experiments of some conventional assumptions underlying such models is also
examined.
The reported research highlights the need for evaluation procedures that go beyond
performance maximisation. Accounting for the effects of confounding information using
procedures grounded in the principles of experimental design promises to facilitate the
development of systems that generalise beyond restricted experimental settings.