6 research outputs found

    A Dataset of Norwegian Hardanger Fiddle Recordings with Precise Annotation of Note and Beat Onsets

    The Hardanger fiddle is a variety of the violin used in the folk music of the western and central parts of southern Norway. This paper presents a dataset of several hours of recordings of Hardanger fiddle music, with note annotations of onsets, offsets, and pitches provided by the performers themselves. A subset has also been annotated with beat onset positions by the performer as well as three expert musicians. The complexity of the genre (polyphonic, highly ornamented, and with a very irregular pulsation, among other aspects) motivated the design of new annotation software adapted to these particular needs. Beat annotation in MIR is typically recorded as positions in seconds, without an explicit connection to actual musical events. For music in which the rhythm is carried by the melodic instrument alone, a more reliable definition of beat onsets associates each beat with the onset of the note that starts it. This definition reflects the fact that beats are generated from within the flow of played melodic-rhythmic events, which implies that the spacing of beats may be shifting and irregular. It motivated a new method for beat annotation in Hardanger fiddle music based on selecting notes from the note annotation. Comparisons between annotators through alignment, integrated in the interface, let annotators correct their annotations or observe alternative valid interpretations of a given excerpt. After dedicating part of the note annotation dataset to training a machine learning model for the task of estimating both note pitch and onset time, an F1 score of 87% can be reached. The beat annotation dataset demonstrates the need for new beat trackers adapted to Hardanger fiddle music. The dataset and the annotation software are made publicly available.
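    The note-level F1 score combines an onset timing criterion with a pitch criterion. A minimal sketch of one such evaluation, assuming an illustrative 50 ms onset tolerance and half-semitone pitch tolerance (the paper's exact matching criteria may differ):

```python
import numpy as np

def note_f1(ref_onsets, ref_pitches, est_onsets, est_pitches,
            onset_tol=0.05, pitch_tol=0.5):
    """Greedy one-to-one matching of estimated notes to reference notes.

    A match requires the onset difference to be within `onset_tol` seconds
    and the pitch difference to be within `pitch_tol` semitones.
    """
    matched = 0
    used = np.zeros(len(est_onsets), dtype=bool)
    for r_on, r_p in zip(ref_onsets, ref_pitches):
        for j, (e_on, e_p) in enumerate(zip(est_onsets, est_pitches)):
            if used[j]:
                continue
            if abs(r_on - e_on) <= onset_tol and abs(r_p - e_p) <= pitch_tol:
                used[j] = True
                matched += 1
                break
    precision = matched / max(len(est_onsets), 1)
    recall = matched / max(len(ref_onsets), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```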

    Singing Voice Separation in Music Using Inherent Characteristics

    Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies, February 2018. Advisor: Kyogu Lee.
    Singing voice separation (SVS) refers to the task, or the method, of decomposing a music signal into the singing voice and its accompanying instruments. It has various uses, from preprocessing steps that extract the musical features carried by a target source to applications of the separated signals themselves, such as vocal training. This thesis aims to identify common properties of singing voice and accompaniment and to apply them to advance state-of-the-art SVS algorithms. In particular, it concentrates on the following separation setting, here called 'characteristics-based'. First, the music signal is assumed to be provided as a monaural, single-channel recording. This is a more difficult condition than multichannel recording, since spatial information cannot be exploited in the separation procedure. The thesis also focuses on an unsupervised approach that does not use machine learning to estimate source models from training data; the models are instead derived from low-level characteristics and reflected in the objective function. Finally, no external information such as lyrics, a score, or user guidance is provided. Unlike blind source separation problems, however, the classes of the target sources (singing voice and accompaniment) are known in the SVS problem, which allows their respective properties to be analyzed. Three characteristics are primarily discussed. Continuity, in the spectral or temporal dimension, refers to the smoothness of a source in that aspect: spectral continuity is related to timbre, while temporal continuity represents the stability of sounds. Low-rankness refers to how well-structured a signal is and whether it can be represented as low-rank data, and sparsity represents how rarely the sounds in a signal occur in time and frequency. Two SVS approaches using these characteristics are discussed. The first is based on continuity and sparsity and extends harmonic-percussive sound separation (HPSS): while the conventional algorithm separates the singing voice with a two-stage HPSS, the proposed one uses a single-stage procedure with an additional sparse residual term in the objective function. The second approach is based on low-rankness and sparsity. Assuming that the accompaniment can be represented by a low-rank model whereas the singing voice has a sparse distribution, the conventional algorithm decomposes the sources using robust principal component analysis (RPCA). This thesis discusses generalizations and extensions of RPCA specifically for SVS, including replacing the nuclear norm and l1 norm with the Schatten p-norm and lp-norm, scale compression, and weighting by spectral distribution. The presented algorithms were evaluated on various datasets and challenges and achieved results better than or comparable to those of state-of-the-art algorithms.
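    The RPCA-based family of methods above decomposes a magnitude spectrogram M into a low-rank accompaniment term L and a sparse vocal term S by solving min ||L||_* + lambda*||S||_1 subject to M = L + S. A minimal NumPy sketch of that baseline decomposition via the inexact augmented Lagrangian method, before any of the Schatten-p, scale compression, or weighting extensions discussed above (the parameter choices are common defaults from the RPCA literature, not the thesis's settings):

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    """Soft thresholding: proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, max_iter=200, tol=1e-7):
    """Inexact-ALM RPCA: split M into a low-rank L and a sparse S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))                 # standard normalization for lambda
    mu = m * n / (4.0 * np.abs(M).sum() + 1e-12)   # common step-size heuristic
    Y = np.zeros_like(M)                           # Lagrange multipliers
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)          # low-rank (accompaniment) update
        S = soft(M - L + Y / mu, lam / mu)         # sparse (vocal) update
        R = M - L - S                              # constraint residual
        Y += mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S
```

    In an SVS pipeline, a soft mask such as |S| / (|L| + |S|) would then be applied to the complex mixture spectrogram before inversion back to a waveform.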
    Table of contents:
    Chapter 1 Introduction
      1.1 Motivation
      1.2 Applications
      1.3 Definitions and keywords
      1.4 Evaluation criteria
      1.5 Topics of interest
      1.6 Outline of the thesis
    Chapter 2 Background
      2.1 Spectrogram-domain separation framework
      2.2 Approaches for singing voice separation
        2.2.1 Characteristics-based approach
        2.2.2 Spatial approach
        2.2.3 Machine learning-based approach
        2.2.4 Informed approach
      2.3 Datasets and challenges
        2.3.1 Datasets
        2.3.2 Challenges
    Chapter 3 Characteristics of music sources
      3.1 Introduction
      3.2 Spectral/temporal continuity
        3.2.1 Continuity of a spectrogram
        3.2.2 Continuity of musical sources
      3.3 Low-rankness
        3.3.1 Low-rankness of a spectrogram
        3.3.2 Low-rankness of musical sources
      3.4 Sparsity
        3.4.1 Sparsity of a spectrogram
        3.4.2 Sparsity of musical sources
      3.5 Experiments
      3.6 Summary
    Chapter 4 Singing voice separation using continuity and sparsity
      4.1 Introduction
      4.2 SVS using two-stage HPSS
        4.2.1 Harmonic-percussive sound separation
        4.2.2 SVS using two-stage HPSS
      4.3 Proposed algorithm
      4.4 Experimental evaluation
        4.4.1 MIR-1K dataset
        4.4.2 Beach Boys dataset
        4.4.3 iKala dataset in MIREX 2014
      4.5 Conclusion
    Chapter 5 Singing voice separation using low-rankness and sparsity
      5.1 Introduction
      5.2 SVS using robust principal component analysis
        5.2.1 Robust principal component analysis
        5.2.2 Optimization for RPCA using the augmented Lagrangian multiplier method
        5.2.3 SVS using RPCA
      5.3 SVS using generalized RPCA
        5.3.1 Generalized RPCA using Schatten p- and lp-norms
        5.3.2 Comparison of pRPCA with robust matrix completion
        5.3.3 Optimization method of pRPCA
        5.3.4 Discussion of the normalization factor for λ
        5.3.5 Generalized RPCA using scale compression
        5.3.6 Experimental results
      5.4 SVS using RPCA and spectral distribution
        5.4.1 RPCA with weighted l1-norm
        5.4.2 Proposed method: SVS using wRPCA
        5.4.3 Experimental results using the DSD100 dataset
        5.4.4 Comparison with the state of the art in SiSEC 2016
        5.4.5 Discussion
      5.5 Summary
    Chapter 6 Conclusion and Future Work
      6.1 Conclusion
      6.2 Contributions
      6.3 Future work
        6.3.1 Discovering various characteristics for SVS
        6.3.2 Expanding to other SVS approaches
        6.3.3 Applying the characteristics to deep learning models
    Bibliography
    Abstract (in Korean)

    Underdetermined convolutive source separation using two dimensional non-negative factorization techniques

    PhD thesis. In this thesis, underdetermined audio source separation is considered: estimating the original audio sources from the observed mixture when the number of sources is greater than the number of channels. The separation is carried out using two approaches: blind audio source separation and informed audio source separation. The blind approach depends on the mixture signal only and assumes that the separation is accomplished without any prior information (or as little as possible) about the sources. The informed approach uses an exemplar in addition to the mixture signal to emulate the target speech signal to be separated. Both approaches are based on two-dimensional factorization techniques that decompose the signal into two tensors convolved in both the temporal and spectral directions, and both are applied to convolutive and highly reverberant convolutive mixtures, which are more realistic than instantaneous mixtures. In this work, a novel algorithm based on nonnegative matrix factor two-dimensional deconvolution (NMF2D) with adaptive sparsity is proposed to separate audio sources mixed in an underdetermined convolutive mixture. Additionally, a novel Gamma Exponential Process is proposed for estimating the convolutive parameters and the number of components of NMF2D/NTF2D, and for initializing the NMF2D parameters. The effects of different window lengths are also investigated to determine the model that best suits the characteristics of the audio signal. Furthermore, a novel algorithm, the fusion of K models of full-rank weighted nonnegative tensor factor two-dimensional deconvolution (K-wNTF2D), is proposed. K-wNTF2D is developed for its ability to model both spectral and temporal changes, together with a spatial covariance matrix that addresses the high-reverberation problem. A variable sparsity term derived from the Gibbs distribution is optimized under the Itakura-Saito divergence and adapted into the K-wNTF2D model. The tensors of this algorithm are initialized by a novel method, SVD two-dimensional deconvolution (SVD2D). Finally, two novel informed source separation algorithms, a semi-exemplar-based algorithm and an exemplar-based algorithm, are proposed. These algorithms are based on the NMF2D model and the proposed two-dimensional nonnegative matrix partial co-factorization (2DNMPCF) model. The idea of incorporating the exemplar is to inform the separation algorithms about the target signal by initializing their parameters and guiding the separation. Adaptive sparsity is derived for both proposed algorithms, and a multistage version of the proposed exemplar-based algorithm is introduced to further enhance separation performance. Results show that the proposed separation algorithms are promising, flexible, and offer an alternative to conventional methods.
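    NMF2D extends NMF by convolving the factors in both directions: each spectral basis is shifted along time (by tau frames) to model temporal evolution, and along a log-frequency axis (by phi bins) to model pitch change. A minimal sketch of the reconstruction step only, with illustrative shapes and names; the multiplicative updates, the adaptive sparsity term, and the Gamma Exponential Process of the thesis are omitted:

```python
import numpy as np

def nmf2d_reconstruct(W, H):
    """NMF2D reconstruction: the factors are convolved along both axes.

    W: (n_tau, F, K) spectral bases, one F x K matrix per temporal shift tau.
    H: (n_phi, K, T) activations, one K x T matrix per frequency shift phi.
    Model: V ~= sum over tau, phi of (W[tau] shifted down by phi) @ (H[phi] shifted right by tau).
    Frequency shifts model pitch change, so F is assumed to be a log-frequency axis.
    """
    n_tau, F, K = W.shape
    n_phi, _, T = H.shape
    V = np.zeros((F, T))
    for tau in range(n_tau):
        for phi in range(n_phi):
            W_shift = np.zeros((F, K))
            W_shift[phi:, :] = W[tau, :F - phi, :]   # shift bases down by phi bins
            H_shift = np.zeros((K, T))
            H_shift[:, tau:] = H[phi, :, :T - tau]   # shift activations right by tau frames
            V += W_shift @ H_shift
    return V
```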

    Single channel overlapped-speech detection and separation of spontaneous conversations

    PhD thesis. In this thesis, spontaneous conversation containing both speech mixture and speech dialogue is considered. The speech mixture refers to speakers speaking simultaneously (overlapped speech); the speech dialogue refers to only one speaker actively speaking while the other is silent. The input conversation is first processed by overlapped-speech detection, and the two output signals are segregated into dialogue and mixture segments. The dialogue is processed by speaker diarization, whose outputs are the individual speech of each speaker. The mixture is processed by speech separation, whose outputs are independent separated speech signals, one per speaker. When the separation input contains only the mixture, a blind speech separation approach is used; when the separation is assisted by the outputs of the speaker diarization, it is informed speech separation. The research presents a novel overlapped-speech detection algorithm and two novel speech separation algorithms. The proposed overlapped-speech detection algorithm estimates the switching instants of the input; an optimization loop, based on principles of pattern recognition and k-means clustering, retains the best encapsulated audio features and discards the worst. Over 300 simulated conversations, the average false-alarm error is 1.9%, the missed-speech error is 0.4%, and the overlap-speaker error is 1%, approximately equal to the errors reported on the best recent speaker diarization corpora. The proposed blind speech separation algorithm consists of four sequential stages: filter-bank analysis, non-negative matrix factorization (NMF), speaker clustering, and filter-bank synthesis. Instead of requiring speaker segmentation, an effective standard framing is contributed. The average objective scores (SAR, SDR, and SIR) over 51 simulated conversations are 5.06 dB, 4.87 dB, and 12.47 dB respectively. For the proposed informed speech separation algorithm, the outputs of the speaker diarization form a generated database. The database assists the speech separation by creating virtual target-speech and mixture signals; these virtual signals are trained to facilitate the separation by homogenizing them with the NMF matrix elements of the real mixture, and a contributed masking step refines the resulting speech. The average SAR, SDR, and SIR over 341 simulated conversations are 9.55 dB, 1.12 dB, and 2.97 dB respectively. By these objective tests, the two speech separation algorithms are in the mid-range of well-known NMF-based audio and speech separation methods.
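    The four-stage blind algorithm follows a pattern that can be sketched with standard tools. The sketch below substitutes an STFT for the thesis's filter bank and clusters NMF spectral bases with k-means; the component count, clustering features, and soft mask are illustrative assumptions rather than the thesis's configuration:

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

def blind_two_speaker_separation(x, fs, n_components=20, n_speakers=2):
    """Separate a single-channel mixture by grouping NMF components per speaker."""
    f, t, X = stft(x, fs=fs, nperseg=1024)     # analysis (STFT stands in for a filter bank)
    V = np.abs(X)                               # magnitude spectrogram
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = nmf.fit_transform(V)                    # (freq, components): spectral bases
    H = nmf.components_                         # (components, time): activations
    # Cluster the normalized bases so components with similar spectra share a label.
    bases = (W / (W.sum(axis=0, keepdims=True) + 1e-12)).T
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(bases)
    approx = W @ H + 1e-12
    sources = []
    for s in range(n_speakers):
        Vs = W[:, labels == s] @ H[labels == s, :]
        mask = Vs / approx                      # Wiener-like soft mask
        _, xs = istft(mask * X, fs=fs, nperseg=1024)   # synthesis
        sources.append(xs)
    return sources
```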