11,187 research outputs found

    Emotional adaptive training for speaker verification

    Full text link
    Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    Bie, F., Wang, D., Zheng, T.F., Tejedor, J., Chen, R., "Emotional adaptive training for speaker verification", in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific, 2013, pp. 1-4.
    Speaker verification suffers from significant performance degradation under emotion variation. In a previous study, we demonstrated that an adaptation approach based on MLLR/CMLLR can provide a significant performance improvement for verification on emotional speech. This paper follows this direction and presents an emotional adaptive training (EAT) approach. The approach iteratively estimates emotion-dependent CMLLR transformations and re-trains the speaker models with the transformed speech, so it can make use of emotional enrollment speech to train a stronger speaker model. This is similar to speaker adaptive training (SAT) in speech recognition. The experiments are conducted on an emotional speech database comprising recordings of 30 speakers in 5 emotions. The results demonstrate that the EAT approach provides significant performance improvements over the baseline system, where the neutral enrollment data are used to train the speaker models and the emotional test utterances are verified directly. The EAT also significantly outperforms two other emotion-adaptation approaches: (1) the CMLLR-based approach, where the speaker models are trained with the neutral enrollment speech and the emotional test utterances are transformed by CMLLR in verification; and (2) the MAP-based approach, where the emotional enrollment data are used to train emotion-dependent speaker models and the emotional utterances are verified against the emotion-matched models.
    This work was supported by the National Natural Science Foundation of China under Grant No. 61271389 and the National Basic Research Program (973 Program) of China under Grant No. 2013CB329302.
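    The EAT alternation can be sketched compactly. The paper estimates emotion-dependent CMLLR transforms by maximum likelihood against the speaker model; the minimal, self-contained sketch below replaces that step with a simple moment-matching affine map over toy data, so it only illustrates the alternation between transform estimation and model re-training, not the paper's actual estimator. All data, dimensions, and helper names here are illustrative assumptions.

```python
# Sketch of the EAT loop: estimate per-emotion feature transforms, re-train the
# speaker model on transformed enrollment speech, and repeat. The "CMLLR" step
# is approximated by moment matching (whiten with the emotion statistics, then
# re-colour with the current speaker-model statistics).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dim = 13  # e.g. 13 cepstral coefficients per frame

# Toy enrollment data: a few hundred feature frames per emotion for one speaker.
enroll = {e: rng.normal(loc=float(i), scale=1.0 + 0.2 * i, size=(500, dim))
          for i, e in enumerate(["neutral", "happy", "angry"])}

def moment_matching_transform(x, target_mean, target_cov):
    """Affine map x -> A x + b aligning the first two moments of x to the target."""
    mu, cov = x.mean(axis=0), np.cov(x, rowvar=False)
    A = np.linalg.cholesky(target_cov) @ np.linalg.inv(np.linalg.cholesky(cov))
    b = target_mean - A @ mu
    return A, b

# Initial speaker model trained on the raw (untransformed) enrollment speech.
speaker_gmm = GaussianMixture(n_components=8, covariance_type="diag",
                              reg_covar=1e-3, random_state=0)
speaker_gmm.fit(np.vstack(list(enroll.values())))

for it in range(3):  # EAT alternates transform estimation and model re-training
    # Global statistics of the current speaker model (diagonal covariances).
    target_mean = speaker_gmm.weights_ @ speaker_gmm.means_
    target_cov = np.diag(speaker_gmm.weights_ @ speaker_gmm.covariances_)
    # 1) Estimate one emotion-dependent transform per emotion.
    transforms = {e: moment_matching_transform(x, target_mean, target_cov)
                  for e, x in enroll.items()}
    # 2) Re-train the speaker model on the transformed enrollment speech.
    transformed = np.vstack([enroll[e] @ transforms[e][0].T + transforms[e][1]
                             for e in enroll])
    speaker_gmm.fit(transformed)
    print(f"EAT iteration {it}: avg frame log-likelihood = "
          f"{speaker_gmm.score(transformed):.2f}")
```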

    Glottal Source Cepstrum Coefficients Applied to NIST SRE 2010

    Get PDF
    In the present paper, a novel feature set for speaker recognition based on glottal source estimates is presented. An iterative algorithm is used to derive the vocal tract and glottal source estimates from the speech signal. In order to test the importance of glottal source information in speaker characterization, the novel feature set has been tested in the 2010 NIST Speaker Recognition Evaluation (NIST SRE10). The proposed system uses glottal estimate parameter templates and classical cepstral information to build a model for each speaker involved in the recognition process. The ALIZE [1] open-source software has been used to create the GMM models for both background and target speakers. Compared to using mel-frequency cepstral coefficients (MFCC) alone, the misclassification rate on NIST SRE 2010 is reduced from 29.43% to 27.15% when glottal source features are used.
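    As a rough illustration of the scoring pipeline implied above, the sketch below assumes the MFCC stream and the glottal-source parameter stream are concatenated frame by frame and scored with a GMM log-likelihood ratio. The published system uses the ALIZE toolkit with UBM/GMM modelling; here sklearn's GaussianMixture and synthetic data stand in for both, so the snippet shows only the flow, not the reported numbers, and every function and dimension is a placeholder.

```python
# Frame-level fusion of two feature streams followed by GMM likelihood-ratio
# scoring (background model vs. target-speaker model) for verification.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n_mfcc, n_glottal = 19, 10   # illustrative stream dimensions

def fake_utterance(offset, frames=400):
    """Toy stand-in for feature extraction: MFCC stream + glottal-source stream."""
    mfcc = rng.normal(offset, 1.0, size=(frames, n_mfcc))
    glottal = rng.normal(offset * 0.5, 1.0, size=(frames, n_glottal))
    return np.hstack([mfcc, glottal])      # frame-by-frame feature fusion

# Background model from pooled (multi-speaker) data, target model from one speaker.
background = GaussianMixture(n_components=32, covariance_type="diag",
                             random_state=0).fit(
    np.vstack([fake_utterance(o) for o in (-1.0, 0.0, 1.0)]))
target = GaussianMixture(n_components=32, covariance_type="diag",
                         random_state=0).fit(fake_utterance(1.0))

# Verification score: average per-frame log-likelihood ratio.
test = fake_utterance(1.0)
llr = target.score(test) - background.score(test)
print(f"log-likelihood ratio: {llr:.2f}  (accept if above a tuned threshold)")
```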

    Cross-lingual speech emotion recognition through factor analysis

    Get PDF

    Prosodic and spectral iVectors for expressive speech synthesis

    Get PDF
    This work presents a study on the suitability of prosodic and acoustic features, with a special focus on i-vectors, in expressive speech analysis and synthesis. For each utterance of two different databases, a laboratory-recorded emotional acted speech database and an audiobook, several prosodic and acoustic features are extracted. Among them, i-vectors are built not only on the MFCC base, but also on F0, power and syllable durations. Then, unsupervised clustering is performed using different feature combinations. The resulting clusters are evaluated by calculating cluster entropy for labeled portions of the databases. Additionally, synthetic voices are trained, applying speaker adaptive training, from the clusters built from the audiobook. The voices are evaluated in a perceptual test where the participants have to edit an audiobook paragraph using the synthetic voices. The objective results suggest that i-vectors are very useful for the audiobook, where different speakers (book characters) are imitated. On the other hand, for the laboratory recordings, traditional prosodic features outperform i-vectors. Also, a closer analysis of the created clusters suggests that different speakers use different prosodic and acoustic means to convey emotions. The perceptual results suggest that the proposed i-vector-based feature combinations can be used for audiobook clustering and voice training. Peer Reviewed. Postprint (published version).
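    The cluster-entropy evaluation mentioned above can be made concrete with a short sketch: utterances are clustered on some feature combination, and each cluster's entropy over the reference labels (e.g. emotion or book character) is computed and averaged, weighted by cluster size; lower entropy means purer clusters. The i-vector extraction is replaced here with toy embeddings and KMeans, so all names and numbers are illustrative assumptions rather than the paper's setup.

```python
# Weighted cluster entropy of reference labels inside each unsupervised cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
labels = rng.integers(0, 4, size=200)                     # reference labels
features = rng.normal(size=(200, 32)) + labels[:, None]   # toy "i-vector" embeddings

assignments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

def weighted_cluster_entropy(assign, ref):
    """Average entropy (in bits) of the label distribution inside each cluster."""
    total = 0.0
    for c in np.unique(assign):
        ref_in_c = ref[assign == c]
        p = np.bincount(ref_in_c) / len(ref_in_c)
        p = p[p > 0]
        total += len(ref_in_c) / len(ref) * -(p * np.log2(p)).sum()
    return total

print(f"weighted cluster entropy: {weighted_cluster_entropy(assignments, labels):.3f}")
```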

    ๋”ฅ๋Ÿฌ๋‹์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2020. 8. ๊น€๋‚จ์ˆ˜.The neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has shown remarkable generated speech quality, there are still remaining problems such as modeling power in a neural statistical parametric speech synthesis system, style expressiveness, and robust attention model in the end-to-end speech synthesis system. In this thesis, novel alternatives are proposed to resolve these drawbacks of the conventional neural speech synthesis system. In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. Also, we apply an adversarial learning scheme in training AdVRNN to overcome the oversmoothing problem. From the experimental results, we have found that the proposed AdVRNN based method outperforms the conventional RNN-based techniques. In the second approach, we propose a novel style modeling method employing mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in style embedding by applying MINE loss term in the loss function. The experimental results show that the MINE-based method has shown promising performance in both speech quality and style similarity for the global style token-Tacotron. In the third approach, we propose a novel attention method called memory attention for end-to-end speech synthesis, which is inspired by the gating mechanism of long-short term memory (LSTM). Leveraging the gating technique's sequence modeling power in LSTM, memory attention obtains the stable alignment from the content-based and location-based features. We evaluate the memory attention and compare its performance with various conventional attention techniques in single speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can generate speech with large variability robustly. In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. The conventional single attention model may limit the expressivity representing numerous alignment paths depending on style. To achieve a variation in attention alignment, we propose using a multi-attention model with a selection network. The multi-attention plays a role in generating candidates for the target style, and the selection network choose the most proper attention among the multi-attention. The experimental results show that selective multi-attention outperforms the conventional single attention techniques in multi-speaker speech synthesis and emotional speech synthesis.๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ์ˆ ์€ ์ง€๋‚œ ๋ช‡ ๋…„๊ฐ„ ํš”๋ฐœํ•˜๊ฒŒ ๊ฐœ๋ฐœ๋˜๊ณ  ์žˆ๋‹ค. ๋”ฅ๋Ÿฌ๋‹์˜ ๋‹ค์–‘ํ•œ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์Œ์„ฑ ํ•ฉ์„ฑ ํ’ˆ์งˆ์€ ๋น„์•ฝ์ ์œผ๋กœ ๋ฐœ์ „ํ–ˆ์ง€๋งŒ, ์•„์ง ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์Œ์„ฑ ํ•ฉ์„ฑ์—๋Š” ์—ฌ๋Ÿฌ ๋ฌธ์ œ๊ฐ€ ์กด์žฌํ•œ๋‹ค. 
๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ ์Œํ–ฅ ๋ชจ๋ธ์˜ deterministicํ•œ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ๋ง ๋Šฅ๋ ฅ์˜ ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ์Šคํƒ€์ผ์„ ํ‘œํ˜„ํ•˜๋Š” ๋Šฅ๋ ฅ๊ณผ ๊ฐ•์ธํ•œ ์–ดํ…์…˜(attention)์— ๋Œ€ํ•œ ์ด์Šˆ๊ฐ€ ๋Š์ž„์—†์ด ์žฌ๊ธฐ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ธฐ์กด์˜ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•  ์ƒˆ๋กœ์šด ๋Œ€์•ˆ์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๋‰ด๋Ÿด ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐฉ์‹์˜ ์Œํ–ฅ ๋ชจ๋ธ๋ง์„ ๊ณ ๋„ํ™”ํ•˜๊ธฐ ์œ„ํ•œ adversarially trained variational recurrent neural network (AdVRNN) ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. AdVRNN ๊ธฐ๋ฒ•์€ VRNN์„ ์Œ์„ฑ ํ•ฉ์„ฑ์— ์ ์šฉํ•˜์—ฌ ์Œ์„ฑ์˜ ๋ณ€ํ™”๋ฅผ stochastic ํ•˜๊ณ  ์ž์„ธํ•˜๊ฒŒ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ๋˜ํ•œ, ์ ๋Œ€์  ํ•™์Šต์ (adversarial learning) ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ oversmoothing ๋ฌธ์ œ๋ฅผ ์ตœ์†Œํ™” ์‹œํ‚ค๋„๋ก ํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ธฐ์กด์˜ ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ์Œํ–ฅ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜์—ฌ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋จ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ์œ„ํ•œ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ ๊ธฐ๋ฐ˜์˜ ์ƒˆ๋กœ์šด ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ global style token(GST) ๊ธฐ๋ฐ˜์˜ ์Šคํƒ€์ผ ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ, ๋น„์ง€๋„ ํ•™์Šต์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ์›ํ•˜๋Š” ๋ชฉํ‘œ ์Šคํƒ€์ผ์ด ์žˆ์–ด๋„ ์ด๋ฅผ ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ค๊ธฐ ์–ด๋ ค์› ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด GST์˜ ์ถœ๋ ฅ๊ณผ ๋ชฉํ‘œ ์Šคํƒ€์ผ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์˜ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ตœ๋Œ€ํ™” ํ•˜๋„๋ก ํ•™์Šต ์‹œํ‚ค๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ์†์‹คํ•จ์ˆ˜์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ mutual information neural estimator(MINE) ๊ธฐ๋ฒ•์„ ๋„์ž…ํ•˜์˜€๊ณ  ๋‹คํ™”์ž ๋ชจ๋ธ์„ ํ†ตํ•ด ๊ธฐ์กด์˜ GST ๊ธฐ๋ฒ•์— ๋น„ํ•ด ๋ชฉํ‘œ ์Šคํƒ€์ผ์„ ๋ณด๋‹ค ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ์„ธ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๊ฐ•์ธํ•œ ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์–ดํ…์…˜์ธ memory attention์„ ์ œ์•ˆํ•œ๋‹ค. Long-short term memory(LSTM)์˜ gating ๊ธฐ์ˆ ์€ sequence๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š”๋ฐ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์™”๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ์–ดํ…์…˜์— ์ ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์„ ๊ฐ€์ง„ ์Œ์„ฑ์—์„œ๋„ ์–ดํ…์…˜์˜ ๋Š๊น€, ๋ฐ˜๋ณต ๋“ฑ์„ ์ตœ์†Œํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๋‹จ์ผ ํ™”์ž์™€ ๊ฐ์ • ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ํ† ๋Œ€๋กœ memory attention์˜ ์„ฑ๋Šฅ์„ ํ™•์ธํ•˜์˜€์œผ๋ฉฐ ๊ธฐ์กด ๊ธฐ๋ฒ• ๋Œ€๋น„ ๋ณด๋‹ค ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ณก์„ ์„ ์–ป์„ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, selective multi-attention (SMA)์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ์–ดํ…์…˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์—ฐ๊ตฌ์—์„œ๋Š” ๋‚ญ๋…์ฒด ๋‹จ์ผํ™”์ž์˜ ๊ฒฝ์šฐ์™€ ๊ฐ™์€ ๋‹จ์ผ ์–ดํ…์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์™”๋‹ค. ํ•˜์ง€๋งŒ ์Šคํƒ€์ผ ์Œ์„ฑ์˜ ๊ฒฝ์šฐ ๋ณด๋‹ค ๋‹ค์–‘ํ•œ ์–ดํ…์…˜ ํ‘œํ˜„์„ ์š”๊ตฌํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์ค‘ ์–ดํ…์…˜์„ ํ™œ์šฉํ•˜์—ฌ ํ›„๋ณด๋“ค์„ ์ƒ์„ฑํ•˜๊ณ  ์ด๋ฅผ ์„ ํƒ ๋„คํŠธ์›Œํฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ตœ์ ์˜ ์–ดํ…์…˜์„ ์„ ํƒํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. 
SMA ๊ธฐ๋ฒ•์€ ๊ธฐ์กด์˜ ์–ดํ…์…˜๊ณผ์˜ ๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ๋ณด๋‹ค ๋งŽ์€ ์Šคํƒ€์ผ์„ ์•ˆ์ •์ ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค.1 Introduction 1 1.1 Background 1 1.2 Scope of thesis 3 2 Neural Speech Synthesis System 7 2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7 2.2 Overview of End-to-end Speech Synthesis System 9 2.3 Tacotron2 10 2.4 Attention Mechanism 12 2.4.1 Location Sensitive Attention 12 2.4.2 Forward Attention 13 2.4.3 Dynamic Convolution Attention 14 3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17 3.1 Introduction 17 3.2 Background 19 3.2.1 Variational Autoencoder 19 3.2.2 Variational Recurrent Neural Network 20 3.3 Speech Synthesis Using AdVRNN 22 3.3.1 AdVRNN based Acoustic Modeling 23 3.3.2 Training Procedure 24 3.4 Experiments 25 3.4.1 Objective performance evaluation 28 3.4.2 Subjective performance evaluation 29 3.5 Summary 29 4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31 4.1 Introduction 31 4.2 Background 33 4.2.1 Mutual Information 33 4.2.2 Mutual Information Neural Estimator 34 4.2.3 Global Style Token 34 4.3 Style Token end-to-end speech synthesis using MINE 35 4.4 Experiments 36 4.5 Summary 38 5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45 5.1 Introduction 45 5.2 BACKGROUND 48 5.3 Memory Attention 49 5.4 Experiments 52 5.4.1 Experiments on Single Speaker Speech Synthesis 53 5.4.2 Experiments on Emotional Speech Synthesis 56 5.5 Summary 59 6 Selective Multi-attention for style-adaptive end-to-End Speech Syn-thesis 63 6.1 Introduction 63 6.2 BACKGROUND 65 6.3 Selective multi-attention model 66 6.4 EXPERIMENTS 67 6.4.1 Multi-speaker speech synthesis experiments 68 6.4.2 Experiments on Emotional Speech Synthesis 73 6.5 Summary 77 7 Conclusions 79 Bibliography 83 ์š”์•ฝ 93 ๊ฐ์‚ฌ์˜ ๊ธ€ 95Docto
    • โ€ฆ