
    Paralinguistic Privacy Protection at the Edge

    Voice user interfaces and digital assistants are rapidly entering our lives and becoming singular touch points spanning our devices. These always-on services capture and transmit our audio data to powerful cloud services for further processing and subsequent actions. Our voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information that is transmitted to service providers regardless of deliberate or false triggers. Because our emotional patterns and sensitive attributes such as identity, gender, and mental well-being are easily inferred using deep acoustic models, we face a new generation of privacy risks when using these services. One approach to mitigating the risk of paralinguistic privacy breaches is to combine cloud-based processing with privacy-preserving, on-device paralinguistic information learning and filtering before transmitting voice data. In this paper, we introduce EDGY, a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge, prior to offloading to the cloud. We evaluate EDGY's on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate, and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds with a 0.2% relative improvement in ABX score, or minimal performance penalties, in learning linguistic representations from raw voice signals, using a CPU and a single-core ARM processor without specialized hardware. Comment: 14 pages, 7 figures. arXiv admin note: text overlap with arXiv:2007.1506
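    As a rough illustration of the kind of on-device pipeline the abstract describes (a minimal sketch with assumed module and parameter names such as TinyEdgeEncoder, not the authors' EDGY implementation), post-training dynamic quantization in PyTorch is one standard way to shrink a small encoder so that only the filtered representation leaves the device:

```python
# Hypothetical sketch, not the authors' EDGY code: an on-device encoder maps
# audio features to a compact "linguistic" representation, and is quantized to
# int8 for CPU/ARM inference; only the filtered representation is offloaded.
import torch
import torch.nn as nn

class TinyEdgeEncoder(nn.Module):
    """Illustrative stand-in for the on-device representation model."""
    def __init__(self, n_mels: int = 40, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(mel_frames)   # (batch, time, dim)
        return self.proj(out)           # linguistic representation

encoder = TinyEdgeEncoder().eval()

# Dynamic quantization of Linear/GRU weights to int8, a standard PyTorch
# recipe for edge deployment; the paper's exact settings may differ.
quantized = torch.quantization.quantize_dynamic(
    encoder, {nn.Linear, nn.GRU}, dtype=torch.qint8
)

with torch.no_grad():
    mel = torch.randn(1, 100, 40)       # 1 utterance, 100 mel frames
    z = quantized(mel)                  # only z would be sent to the cloud
print(z.shape)
```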

    Contrastive Speaker Embedding With Sequential Disentanglement

    Contrastive speaker embedding assumes that the contrast between positive and negative pairs of speech segments is attributable to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement that removes linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE disentangles speaker factors from content factors in an embedding space so that only the speaker factors are used to construct the contrastive loss objective. Because content factors are removed from contrastive learning, the resulting speaker embeddings are content-invariant. Experimental results on VoxCeleb1-test show that the proposed method consistently outperforms SimCLR, suggesting that sequential disentanglement is beneficial for learning speaker-discriminative embeddings. Comment: Submitted to ICASSP 202
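    A minimal sketch of the underlying objective (implementation details assumed, not the authors' code): a SimCLR-style NT-Xent loss computed only on the speaker factors produced by the disentangling encoder, so content factors never enter the contrastive objective.

```python
import torch
import torch.nn.functional as F

def nt_xent_on_speaker_factors(z_spk_a: torch.Tensor,
                               z_spk_b: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """z_spk_a, z_spk_b: (batch, dim) speaker factors from two segments of the
    same speaker (the positive pairs); all other rows act as negatives."""
    z = F.normalize(torch.cat([z_spk_a, z_spk_b], dim=0), dim=1)   # (2B, dim)
    sim = z @ z.t() / temperature                                  # cosine sims
    n = z_spk_a.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self
    # the positive for sample i is i+n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage: in the proposed setup, z_spk_a / z_spk_b would come from the DSVAE's
# speaker branch rather than from raw utterance embeddings.
loss = nt_xent_on_speaker_factors(torch.randn(8, 192), torch.randn(8, 192))
```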

    Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

    This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking. From these tasks, we reflect on how disentangled phone and speaker representations can be used to manipulate speech in meaningful ways. Our experiments demonstrate that the VQ representations are suitable for these tasks, including creating new voices by mixing speaker representations together. We also present a novel technique for concealing the content of targeted words within an utterance by manipulating phone VQ codes, while retaining speaker identity and the intelligibility of surrounding words. Finally, we discuss recommendations for further increasing the viability of disentangled representations. Comment: Accepted to Speech Synthesis Workshop 2021 (SSW11)
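    Two of these manipulations can be sketched with toy data (illustrative only; the codebook size, variable names, and filler code are assumptions, not the paper's setup): mixing two speaker representations to create a new voice, and overwriting the phone VQ codes of a target word span before resynthesis.

```python
# Toy illustration of voice mixing and content-based privacy masking with
# discrete VQ codes; a decoder (not shown) would resynthesize audio from
# (masked_phone_codes, mixed_speaker).
import numpy as np

rng = np.random.default_rng(0)
phone_codes = rng.integers(0, 256, size=120)   # per-frame phone VQ indices
spk_a = rng.standard_normal(64)                # speaker representation A
spk_b = rng.standard_normal(64)                # speaker representation B

# (1) voice mixing: interpolate the speaker representations
mixed_speaker = 0.5 * spk_a + 0.5 * spk_b

# (2) content masking: replace the frames of a sensitive word with a
# neutral filler code so the original word cannot be reconstructed,
# while the surrounding frames (and speaker identity) are untouched
FILLER_CODE = 0                                # assumed "silence-like" code
word_span = slice(40, 55)                      # frames of the target word
masked_phone_codes = phone_codes.copy()
masked_phone_codes[word_span] = FILLER_CODE
```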

    Gated Recurrent Attention and Multivariate Information Minimization for Controllable Speech Synthesis

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8. μ²œμ„±μ€€. Speech is one of the most useful interfaces, enabling a person to communicate with distant others while keeping the hands free for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on human-machine speech interfaces is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning, the quality of synthesized speech has become similar to that of human speech, but natural style control remains a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor. First, conventional style control techniques proposed for speech synthesis systems are reviewed, covering control of speaker identity, emotion, accent, and prosody in both statistical parametric and deep learning-based speech synthesis systems. We then propose gated recurrent attention (GRA), a novel attention mechanism with controllable gated recurrence. GRA is suitable for learning various styles because two gates control the recurrent attention state corresponding to each output location. Experiments show that GRA is more effective at transferring unseen styles, implying that it generalizes better than conventional techniques. We also propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual-information upper-bound terms. Since the upper-bound estimate converges from the early stage of training, the auxiliary loss causes little performance degradation. The proposed technique is applied to train a text-to-speech synthesizer on multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests with 15 speech experts validate that the proposed method improves the synthesizer in terms of both quality and controllability.

    Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

    Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource automatic speech recognition (ASR) and other speech processing tasks, reducing the need for large amounts of transcribed speech and driving growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which conflicts with limited on-device resources. This gap can be even more severe in multilingual/multitask scenarios that require simultaneously recognizing multiple languages or executing multiple speech processing tasks. Additionally, strongly overparameterized speech SSL models tend to overfit when finetuned on low-resource speech corpora. This work aims to enhance the practical usage of speech SSL models towards a win-win in both efficiency and reduced overfitting via our proposed S^3-Router framework, which for the first time discovers that simply discarding no more than 10% of model weights by finetuning only the model connections of speech SSL models can achieve better accuracy than standard weight finetuning on downstream speech processing tasks. More importantly, S^3-Router can serve as an all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representations. We believe S^3-Router provides a new perspective on the practical deployment of speech SSL models. Our codes are available at: https://github.com/GATECH-EIC/S3-Router. Comment: Accepted at NeurIPS 202
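    The core mechanism, finetuning which connections are used rather than the weight values themselves, can be illustrated with a minimal toy sketch (assumptions throughout; this is not the released S3-Router code): the pretrained weights stay frozen, a learnable score per weight is trained with a straight-through estimator, and the lowest-scoring roughly 10% of connections are zeroed out.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Frozen weights; only per-connection scores are finetuned."""
    def __init__(self, linear: nn.Linear, sparsity: float = 0.1):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.bias = (nn.Parameter(linear.bias.detach(), requires_grad=False)
                     if linear.bias is not None else None)
        # learnable scores decide which connections stay "routed"
        self.scores = nn.Parameter(torch.randn_like(self.weight) * 0.01)
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = max(1, int(self.scores.numel() * self.sparsity))
        threshold = torch.kthvalue(self.scores.flatten(), k).values
        mask = (self.scores > threshold).float()          # drop ~sparsity of weights
        mask = mask + self.scores - self.scores.detach()  # straight-through gradient
        return nn.functional.linear(x, self.weight * mask, self.bias)

# Usage: wrap selected layers of a frozen speech SSL model and train only
# the scores (and the task head) on the downstream task.
layer = MaskedLinear(nn.Linear(768, 768), sparsity=0.1)
out = layer(torch.randn(4, 768))
```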