483 research outputs found
Paralinguistic Privacy Protection at the Edge
Voice user interfaces and digital assistants are rapidly entering our lives
and becoming singular touch points spanning our devices. These always-on
services capture and transmit our audio data to powerful cloud services for
further processing and subsequent actions. Our voices and raw audio signals
collected through these devices contain a host of sensitive paralinguistic
information that is transmitted to service providers regardless of deliberate
or false triggers. As our emotional patterns and sensitive attributes such as
identity, gender, and mental well-being are easily inferred using deep acoustic
models, using these services exposes us to a new generation of privacy risks.
One approach to mitigating the risk of paralinguistic privacy breaches is
to combine cloud-based processing with privacy-preserving, on-device learning
and filtering of paralinguistic information before transmitting
voice data. In this paper, we introduce EDGY, a configurable, lightweight,
disentangled representation learning framework that transforms and filters
high-dimensional voice data to identify and contain sensitive attributes at the
edge prior to offloading to the cloud. We evaluate EDGY's on-device performance
and explore optimization techniques, including model quantization and knowledge
distillation, to enable private, accurate and efficient representation learning
on resource-constrained devices. Our results show that EDGY runs in tens of
milliseconds with 0.2% relative improvement in ABX score or minimal performance
penalties in learning linguistic representations from raw voice signals, using
a CPU and a single-core ARM processor without specialized hardware.
Comment: 14 pages, 7 figures. arXiv admin note: text overlap with arXiv:2007.1506
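To make the on-device optimization concrete, here is a minimal, hypothetical sketch (ours, not EDGY's code) of the post-training dynamic quantization the paper evaluates, applied to a small stand-in encoder; the module names, dimensions, and input shapes are all illustrative assumptions, and knowledge distillation is not shown.

```python
# Hypothetical sketch, not EDGY's implementation: dynamically quantize a
# small on-device encoder so voice features can be filtered at the edge.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for an on-device representation encoder (names are ours)."""
    def __init__(self, n_mels=40, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):          # x: (batch, frames, n_mels)
        h, _ = self.rnn(x)
        return self.proj(h)        # frame-level representations

model = TinyEncoder().eval()
# Dynamic quantization: weights stored as int8, activations quantized on the fly.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.GRU, nn.Linear}, dtype=torch.qint8)

feats = torch.randn(1, 100, 40)    # ~1 s of log-mel frames (dummy input)
with torch.no_grad():
    reps = qmodel(feats)           # representations to filter before offloading
print(reps.shape)                  # torch.Size([1, 100, 128])
```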
Contrastive Speaker Embedding With Sequential Disentanglement
Contrastive speaker embedding assumes that the contrast between the positive
and negative pairs of speech segments is attributed to speaker identity only.
However, this assumption is incorrect because speech signals contain not only
speaker identity but also linguistic content. In this paper, we propose a
contrastive learning framework with sequential disentanglement to remove
linguistic content by incorporating a disentangled sequential variational
autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to
disentangle speaker factors from content factors in an embedding space so that
only the speaker factors are used for constructing a contrastive loss
objective. Because content factors have been removed from contrastive
learning, the resulting speaker embeddings will be content-invariant.
Experimental results on VoxCeleb1-test show that the proposed method
consistently outperforms SimCLR, suggesting that sequential disentanglement
is beneficial for learning speaker-discriminative embeddings.
Comment: Submitted to ICASSP 202
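As an illustration of the idea, the following is a hedged sketch (our construction, not the authors' code) of a SimCLR-style NT-Xent loss computed only on the disentangled speaker factors; the batch pairing, dimensions, and temperature are assumptions.

```python
# Sketch: NT-Xent applied to the "speaker" half of a disentangled embedding,
# mirroring the idea of contrasting speaker factors once content is removed.
import torch
import torch.nn.functional as F

def nt_xent_on_speaker_factors(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) speaker factors from two segments per speaker."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                  # (2B, dim)
    sim = z @ z.t() / temperature                   # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # The positive for segment i is its counterpart at i + n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example: 8 speakers, 64-dim speaker factors (shapes are illustrative).
loss = nt_xent_on_speaker_factors(torch.randn(8, 64), torch.randn(8, 64))
```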
Exploring Disentanglement with Multilingual and Monolingual VQ-VAE
This work examines the content and usefulness of disentangled phone and
speaker representations from two separately trained VQ-VAE systems: one trained
on multilingual data and another trained on monolingual data. We explore the
multi- and monolingual models using four small proof-of-concept tasks:
copy-synthesis, voice transformation, linguistic code-switching, and
content-based privacy masking. From these tasks, we reflect on how disentangled
phone and speaker representations can be used to manipulate speech in a
meaningful way. Our experiments demonstrate that the VQ representations are
suitable for these tasks, including creating new voices by mixing speaker
representations together. We also present our novel technique to conceal the
content of targeted words within an utterance by manipulating phone VQ codes,
while retaining speaker identity and intelligibility of surrounding words.
Finally, we discuss recommendations for further increasing the viability of
disentangled representations.
Comment: Accepted to Speech Synthesis Workshop 2021 (SSW11)
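A minimal sketch of the masking idea (our own illustration, not the paper's implementation): overwrite the phone VQ indices of the targeted frames while leaving the speaker representation untouched, so a decoder would resynthesize the utterance with the targeted words obscured. The codebook size, frame span, and filler index below are assumptions.

```python
# Illustrative content-based privacy masking over per-frame phone VQ codes.
import numpy as np

def mask_phone_codes(phone_codes, start, end, filler_code):
    """phone_codes: (T,) int array of per-frame phone VQ indices.
    Frames in [start, end) are overwritten with a filler VQ code."""
    masked = phone_codes.copy()
    masked[start:end] = filler_code
    return masked

phone_codes = np.random.randint(0, 256, size=200)   # dummy 256-entry phone VQ
masked = mask_phone_codes(phone_codes, start=80, end=120, filler_code=0)
# Speaker codes are left unchanged, so speaker identity is retained when
# the (phone, speaker) codes are decoded back to a waveform.
```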
Gated Recurrent Attention and Multivariate Information Minimization for Controllable Speech Synthesis
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Information Engineering, August 2021.
Speech is one of the most useful interfaces, enabling a person to communicate with others at a distance while using their hands for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on human-machine speech interfaces is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control remains a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor.
First, we introduce conventional style control techniques proposed for speech synthesis systems: methods for controlling speaker identity, emotion, accent, and prosody in both statistical parametric and deep learning-based speech synthesis systems.
We then propose gated recurrent attention (GRA), a novel attention mechanism with controllable gated recurrence. GRA is suitable for learning various styles because its two gates can control the recurrent attention state according to the output location. In our experiments, GRA was more effective at transferring unseen styles, implying that it generalizes better than conventional techniques.
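The following is a minimal sketch of one plausible reading of GRA, assuming a GRU-style two-gate update over the attention state; this is our illustration of the general mechanism, not the thesis implementation, and all names and dimensions are assumptions.

```python
# Hedged sketch: a gated recurrent update of an attention state, where two
# gates decide how much of the previous state is kept at each decoder step.
import torch
import torch.nn as nn

class GatedRecurrentAttentionState(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update_gate = nn.Linear(2 * dim, dim)
        self.reset_gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, prev_state, context):
        """prev_state, context: (batch, dim); returns the new attention state."""
        x = torch.cat([prev_state, context], dim=-1)
        z = torch.sigmoid(self.update_gate(x))   # how much to update
        r = torch.sigmoid(self.reset_gate(x))    # how much history to keep
        h = torch.tanh(self.candidate(torch.cat([r * prev_state, context], -1)))
        return (1 - z) * prev_state + z * h

cell = GatedRecurrentAttentionState(dim=128)
state = cell(torch.randn(2, 128), torch.randn(2, 128))  # one decoder step
```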
Finally, we propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual-information upper-bound terms. Since the upper-bound estimate converges early in training and stays near zero, the auxiliary loss causes little performance degradation. The proposed technique is applied to train a text-to-speech synthesizer on multi-lingual, multi-speaker, multi-style corpora. Subjective listening tests with 15 speech experts validate that the proposed method improves the synthesizer's controllability as well as its quality.
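As a rough illustration of the objective (our own construction based on the CLUB estimator the thesis builds on, not the thesis code), the interactive dependency can be upper-bounded by a sum of pairwise CLUB terms over the factor embeddings; the factor names, dimensions, and network sizes below are assumptions.

```python
# Hedged sketch: sum of pairwise CLUB mutual-information upper bounds over
# three style factors (e.g., language, speaker, style) to disentangle them.
import torch
import torch.nn as nn

class CLUB(nn.Module):
    """Variational MI upper bound between x and y via q(y|x) = N(mu(x), var(x))."""
    def __init__(self, x_dim, y_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim))

    def forward(self, x, y):                       # MI upper-bound estimate
        mu, logvar = self.mu(x), self.logvar(x)
        pos = (-(y - mu) ** 2 / logvar.exp()).sum(1)              # matched pairs
        neg = (-(y.roll(1, 0) - mu) ** 2 / logvar.exp()).sum(1)   # shuffled pairs
        return (pos - neg).mean() / 2

# Three 16-dim factor embeddings for a batch of 32 utterances (dummy data).
lang, spk, sty = (torch.randn(32, 16) for _ in range(3))
clubs = [CLUB(16, 16) for _ in range(3)]
mi_loss = clubs[0](lang, spk) + clubs[1](lang, sty) + clubs[2](spk, sty)
# In practice each q(y|x) is also fit with a likelihood loss (omitted here),
# and mi_loss is added, weighted, to the synthesizer's training objective.
```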
1 Introduction
1.1 Evolution of Speech Synthesis Technology
1.2 Attention-based Speech Synthesis Systems
1.2.1 Tacotron
1.2.2 Deep Convolutional TTS
1.3 Non-autoregressive Speech Synthesis Systems
1.3.1 Glow-TTS
1.3.2 SpeedySpeech
1.4 Outline of the thesis
2 Style Modeling Techniques for Speech Synthesis
2.1 Introduction
2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis
2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis
2.4 Summary
3 Gated Recurrent Attention for Multi-Style Speech Synthesis
3.1 Introduction
3.2 Related Works
3.2.1 Gated recurrent unit
3.2.2 Location-sensitive attention
3.3 Gated Recurrent Attention
3.4 Experiments and results
3.4.1 Tacotron2 with global style tokens
3.4.2 Decaying guided attention
3.4.3 Datasets and feature processing
3.4.4 Evaluation methods
3.4.5 Evaluation results
3.5 Guided attention and decaying guided attention
3.6 Summary
4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization
4.1 Introduction
4.2 Related Works
4.2.1 Disentanglement Studies for Speech Synthesis
4.2.2 Total Correlation and Mutual Information
4.2.3 CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information
4.3 Proposed method
4.4 Experiments and Results
4.4.1 Quality and Naturalness of Speech
4.4.2 Speaker and style similarity
4.5 Summary
5 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgements
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Self-supervised learning (SSL) for rich speech representations has achieved
empirical success in low-resource Automatic Speech Recognition (ASR) and other
speech processing tasks, reducing the need for large amounts of transcribed
speech and thus driving a growing demand for on-device ASR and
other speech processing. However, advanced speech SSL models have become
increasingly large, which contradicts the limited on-device resources. This gap
could be even more severe in multilingual/multitask scenarios that require
simultaneously recognizing multiple languages or executing multiple speech
processing tasks. Additionally, strongly overparameterized speech SSL models
tend to suffer from overfitting when finetuned on low-resource speech
corpora. This work aims to enhance the practical usage of speech SSL models,
achieving a win-win of enhanced efficiency and alleviated overfitting, via
our proposed S3-Router framework, which for the first time discovers that
simply discarding no more than 10% of model weights, by finetuning only the
model connections of speech SSL models rather than the weights themselves,
can achieve better accuracy than standard weight finetuning on downstream
speech processing tasks. More importantly, S3-Router can serve as an
all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient
multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique,
and (4) a new tool to quantitatively analyze the learned speech
representations. We believe S3-Router provides a new perspective on the
practical deployment of speech SSL models. Our codes are
available at: https://github.com/GATECH-EIC/S3-Router.
Comment: Accepted at NeurIPS 202
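Below is a heavily hedged sketch of the core idea as we read the abstract (finetuning connections instead of weights); it is our construction, not the S3-Router code: the pretrained weights stay frozen while a learnable score per weight is thresholded into a binary mask via a straight-through estimator, discarding a fixed fraction (here 10%) of weights.

```python
# Sketch: learn which connections to keep, with frozen weights.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f, sparsity=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02,
                                   requires_grad=False)    # frozen weights
        self.scores = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.sparsity = sparsity                           # fraction discarded

    def forward(self, x):
        k = int(self.scores.numel() * self.sparsity)
        thresh = self.scores.flatten().kthvalue(k).values  # k-th smallest score
        mask = (self.scores > thresh).float()              # drop lowest 10%
        # Straight-through estimator: forward uses the hard mask, while
        # gradients flow to the continuous scores.
        mask = mask + self.scores - self.scores.detach()
        return x @ (self.weight * mask).t()

layer = MaskedLinear(256, 256)
out = layer(torch.randn(4, 256))   # only `scores` receive gradients
```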
- β¦