483 research outputs found
Paralinguistic Privacy Protection at the Edge
Voice user interfaces and digital assistants are rapidly entering our lives
and becoming singular touch points spanning our devices. These always-on
services capture and transmit our audio data to powerful cloud services for
further processing and subsequent actions. Our voices and raw audio signals
collected through these devices contain a host of sensitive paralinguistic
information that is transmitted to service providers regardless of deliberate
or false triggers. As our emotional patterns and sensitive attributes such as
identity, gender, and mental well-being are easily inferred using deep acoustic
models, using these services exposes us to a new generation of privacy risks.
One approach to mitigating the risk of paralinguistic privacy breaches is
to combine cloud-based processing with privacy-preserving, on-device learning
and filtering of paralinguistic information before transmitting
voice data. In this paper, we introduce EDGY, a configurable, lightweight,
disentangled representation learning framework that transforms and filters
high-dimensional voice data to identify and contain sensitive attributes at the
edge prior to offloading to the cloud. We evaluate EDGY's on-device performance
and explore optimization techniques, including model quantization and knowledge
distillation, to enable private, accurate and efficient representation learning
on resource-constrained devices. Our results show that EDGY runs in tens of
milliseconds with 0.2% relative improvement in ABX score or minimal performance
penalties in learning linguistic representations from raw voice signals, using
a CPU and a single-core ARM processor without specialized hardware.
Comment: 14 pages, 7 figures. arXiv admin note: text overlap with arXiv:2007.1506
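To make the on-device optimization concrete, here is a minimal, hypothetical sketch (ours, not EDGY's code) of the post-training dynamic quantization the paper evaluates, applied to a small stand-in encoder; the module names, dimensions, and input shapes are all illustrative assumptions, and knowledge distillation is not shown.

```python
# Hypothetical sketch, not EDGY's implementation: dynamically quantize a
# small on-device encoder so voice features can be filtered at the edge.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for an on-device representation encoder (names are ours)."""
    def __init__(self, n_mels=40, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):          # x: (batch, frames, n_mels)
        h, _ = self.rnn(x)
        return self.proj(h)        # frame-level representations

model = TinyEncoder().eval()
# Dynamic quantization: weights stored as int8, activations quantized on the fly.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.GRU, nn.Linear}, dtype=torch.qint8)

feats = torch.randn(1, 100, 40)    # ~1 s of log-mel frames (dummy input)
with torch.no_grad():
    reps = qmodel(feats)           # representations to filter before offloading
print(reps.shape)                  # torch.Size([1, 100, 128])
```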
Contrastive Speaker Embedding With Sequential Disentanglement
Contrastive speaker embedding assumes that the contrast between the positive
and negative pairs of speech segments is attributed to speaker identity only.
However, this assumption is incorrect because speech signals contain not only
speaker identity but also linguistic content. In this paper, we propose a
contrastive learning framework with sequential disentanglement to remove
linguistic content by incorporating a disentangled sequential variational
autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to
disentangle speaker factors from content factors in an embedding space so that
only the speaker factors are used for constructing a contrastive loss
objective. Because content factors have been removed from contrastive
learning, the resulting speaker embeddings will be content-invariant.
Experimental results on VoxCeleb1-test show that the proposed method
consistently outperforms SimCLR, suggesting that sequential disentanglement
is beneficial for learning speaker-discriminative embeddings.
Comment: Submitted to ICASSP 202
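As an illustration of the idea, the following is a hedged sketch (our construction, not the authors' code) of a SimCLR-style NT-Xent loss computed only on the disentangled speaker factors; the batch pairing, dimensions, and temperature are assumptions.

```python
# Sketch: NT-Xent applied to the "speaker" half of a disentangled embedding,
# mirroring the idea of contrasting speaker factors once content is removed.
import torch
import torch.nn.functional as F

def nt_xent_on_speaker_factors(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) speaker factors from two segments per speaker."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                  # (2B, dim)
    sim = z @ z.t() / temperature                   # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # The positive for segment i is its counterpart at i + n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example: 8 speakers, 64-dim speaker factors (shapes are illustrative).
loss = nt_xent_on_speaker_factors(torch.randn(8, 64), torch.randn(8, 64))
```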
Exploring Disentanglement with Multilingual and Monolingual VQ-VAE
This work examines the content and usefulness of disentangled phone and
speaker representations from two separately trained VQ-VAE systems: one trained
on multilingual data and another trained on monolingual data. We explore the
multi- and monolingual models using four small proof-of-concept tasks:
copy-synthesis, voice transformation, linguistic code-switching, and
content-based privacy masking. From these tasks, we reflect on how disentangled
phone and speaker representations can be used to manipulate speech in a
meaningful way. Our experiments demonstrate that the VQ representations are
suitable for these tasks, including creating new voices by mixing speaker
representations together. We also present our novel technique to conceal the
content of targeted words within an utterance by manipulating phone VQ codes,
while retaining speaker identity and intelligibility of surrounding words.
Finally, we discuss recommendations for further increasing the viability of
disentangled representations.
Comment: Accepted to Speech Synthesis Workshop 2021 (SSW11)
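A minimal sketch of the masking idea (our own illustration, not the paper's implementation): overwrite the phone VQ indices of the targeted frames while leaving the speaker representation untouched, so a decoder would resynthesize the utterance with the targeted words obscured. The codebook size, frame span, and filler index below are assumptions.

```python
# Illustrative content-based privacy masking over per-frame phone VQ codes.
import numpy as np

def mask_phone_codes(phone_codes, start, end, filler_code):
    """phone_codes: (T,) int array of per-frame phone VQ indices.
    Frames in [start, end) are overwritten with a filler VQ code."""
    masked = phone_codes.copy()
    masked[start:end] = filler_code
    return masked

phone_codes = np.random.randint(0, 256, size=200)   # dummy 256-entry phone VQ
masked = mask_phone_codes(phone_codes, start=80, end=120, filler_code=0)
# Speaker codes are left unchanged, so speaker identity is retained when
# the (phone, speaker) codes are decoded back to a waveform.
```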
Gated Recurrent Attention and Multivariate Information Minimization for Controllable Speech Synthesis
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Information Engineering, August 2021.
Speech is one of the most useful interfaces, enabling a person to communicate with others at a distance while using their hands for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on human-machine speech interfaces is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control remains a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor.
First, we introduce conventional style control techniques proposed for speech synthesis systems: methods for controlling speaker identity, emotion, accent, and prosody in both statistical parametric and deep learning-based speech synthesis systems.
We then propose gated recurrent attention (GRA), a novel attention mechanism with controllable gated recurrence. GRA is suitable for learning various styles because its two gates can control the recurrent attention state according to the output location. In our experiments, GRA was more effective at transferring unseen styles, implying that it generalizes better than conventional techniques.
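The following is a minimal sketch of one plausible reading of GRA, assuming a GRU-style two-gate update over the attention state; this is our illustration of the general mechanism, not the thesis implementation, and all names and dimensions are assumptions.

```python
# Hedged sketch: a gated recurrent update of an attention state, where two
# gates decide how much of the previous state is kept at each decoder step.
import torch
import torch.nn as nn

class GatedRecurrentAttentionState(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update_gate = nn.Linear(2 * dim, dim)
        self.reset_gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, prev_state, context):
        """prev_state, context: (batch, dim); returns the new attention state."""
        x = torch.cat([prev_state, context], dim=-1)
        z = torch.sigmoid(self.update_gate(x))   # how much to update
        r = torch.sigmoid(self.reset_gate(x))    # how much history to keep
        h = torch.tanh(self.candidate(torch.cat([r * prev_state, context], -1)))
        return (1 - z) * prev_state + z * h

cell = GatedRecurrentAttentionState(dim=128)
state = cell(torch.randn(2, 128), torch.randn(2, 128))  # one decoder step
```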
Finally, we propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual-information upper-bound terms. Since the upper-bound estimate converges early in training and stays near zero, the auxiliary loss causes little performance degradation. The proposed technique is applied to train a text-to-speech synthesizer on multi-lingual, multi-speaker, multi-style corpora. Subjective listening tests with 15 speech experts validate that the proposed method improves the synthesizer's controllability as well as its quality.
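As a rough illustration of the objective (our own construction based on the CLUB estimator the thesis builds on, not the thesis code), the interactive dependency can be upper-bounded by a sum of pairwise CLUB terms over the factor embeddings; the factor names, dimensions, and network sizes below are assumptions.

```python
# Hedged sketch: sum of pairwise CLUB mutual-information upper bounds over
# three style factors (e.g., language, speaker, style) to disentangle them.
import torch
import torch.nn as nn

class CLUB(nn.Module):
    """Variational MI upper bound between x and y via q(y|x) = N(mu(x), var(x))."""
    def __init__(self, x_dim, y_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim))

    def forward(self, x, y):                       # MI upper-bound estimate
        mu, logvar = self.mu(x), self.logvar(x)
        pos = (-(y - mu) ** 2 / logvar.exp()).sum(1)              # matched pairs
        neg = (-(y.roll(1, 0) - mu) ** 2 / logvar.exp()).sum(1)   # shuffled pairs
        return (pos - neg).mean() / 2

# Three 16-dim factor embeddings for a batch of 32 utterances (dummy data).
lang, spk, sty = (torch.randn(32, 16) for _ in range(3))
clubs = [CLUB(16, 16) for _ in range(3)]
mi_loss = clubs[0](lang, spk) + clubs[1](lang, sty) + clubs[2](spk, sty)
# In practice each q(y|x) is also fit with a likelihood loss (omitted here),
# and mi_loss is added, weighted, to the synthesizer's training objective.
```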
1 Introduction
1.1 Evolution of Speech Synthesis Technology
1.2 Attention-based Speech Synthesis Systems
1.2.1 Tacotron
1.2.2 Deep Convolutional TTS
1.3 Non-autoregressive Speech Synthesis Systems
1.3.1 Glow-TTS
1.3.2 SpeedySpeech
1.4 Outline of the thesis
2 Style Modeling Techniques for Speech Synthesis
2.1 Introduction
2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis
2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis
2.4 Summary
3 Gated Recurrent Attention for Multi-Style Speech Synthesis
3.1 Introduction
3.2 Related Works
3.2.1 Gated recurrent unit
3.2.2 Location-sensitive attention
3.3 Gated Recurrent Attention
3.4 Experiments and results
3.4.1 Tacotron2 with global style tokens
3.4.2 Decaying guided attention
3.4.3 Datasets and feature processing
3.4.4 Evaluation methods
3.4.5 Evaluation results
3.5 Guided attention and decaying guided attention
3.6 Summary
4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization
4.1 Introduction
4.2 Related Works
4.2.1 Disentanglement Studies for Speech Synthesis
4.2.2 Total Correlation and Mutual Information
4.2.3 CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information
4.3 Proposed method
4.4 Experiments and Results
4.4.1 Quality and Naturalness of Speech
4.4.2 Speaker and style similarity
4.5 Summary
5 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgements
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Self-supervised learning (SSL) for rich speech representations has achieved
empirical success in low-resource Automatic Speech Recognition (ASR) and other
speech processing tasks, reducing the need for large amounts of transcribed
speech and thus driving a growing demand for on-device ASR and
other speech processing. However, advanced speech SSL models have become
increasingly large, which contradicts the limited on-device resources. This gap
could be even more severe in multilingual/multitask scenarios that require
simultaneously recognizing multiple languages or executing multiple speech
processing tasks. Additionally, strongly overparameterized speech SSL models
tend to suffer from overfitting when finetuned on low-resource speech
corpora. This work aims to enhance the practical usage of speech SSL models,
achieving a win-win of enhanced efficiency and alleviated overfitting, via
our proposed S3-Router framework, which for the first time discovers that
simply discarding no more than 10% of model weights, by finetuning only the
model connections of speech SSL models rather than the weights themselves,
can achieve better accuracy than standard weight finetuning on downstream
speech processing tasks. More importantly, S3-Router can serve as an
all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient
multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique,
and (4) a new tool to quantitatively analyze the learned speech
representations. We believe S3-Router provides a new perspective on the
practical deployment of speech SSL models. Our codes are
available at: https://github.com/GATECH-EIC/S3-Router.
Comment: Accepted at NeurIPS 202
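Below is a heavily hedged sketch of the core idea as we read the abstract (finetuning connections instead of weights); it is our construction, not the S3-Router code: the pretrained weights stay frozen while a learnable score per weight is thresholded into a binary mask via a straight-through estimator, discarding a fixed fraction (here 10%) of weights.

```python
# Sketch: learn which connections to keep, with frozen weights.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f, sparsity=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02,
                                   requires_grad=False)    # frozen weights
        self.scores = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.sparsity = sparsity                           # fraction discarded

    def forward(self, x):
        k = int(self.scores.numel() * self.sparsity)
        thresh = self.scores.flatten().kthvalue(k).values  # k-th smallest score
        mask = (self.scores > thresh).float()              # drop lowest 10%
        # Straight-through estimator: forward uses the hard mask, while
        # gradients flow to the continuous scores.
        mask = mask + self.scores - self.scores.detach()
        return x @ (self.weight * mask).t()

layer = MaskedLinear(256, 256)
out = layer(torch.randn(4, 256))   # only `scores` receive gradients
```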
- β¦