Learning to Behave Like Clean Speech: Dual-Branch Knowledge Distillation for Noise-Robust Fake Audio Detection
Most research in fake audio detection (FAD) focuses on improving performance
on standard noise-free datasets. In real-world conditions, however, noise
interference typically causes significant performance degradation in FAD
systems. To improve noise robustness, we propose a
dual-branch knowledge distillation fake audio detection (DKDFAD) method.
Specifically, a parallel data flow of the clean teacher branch and the noisy
student branch is designed, and interactive fusion and response-based
teacher-student paradigms are proposed to guide the training of noisy data from
the data distribution and decision-making perspectives. In the noise branch,
speech enhancement is first introduced for denoising, which reduces the
interference of strong noise. The proposed interactive fusion combines
denoised and noisy features to reduce the impact of speech distortion and to
stay consistent with the data distribution of the clean branch. The
teacher-student paradigm maps the student's decision space to the teacher's
decision space, making noisy speech behave as clean. In addition, a joint
training method is used to optimize the two branches to achieve global
optimality. Experimental results based on multiple datasets show that the
proposed method performs well in noisy environments and maintains performance
in cross-dataset experiments.
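As a rough illustration of the response-based teacher-student idea (not the paper's actual architecture or losses; the two-class logits, temperature, and values below are invented for the sketch), the student's softened posterior on noisy speech can be pulled toward the clean teacher's posterior with a KL-divergence loss:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits, softened by a distillation temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between softened posteriors."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical two-class FAD logits: the clean-branch teacher is confident
# the clip is spoofed; the noisy-branch student is pulled toward it.
teacher = [3.0, -1.0]   # [spoof, bona fide] logits on clean speech
student = [1.0, 0.5]    # logits on the noisy version of the same clip
loss = kd_loss(student, teacher)
```

Minimizing this term (jointly with the ordinary classification loss, as in the proposed joint training) is what makes "noisy speech behave as clean" at the decision level.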
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks whose purpose
is to extract, respectively, one target speech signal or several from a
mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The steady stream of newly proposed techniques
for extracting features and fusing multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
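As a minimal illustration of one family of fusion techniques such surveys cover (early fusion by frame-level concatenation; the frame rates, dimensions, and values below are invented for the sketch), acoustic and visual features can be rate-aligned and concatenated per frame:

```python
def early_fusion(audio_frames, video_frames, factor=4):
    """Concatenate per-frame acoustic and visual features (early fusion).

    Video typically runs at a lower frame rate than acoustic features
    (e.g., 25 fps video vs. 100 frames/s audio -> factor 4), so each
    video frame is repeated to align the two streams before fusion.
    """
    video_up = [f for f in video_frames for _ in range(factor)]
    if len(video_up) != len(audio_frames):
        raise ValueError("streams do not align at this upsampling factor")
    return [a + v for a, v in zip(audio_frames, video_up)]

# 8 audio frames (dim 2) and 2 video frames (dim 3) -> 8 fused frames (dim 5)
audio = [[0.1, 0.2]] * 8
video = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
fused = early_fusion(audio, video)
```

In practice the concatenated frames would feed a deep network; later-stage (intermediate or late) fusion instead merges learned representations or scores.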
Advanced Biometrics with Deep Learning
Biometrics, such as fingerprint, iris, face, hand print, hand vein, speech and gait recognition, have become commonplace as a means of identity management for various applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that address challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into 4 categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voice print, and others.
Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation
This paper explores the use of ASR-pretrained Conformers for speaker
verification, leveraging their strengths in modeling speech signals. We
introduce three strategies: (1) Transfer learning to initialize the speaker
embedding network, improving generalization and reducing overfitting. (2)
Knowledge distillation to train a more flexible speaker verification model,
incorporating frame-level ASR loss as an auxiliary task. (3) A lightweight
speaker adaptor for efficient feature conversion without altering the original
ASR Conformer, allowing parallel ASR and speaker verification. Experiments on
VoxCeleb show significant improvements: transfer learning yields a 0.48% EER,
knowledge distillation results in a 0.43% EER, and the speaker adaptor
approach, which adds just 4.92M parameters to a 130.94M-parameter model,
achieves a 0.57% EER. Overall, our methods effectively transfer ASR
capabilities to speaker verification tasks.
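The arithmetic behind calling the adaptor lightweight can be checked directly from the figures in the abstract:

```python
def overhead_pct(added_params_m, base_params_m):
    """Relative parameter overhead of an add-on module, in percent."""
    return 100.0 * added_params_m / base_params_m

# Figures from the abstract: a 4.92M-parameter speaker adaptor on top of a
# 130.94M-parameter ASR Conformer, i.e. under 4% extra parameters, while
# the unchanged Conformer can still serve ASR in parallel.
pct = overhead_pct(4.92, 130.94)
```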
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from one
language into another, and has demonstrated significant progress to date.
Despite the recent success, current S2ST models still suffer from distinct
degradation in noisy environments and fail to translate visual speech (i.e.,
the movement of lips and teeth). In this work, we present AV-TranSpeech, the
first audio-visual speech-to-speech (AV-S2ST) translation model without relying
on intermediate text. AV-TranSpeech complements the audio stream with visual
information to promote system robustness and opens up a host of practical
applications, such as dictation or dubbing archival films. To mitigate the
scarcity of parallel AV-S2ST data, we 1) explore self-supervised
pre-training with unlabeled audio-visual data to learn contextual
representation, and 2) introduce cross-modal distillation with S2ST models
trained on the audio-only corpus to further reduce the requirements of visual
data. Experimental results on two language pairs demonstrate that AV-TranSpeech
outperforms audio-only models under all settings regardless of the type of
noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation
yields an improvement of 7.6 BLEU on average compared with baselines. Audio
samples are available at https://AV-TranSpeech.github.io. (Accepted to ACL 2023.)
Deep Learning-Based Voice Print Extraction for Speaker Recognition Robust to Non-Speaker Factors
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Information Engineering, College of Engineering, February 2021. Advisor: Nam Soo Kim. Over recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, the deep learning-based methods are known to suffer severe performance degradation when dealing with speech samples from differing conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), the deep learning-based embedding systems are trained in a fully supervised manner, so they cannot make use of unlabeled datasets during training.
In this thesis, we propose a variational autoencoder (VAE)-based embedding framework that extracts a total variability embedding together with a representation of the uncertainty within the input speech distribution. Unlike conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the use of unlabeled datasets. Furthermore, to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques have shown strong performance on short-duration speaker verification, outperforming the conventional i-vector framework.
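The Kullback-Leibler regularizer mentioned above has a standard closed form when the VAE posterior is a diagonal Gaussian and the prior is standard normal; a generic sketch (not code from the thesis, and the example values are invented):

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the usual VAE regularizer:
    0.5 * sum( exp(logvar) + mu^2 - 1 - logvar ) over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

# A posterior that matches the prior contributes zero regularization...
kl_zero = kl_diag_gaussian([0.0, 0.0], [0.0, 0.0])
# ...while any deviation is penalized; squeezing the posterior toward the
# prior is the information loss that motivates the ALI-based alternative.
kl_pos = kl_diag_gaussian([1.0, -0.5], [0.2, -0.3])
```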
Additionally, we present a fully supervised training method for disentangling non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance-attribute (e.g., recording channel, emotion) embeddings and trains each to carry maximum information about its main task while remaining maximally uncertain about its sub-task. Since the proposed method does not require any heuristic training strategy, as in conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme has shown state-of-the-art performance on the RSR2015 Part 3 dataset and demonstrated its capability to efficiently disentangle recording-device and emotional information from the speaker embedding.
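A toy sketch of the joint objective described in the thesis abstract (retain main-task information, maximize sub-task uncertainty); the exact loss form, weighting, and all numbers below are assumptions for illustration, not the thesis's implementation:

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])

def entropy(probs):
    """Shannon entropy of a categorical prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def joint_factor_loss(main_probs, main_label, sub_probs, weight=1.0):
    """Low cross-entropy on the main task, plus a penalty for any gap
    between the sub-task prediction and maximum (uniform) entropy."""
    max_ent = math.log(len(sub_probs))  # entropy of the uniform distribution
    return (cross_entropy(main_probs, main_label)
            + weight * (max_ent - entropy(sub_probs)))

# Speaker branch: confident and correct on speaker ID, near-uniform on the
# channel sub-task -> low loss.
good = joint_factor_loss([0.9, 0.05, 0.05], 0, [0.5, 0.5])
# Same speaker confidence, but channel information leaks through -> higher loss.
leaky = joint_factor_loss([0.9, 0.05, 0.05], 0, [0.99, 0.01])
```

Because both terms are ordinary differentiable losses, no gradient reversal or adversarial inner loop is needed, which is the stability argument made above.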
1. Introduction
2. Conventional embedding techniques for speaker recognition
2.1. i-vector framework
2.2. Deep learning-based speaker embedding
2.2.1. Deep embedding network
2.2.2. Conventional disentanglement methods
3. Unsupervised learning of total variability embedding for speaker verification with random digit strings
3.1. Introduction
3.2. Variational autoencoder
3.3. Variational inference model for non-linear total variability embedding
3.3.1. Maximum likelihood training
3.3.2. Non-linear feature extraction and speaker verification
3.4. Experiments
3.4.1. Databases
3.4.2. Experimental setup
3.4.3. Effect of the duration on the latent variable
3.4.4. Experiments with VAEs
3.4.5. Feature-level fusion of i-vector and latent variable
3.4.6. Score-level fusion of i-vector and latent variable
3.5. Summary
4. Adversarially learned total variability embedding for speaker recognition with random digit strings
4.1. Introduction
4.2. Adversarially learned inference
4.3. Adversarially learned feature extraction
4.3.1. Maximum likelihood criterion
4.3.2. Adversarially learned inference for non-linear i-vector extraction
4.3.3. Relationship to the VAE-based feature extractor
4.4. Experiments
4.4.1. Databases
4.4.2. Experimental setup
4.4.3. Effect of the duration on the latent variable
4.4.4. Speaker verification and identification with different utterance-level features
4.5. Summary
5. Disentangled speaker and nuisance attribute embedding for robust speaker verification
5.1. Introduction
5.2. Joint factor embedding
5.2.1. Joint factor embedding network architecture
5.2.2. Training for joint factor embedding
5.3. Experiments
5.3.1. Channel disentanglement experiments
5.3.2. Emotion disentanglement
5.3.3. Noise disentanglement
5.4. Summary
6. Conclusion
Bibliography
Abstract (Korean)