1,829 research outputs found
DNN-based Acoustic Modeling for Robust Speech Recognition
Ph.D. thesis, Seoul National University, Department of Electrical and Computer Engineering, February 2019.
In this thesis, we propose three acoustic modeling techniques for robust automatic speech recognition (ASR). First, we propose a DNN-based acoustic modeling technique that makes the best use of the inherent noise robustness of DNNs through auxiliary feature vectors. With this technique, the DNN smoothly learns the complicated relationship among the noisy speech, the clean speech, the noise estimate, and the phonetic targets. The proposed method outperformed noise-aware training (NAT), the conventional auxiliary-feature-based model adaptation technique, on the Aurora-5 DB.
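The auxiliary-feature idea behind NAT can be sketched in a few lines: estimate the noise once per utterance and append that estimate to every input frame. This is a minimal illustration, not the thesis code; the frame dimensions and the noise-only window are made-up values.

```python
import numpy as np

def add_noise_aware_features(frames, noise_frames):
    """NAT-style input augmentation: estimate the noise once per utterance
    and append that estimate to every frame, so the acoustic-model DNN can
    condition on the noise environment.

    frames:       (T, D) noisy feature frames
    noise_frames: (N, D) frames assumed to be noise-only (e.g. the first
                  few frames of the utterance; an assumption of this sketch)
    returns:      (T, 2*D) augmented input
    """
    noise_est = noise_frames.mean(axis=0)            # fixed per-utterance estimate
    aux = np.tile(noise_est, (frames.shape[0], 1))   # repeat for every frame
    return np.concatenate([frames, aux], axis=1)

# Toy usage: 100 frames of 40-dim features, noise taken from the first 10 frames.
x = np.random.default_rng(0).standard_normal((100, 40))
x_nat = add_noise_aware_features(x, x[:10])
```

The augmented frames are then fed to the DNN in place of the plain features.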
The second method is a multi-channel feature enhancement technique. In the typical multi-channel speech recognition scenario, an enhanced single-source speech signal is extracted from the multiple inputs using beamforming, a conventional signal-processing technique, and recognition is performed by feeding that signal into the acoustic model. We propose a multi-channel feature enhancement DNN that combines the delay-and-sum (DS) beamformer, one of the most basic conventional beamforming techniques, with a DNN trained jointly through intermediate feature vectors. Experiments on the multichannel wall street journal audio visual (MC-WSJ-AV) corpus showed that the proposed method outperformed the conventional multi-channel feature enhancement techniques.
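A minimal delay-and-sum sketch, assuming the per-channel steering delays are already known (real systems estimate them, e.g. with GCC-PHAT):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming: time-align each microphone channel by
    its steering delay (in samples), then average. The circular shift
    (np.roll) keeps the sketch short; real code would pad instead of
    wrapping, and the delays would come from TDOA estimation, omitted here."""
    n = min(len(c) for c in channels)
    aligned = [np.roll(c[:n], -d) for c, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

Averaging the aligned channels reinforces the coherent speech component while uncorrelated noise partially cancels.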
Finally, an uncertainty-aware training (UAT) technique is proposed. Most of the existing DNN-based techniques, including those introduced above, use deterministic point estimates of their targets (e.g., clean features and acoustic model parameters). This raises an uncertainty, or reliability, problem with the estimates. To overcome this issue, UAT employs a modified structure of the variational autoencoder (VAE), a neural network model that learns and performs stochastic variational inference (VIF). UAT models the robust latent variables that mediate the mapping between the noisy observed features and the phonetic targets, using the distributive information of the clean feature estimates. The latent variables are trained according to the maximum-likelihood criterion derived from an uncertainty decoding (UD) framework optimized for deep-learning-based acoustic models. The proposed technique outperforms the conventional DNN-based techniques on the Aurora-4 and CHiME-4 databases.
Abstract i
Contents iv
List of Figures ix
List of Tables xiii
1 Introduction 1
2 Background 9
2.1 Deep Neural Networks 9
2.2 Experimental Database 12
2.2.1 Aurora-4 DB 13
2.2.2 Aurora-5 DB 16
2.2.3 MC-WSJ-AV DB 18
2.2.4 CHiME-4 DB 20
3 Two-stage Noise-aware Training for Environment-robust Speech Recognition 25
3.1 Introduction 25
3.2 Noise-aware Training 28
3.3 Two-stage NAT 31
3.3.1 Lower DNN 33
3.3.2 Upper DNN 35
3.3.3 Joint Training 35
3.4 Experiments 36
3.4.1 GMM-HMM System 37
3.4.2 Training and Structures of DNN-based Techniques 37
3.4.3 Performance Evaluation 40
3.5 Summary 42
4 DNN-based Feature Enhancement for Robust Multichannel Speech Recognition 45
4.1 Introduction 45
4.2 Observation Model in Multi-Channel Reverberant Noisy Environment 49
4.3 Proposed Approach 50
4.3.1 Lower DNN 53
4.3.2 Upper DNN and Joint Training 54
4.4 Experiments 55
4.4.1 Recognition System and Feature Extraction 56
4.4.2 Training and Structures of DNN-based Techniques 58
4.4.3 Dropout 61
4.4.4 Performance Evaluation 62
4.5 Summary 65
5 Uncertainty-aware Training for DNN-HMM System using Variational Inference 67
5.1 Introduction 67
5.2 Uncertainty Decoding for Noise Robustness 72
5.3 Variational Autoencoder 77
5.4 VIF-based Uncertainty-aware Training 83
5.4.1 Clean Uncertainty Network 91
5.4.2 Environment Uncertainty Network 93
5.4.3 Prediction Network and Joint Training 95
5.5 Experiments 96
5.5.1 Experimental Setup: Feature Extraction and ASR System 96
5.5.2 Network Structures 98
5.5.3 Effects of CUN on the Noise Robustness 104
5.5.4 Uncertainty Representation in Different SNR Condition 105
5.5.5 Result of Speech Recognition 112
5.5.6 Result of Speech Recognition with LSTM-HMM 114
5.6 Summary 120
6 Conclusions 127
Bibliography 131
Abstract (in Korean) 145
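The VAE machinery underlying the UAT technique in the abstract above, an encoder that outputs a distribution instead of a point estimate, the reparameterization trick, and a KL regularizer, can be sketched in a few lines. This toy uses random linear layers and made-up dimensions, not the thesis's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Toy VAE-style encoder: map a noisy feature frame to the mean and
    log-variance of a Gaussian latent, i.e. a distribution rather than a
    deterministic point estimate."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar):
    """Reparameterization trick: sample z ~ N(mu, sigma^2) as
    z = mu + sigma * eps so gradients flow through mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ): the regularizer in the VAE
    objective, which keeps the latent's uncertainty explicit."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# Toy forward pass: 40-dim input frame, 16-dim latent, random linear layers.
x = rng.standard_normal(40)
W_mu = 0.1 * rng.standard_normal((40, 16))
W_logvar = 0.1 * rng.standard_normal((40, 16))
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar)
kl = kl_to_standard_normal(mu, logvar)
```

The sampled latent z is what a downstream prediction network would consume in place of a single deterministic clean-feature estimate.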
CFAD: A Chinese Dataset for Fake Audio Detection
Fake audio detection is a growing concern, and some relevant datasets have
been designed for research. However, there is no standard public Chinese
dataset under complex conditions. In this paper, we aim to fill this gap and
design a Chinese fake audio detection dataset (CFAD) for studying more
generalized detection methods. Twelve mainstream speech-generation techniques
are used to generate fake audio. To simulate real-life scenarios, three
noise datasets are selected for noise adding at five different signal-to-noise
ratios, and six codecs are considered for audio transcoding (format
conversion). The CFAD dataset can be used not only for fake audio detection but
also for identifying the algorithms behind fake utterances, for audio forensics.
Baseline results are presented with analysis. The results show that fake audio
detection with generalization ability remains challenging. The CFAD dataset is
publicly available at: https://zenodo.org/record/8122764.
Comment: FAD renamed as CFAD
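The SNR-controlled noise adding described above is a standard recipe; a sketch follows. The scaling rule is the generic one, not taken from the CFAD paper's code.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested signal-to-noise
    ratio in dB, i.e. 10*log10(P_speech / P_noise_scaled) == snr_db."""
    n = min(len(speech), len(noise))
    speech, noise = speech[:n], noise[:n]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Toy usage: a 440 Hz tone mixed with white noise at 10 dB SNR.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.linspace(0, 1, sr))
white = np.random.default_rng(0).standard_normal(sr)
noisy = mix_at_snr(tone, white, 10.0)
```

Running this at each of the five target SNRs, with noise drawn from different noise datasets, reproduces the kind of condition grid the abstract describes.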
Dictionary Attacks on Speaker Verification
In this paper, we propose dictionary attacks against speaker verification: a novel attack vector that aims to match a large fraction of the speaker population by chance. We introduce a generic formulation of the attack that can be used with various speech representations and threat models. The attacker uses adversarial optimization to maximize the raw similarity of speaker embeddings between a seed speech sample and a proxy population. The resulting master voice successfully matches a non-trivial fraction of people in an unknown population. Adversarial waveforms obtained with our approach can match on average 69% of females and 38% of males enrolled in the target system at a strict decision threshold calibrated to yield a false alarm rate of 1%. By using the attack with a black-box voice cloning system, we obtain master voices that are effective in the most challenging conditions and transferable between speaker encoders. We also show that, combined with multiple attempts, this attack raises even more serious concerns about the security of these systems.
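The core adversarial optimization can be sketched with a stand-in linear encoder. Everything here (the encoder, the dimensions, the normalized step rule) is a hypothetical toy, not the paper's waveform-level attack on a neural speaker encoder.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_cos(x, W, enrolled):
    """Average cosine similarity between the embedding of x and a set of
    enrolled speaker embeddings."""
    a = W @ x
    na = np.linalg.norm(a)
    return float(np.mean([a @ b / (na * np.linalg.norm(b)) for b in enrolled]))

def master_voice_attack(x0, W, enrolled, steps=300, lr=0.02):
    """Gradient ascent on the seed sample x0 to maximize the MEAN cosine
    similarity between its embedding and a proxy population, yielding a
    'master voice' that matches many speakers at once."""
    x = x0.copy()
    for _ in range(steps):
        a = W @ x
        na = np.linalg.norm(a)
        grad_a = np.zeros_like(a)
        for b in enrolled:
            nb = np.linalg.norm(b)
            cos = a @ b / (na * nb)
            grad_a += b / (na * nb) - cos * a / na ** 2  # d cos(a,b) / d a
        g = W.T @ (grad_a / len(enrolled))               # chain rule through W
        x = x + lr * g / (np.linalg.norm(g) + 1e-12)     # normalized ascent step
    return x

# Toy setup: 32-dim "waveform", 8-dim embeddings, 20-speaker proxy population.
W = rng.standard_normal((8, 32)) * 0.3
enrolled = [rng.standard_normal(8) for _ in range(20)]
x0 = rng.standard_normal(32)
x_adv = master_voice_attack(x0, W, enrolled)
```

In the paper's setting the ascent runs over an audio waveform through a neural encoder; the objective, raw embedding similarity averaged over a proxy population, is the same.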
Domain Generalization in Machine Learning Models for Wireless Communications: Concepts, State-of-the-Art, and Open Issues
Data-driven machine learning (ML) is promoted as one potential technology to
be used in next-generation wireless systems. This has led to a large body of
research work that applies ML techniques to solve problems in different layers
of the wireless transmission link. However, most of these applications rely on
supervised learning, which assumes that the source (training) and target (test)
data are independent and identically distributed (i.i.d.). This assumption is
often violated in the real world due to domain or distribution shifts between
the source and the target data. Thus, it is important to ensure that these
algorithms generalize to out-of-distribution (OOD) data. In this context,
domain generalization (DG) tackles the OOD-related issues by learning models on
different and distinct source domains/datasets with generalization capabilities
to unseen new domains without additional fine-tuning. Motivated by the
importance of DG requirements for wireless applications, we present a
comprehensive overview of the recent developments in DG and the different
sources of domain shift. We also summarize the existing DG methods, review
their applications in selected wireless communication problems, and conclude
with insights and open questions.
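The DG protocol implied here, train on pooled source domains and evaluate on a domain never seen in training, is commonly run as leave-one-domain-out. A toy sketch follows; the domain names and the "model" (a simple mean) are made up for illustration.

```python
def leave_one_domain_out(domains, train_fn, eval_fn):
    """Leave-one-domain-out protocol: for each domain, train on all OTHER
    domains pooled together, then evaluate on the held-out domain, which
    the model never sees and is not fine-tuned on."""
    scores = {}
    for held_out in domains:
        pooled = [sample for name, data in domains.items()
                  if name != held_out for sample in data]
        model = train_fn(pooled)
        scores[held_out] = eval_fn(model, domains[held_out])
    return scores

# Toy run: the "model" is just the training mean, and the score is the gap
# between that mean and the held-out domain's mean (a stand-in for OOD error).
domains = {"urban": [1.0, 1.2], "rural": [2.0, 2.2], "indoor": [3.0, 3.2]}
scores = leave_one_domain_out(
    domains,
    train_fn=lambda xs: sum(xs) / len(xs),
    eval_fn=lambda m, xs: abs(m - sum(xs) / len(xs)),
)
```

The per-domain scores make the effect of domain shift explicit: domains far from the pooled training distribution score worst.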
UNSUPERVISED DOMAIN ADAPTATION FOR SPEAKER VERIFICATION IN THE WILD
Performance of automatic speaker verification (ASV) systems is very sensitive
to mismatch between training (source) and testing (target) domains. The
best way to address domain mismatch is to perform matched condition training:
gather sufficient labeled samples from the target domain and use them in
training. However, in many cases this is too expensive or impractical. Usually,
gaining access to unlabeled target domain data, e.g., from open source online
media, and labeled data from other domains is more feasible. This work focuses
on making ASV systems robust to uncontrolled ("wild") conditions, with
the help of some unlabeled data acquired from such conditions.
Given acoustic features from both domains, we propose learning a mapping
function, a deep convolutional neural network (CNN) with an encoder-decoder
architecture, between the features of the two domains. We explore training the
network in two different scenarios: training on paired speech samples from
both domains and training on unpaired data. In the former case, where the
paired data is usually obtained via simulation, the CNN is treated as a
nonlinear regression function and is trained to minimize the L2 loss between
original and predicted features from the target domain. We provide empirical
evidence that this approach introduces distortions that affect verification
performance. To address this, we explore training the CNN using an adversarial
loss (along with L2), which makes the predicted features indistinguishable from
the original ones and thus improves verification performance.
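The combined objective in this paragraph can be sketched as follows. The trade-off weight `lam` and the non-saturating GAN form are assumptions; the thesis may weight or formulate the adversarial term differently.

```python
import numpy as np

def l2_loss(pred, target):
    """Regression term: MSE between mapped features and the paired
    target-domain features."""
    return np.mean((pred - target) ** 2)

def adversarial_generator_loss(disc_scores_on_pred):
    """Non-saturating GAN term for the mapping network: push the
    discriminator's probability that mapped features are genuine
    target-domain features toward 1."""
    eps = 1e-12
    return -np.mean(np.log(disc_scores_on_pred + eps))

def mapping_loss(pred, target, disc_scores_on_pred, lam=0.1):
    """Combined objective: the L2 term keeps the mapping faithful, while
    the adversarial term suppresses the distortions that pure-L2 training
    introduces. `lam` is a hypothetical trade-off weight."""
    return l2_loss(pred, target) + lam * adversarial_generator_loss(disc_scores_on_pred)
```

In training, the discriminator is updated in alternation to tell real target features from mapped ones, which is what gives the adversarial term its signal.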
The above framework using simulated paired data, though effective, cannot
be used to train the network on unpaired data obtained by independently
sampling speech from both domains. In this case, we first train a CNN using
an adversarial loss to map features from the target domain to the source
domain. We then map the predicted features back to the target domain using an
auxiliary network and minimize a cycle-consistency loss between the original
and reconstructed target features.
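The cycle-consistency term can be written down directly. The L1 penalty is an assumption, and the mappings here are toy callables standing in for the two CNNs.

```python
import numpy as np

def cycle_consistency_loss(x_target, map_t2s, map_s2t):
    """Map target-domain features to the source domain and back, then
    penalize the L1 reconstruction error against the originals. With no
    paired data, this is what keeps the adversarial mappings faithful."""
    reconstructed = map_s2t(map_t2s(x_target))
    return np.mean(np.abs(reconstructed - x_target))

# Toy mappings: g is the exact inverse of f, so the cycle loss vanishes.
f = lambda v: 2 * v + 1        # stands in for the target-to-source CNN
g = lambda v: (v - 1) / 2      # stands in for the source-to-target CNN
x = np.array([1.0, 2.0, 3.0])
```

When the two mappings are not inverses of each other, the loss is positive and pushes them toward a consistent round trip.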
Our unsupervised adaptation approach complements its supervised counterpart,
where adaptation is done using labeled data from both domains. We
focus on three domain mismatch scenarios: (1) sampling frequency mismatch
between the domains, (2) channel mismatch, and (3) robustness to far-field and
noisy speech acquired from wild conditions.
An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement
Recent speech enhancement research has shown that deep learning techniques are very effective in removing background noise. Many deep neural networks have been proposed, showing promising results for improving overall speech perception. The deep multilayer perceptron, convolutional neural networks, and the denoising autoencoder are well-established architectures for speech enhancement; however, choosing between different deep learning models has been mainly empirical. Consequently, a comparative analysis is needed between these three architecture types in order to show the factors affecting their performance. In this paper, this analysis is presented by comparing seven deep learning models that belong to these three categories. The comparison includes evaluating the performance in terms of the overall quality of the output speech using five objective evaluation metrics and a subjective evaluation with 23 listeners; the ability to deal with challenging noise conditions; generalization ability; complexity; and processing time. Further analysis is then provided using two different approaches. The first approach investigates how the performance is affected by changing network hyperparameters and the structure of the data, including the Lombard effect. The second approach interprets the results by visualizing the spectrogram of the output layer of all the investigated models and the spectrograms of the hidden layers of the convolutional neural network architecture. Finally, a general evaluation is performed for supervised deep learning-based speech enhancement using SWOC analysis to discuss the technique's Strengths, Weaknesses, Opportunities, and Challenges. The results of this paper contribute to the understanding of how different deep neural networks perform the speech enhancement task, highlight the strengths and weaknesses of each architecture, and provide recommendations for achieving better performance.
This work facilitates the development of better deep neural networks for speech enhancement in the future.
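The abstract does not name its five objective metrics; as one representative example of such a metric, scale-invariant signal-to-distortion ratio (shown here as an assumption, not necessarily one of the paper's five) can be computed as:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB: project the
    estimate onto the reference and compare the target energy to the
    residual energy. Higher is better."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(residual, residual))

# Toy check: a tone corrupted by white noise at roughly 17 dB SNR.
t = np.linspace(0, 1, 16000)
ref = np.sin(2 * np.pi * 440 * t)
noisy = ref + 0.1 * np.random.default_rng(0).standard_normal(16000)
```

Scoring each model's enhanced output against the clean reference with metrics like this is what makes the cross-architecture comparison objective.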
- โฆ