Robust Image Recognition Based on a New Supervised Kernel Subspace Learning Method
Thesis defense date: 13 September 2019. Image recognition is a term for computer technologies that can recognize certain people, objects, or other targets through algorithms and machine learning. Face recognition is one of the most popular techniques for determining a person's identity. This study develops a new non-linear subspace learning method, named supervised kernel locality-based discriminant neighborhood embedding (SKLDNE), which performs classification by learning an optimal embedded subspace from a high-dimensional input space. In this approach, the nonlinear and complex variation of face images is effectively represented by nonlinear kernel mapping, while the local structure of same-class data and the discriminant information between distinct classes are simultaneously preserved to further improve classification performance. To evaluate the robustness of the proposed method, it was compared with several well-known pattern recognition methods in comprehensive experiments on six publicly accessible datasets. We focus in particular on face recognition; however, two non-face databases are also used to investigate the behaviour of the algorithm more thoroughly. Experimental results show that our method consistently outperforms its competitors across a wide range of dimensionalities on all datasets. SKLDNE reaches a 100 percent recognition rate at Tn = 17 on Sheffield, 9 on Yale, 8 on ORL, 7 on Finger Vein, and 11 on Finger Knuckle, while the results of the other methods are much lower. This demonstrates the robustness and effectiveness of the proposed method.
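The abstract does not spell out the SKLDNE objective, but supervised kernel subspace methods of this family typically build within-class and between-class affinity graphs over kernel-mapped samples and solve a generalized eigenproblem. The sketch below is a minimal, generic illustration of that recipe, not the paper's exact formulation; the affinity definitions, `gamma`, and `reg` are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def supervised_kernel_embedding(X, y, n_components=2, gamma=1.0, reg=1e-6):
    """Toy supervised kernel subspace learner (NOT the paper's exact SKLDNE).

    Builds a within-class affinity W and a between-class affinity B over the
    kernel-mapped samples, then solves a generalized eigenproblem so that the
    embedding pulls same-class samples together and pushes different-class
    samples apart.
    """
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    same = (y[:, None] == y[None, :]).astype(float)
    W = same - np.eye(n)             # within-class links (no self-loops)
    B = 1.0 - same                   # between-class links
    L_w = np.diag(W.sum(1)) - W      # graph Laplacians
    L_b = np.diag(B.sum(1)) - B
    A = K @ L_w @ K + reg * np.eye(n)
    M = K @ L_b @ K
    # Maximize between-class scatter relative to within-class scatter
    # in the kernel-induced feature space.
    vals, vecs = np.linalg.eig(np.linalg.solve(A, M))
    order = np.argsort(-vals.real)
    alpha = vecs[:, order[:n_components]].real
    return K @ alpha                 # embedded training samples
```

In a full pipeline the learned coefficients `alpha` would also embed test points via their kernel values against the training set, followed by a nearest-neighbour classifier.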
Face Recognition: Issues, Methods and Alternative Applications
Face recognition, as one of the most successful applications of image analysis, has recently gained significant attention, owing to the availability of feasible technologies, including mobile solutions. Research in automatic face recognition has been conducted since the 1960s, but the problem is still largely unsolved. The last decade has brought significant progress in this area thanks to advances in face modelling and analysis techniques. Although systems have been developed for face detection and tracking, reliable face recognition still poses a great challenge to computer vision and pattern recognition researchers. There are several reasons for the recent increased interest in face recognition, including rising public concern for security, the need for identity verification in the digital world, and face analysis and modelling techniques in multimedia data management and computer entertainment. In this chapter, we discuss face recognition processing, including major components such as face detection, tracking, alignment and feature extraction, and point out the technical challenges of building a face recognition system. We focus on the most successful solutions available so far. The final part of the chapter describes selected face recognition methods and applications and their potential use in areas not related to face recognition.
Style-Adaptive Speech Synthesis Techniques Using Deep Learning
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. Nam Soo Kim.
Neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis yields remarkable generated speech quality, problems remain, such as limited modeling power in neural statistical parametric speech synthesis systems, and limited style expressiveness and the lack of a robust attention model in end-to-end systems. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems.
In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training AdVRNN to overcome the oversmoothing problem. The experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques.
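As a rough illustration of the variational recurrent modeling involved (not AdVRNN itself, whose architecture and adversarial training loop are not given in this abstract), a single VRNN step predicts a latent prior from the recurrent state, an approximate posterior that also sees the current frame, and the KL term that enters the objective. All layer shapes and parameter names below are assumptions.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims.
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def vrnn_step(x_t, h, params, rng):
    """One illustrative VRNN time step (linear layers stand in for the real networks).

    The prior over the latent z_t is predicted from the recurrent state h alone,
    while the approximate posterior also conditions on the current frame x_t;
    their KL divergence is the regularization term of the VRNN objective.
    """
    Wp, Wq, Wh = params["Wp"], params["Wq"], params["Wh"]
    prior = Wp @ h                             # -> [mu_p, logvar_p]
    post = Wq @ np.concatenate([x_t, h])       # -> [mu_q, logvar_q]
    d = len(prior) // 2
    mu_p, logvar_p = prior[:d], prior[d:]
    mu_q, logvar_q = post[:d], post[d:]
    z = mu_q + np.exp(0.5 * logvar_q) * rng.standard_normal(d)  # reparameterized sample
    h_new = np.tanh(Wh @ np.concatenate([x_t, z, h]))           # recurrence update
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return h_new, z, kl
```

In AdVRNN the reconstruction-plus-KL objective is additionally combined with an adversarial loss to counter oversmoothed acoustic parameters.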
In the second approach, we propose a novel style modeling method employing the mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in the style embedding by adding a MINE loss term to the loss function. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token Tacotron.
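MINE estimates mutual information by maximizing the Donsker-Varadhan lower bound I(X;Y) >= E_joint[T] - log E_marginal[exp(T)] over a neural statistic T. The toy sketch below evaluates that bound with a fixed hand-picked statistic instead of a trained network, which is enough to show the estimator's shape; the data and the choice of T are assumptions, not the thesis's setup.

```python
import numpy as np

def dv_bound(T, xy_joint, xy_marginal):
    # Donsker-Varadhan lower bound on mutual information:
    #   I(X;Y) >= E_joint[T(x, y)] - log E_marginal[exp(T(x, y))]
    t_j = np.array([T(x, y) for x, y in xy_joint])
    t_m = np.array([T(x, y) for x, y in xy_marginal])
    return t_j.mean() - np.log(np.exp(t_m).mean())

rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = x + 0.3 * rng.standard_normal(n)          # strongly correlated pair
joint = list(zip(x, y))
marginal = list(zip(x, rng.permutation(y)))   # shuffling y breaks the dependence

# In MINE, T is a neural network trained to maximize the bound; here we plug in
# a fixed statistic T(x, y) = 0.2*x*y, which already yields a positive bound
# for this correlated pair.
mi_lower = dv_bound(lambda a, b: 0.2 * a * b, joint, marginal)
```

In the thesis's use of MINE, maximizing such a bound between the style embedding and the target style (and penalizing it against text information) shapes what the global style token encodes.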
In the third approach, we propose a novel attention method for end-to-end speech synthesis called memory attention, inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence-modeling power of the LSTM gating technique, memory attention obtains stable alignment from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability.
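The thesis's exact memory attention is not reproduced in this abstract; the following sketch only illustrates the general idea of LSTM-style gating applied to an attention alignment. The gate parameterization (`Wf`, `Wi`) and the way features are stacked are hypothetical.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def gated_attention_step(content_score, location_score, prev_align, Wf, Wi):
    """One attention step with LSTM-style gates (an illustrative sketch, not the
    thesis's exact memory attention).

    The candidate alignment comes from content- and location-based scores; sigmoid
    forget/input gates decide how much of the previous alignment to keep, which
    tends to damp skipping/repetition failures of plain content-based attention.
    """
    feats = np.stack([content_score, location_score, prev_align], axis=0)  # (3, T)
    f = 1.0 / (1.0 + np.exp(-(Wf @ feats).sum(0)))   # forget gate per encoder step
    i = 1.0 / (1.0 + np.exp(-(Wi @ feats).sum(0)))   # input gate per encoder step
    cand = softmax(content_score + location_score)   # candidate alignment
    align = f * prev_align + i * cand                # gated accumulation
    return align / align.sum()                       # renormalize to a distribution
```

Because the previous alignment is carried through a gate rather than discarded, the attention weights evolve smoothly along the encoder axis from one decoder step to the next.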
In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. The conventional single-attention model may limit the expressivity needed to represent the numerous alignment paths that depend on style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network. The multi-attention generates candidate alignments for the target style, and the selection network chooses the most appropriate attention among them. The experimental results show that selective multi-attention outperforms conventional single-attention techniques in multi-speaker speech synthesis and emotional speech synthesis.

Deep learning-based speech synthesis technology has been actively developed over the past several years. Various deep learning techniques have dramatically improved synthesized speech quality, but several problems still remain. Deep learning-based statistical parametric systems use a deterministic acoustic model, which limits their modeling power, and end-to-end models continue to raise issues of style expressiveness and robust attention. This thesis proposes new alternatives that resolve these drawbacks of existing deep learning-based speech synthesis systems.
As the first approach, we propose the adversarially trained variational recurrent neural network (AdVRNN) to enhance acoustic modeling in neural statistical parametric synthesis. AdVRNN applies a VRNN to speech synthesis so that the variability of speech can be modeled stochastically and in detail. In addition, adversarial learning is employed to mitigate the oversmoothing problem. We confirmed that the proposed algorithm improves performance over conventional recurrent neural network-based acoustic models.
As the second approach, we propose a new mutual information-based training method for style-adaptive end-to-end speech synthesis. In conventional global style token (GST)-based style speech synthesis, unsupervised learning makes it difficult to train the model effectively even when a desired target style is available. To resolve this, we propose training that maximizes the mutual information between the GST output and the target style embedding vector. The mutual information neural estimator (MINE) is introduced to apply mutual information to the loss function of the end-to-end model, and multi-speaker experiments confirmed that the target style can be learned more effectively than with the conventional GST method.
As the third approach, we propose memory attention, a robust attention for end-to-end speech synthesis. The gating technique of long short-term memory (LSTM) has shown strong performance in sequence modeling. Applying this technique to attention, we propose a method that minimizes attention errors such as skipping and repetition even for speech with diverse styles. We verified the performance of memory attention on single-speaker and emotional speech synthesis and confirmed that it produces more stable attention curves than conventional techniques.
As the last approach, we propose a style-adaptive end-to-end attention technique using selective multi-attention (SMA). Previous studies on style-adaptive end-to-end synthesis have used a single attention, as in plain single-speaker models; expressive styled speech, however, demands more diverse attention representations. We therefore propose generating candidate alignments with multi-attention and selecting the best one with a selection network. Through comparative experiments against conventional attention techniques, we confirmed that SMA can express a wider range of styles stably.

1 Introduction 1
1.1 Background 1
1.2 Scope of thesis 3
2 Neural Speech Synthesis System 7
2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7
2.2 Overview of End-to-end Speech Synthesis System 9
2.3 Tacotron2 10
2.4 Attention Mechanism 12
2.4.1 Location Sensitive Attention 12
2.4.2 Forward Attention 13
2.4.3 Dynamic Convolution Attention 14
3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17
3.1 Introduction 17
3.2 Background 19
3.2.1 Variational Autoencoder 19
3.2.2 Variational Recurrent Neural Network 20
3.3 Speech Synthesis Using AdVRNN 22
3.3.1 AdVRNN based Acoustic Modeling 23
3.3.2 Training Procedure 24
3.4 Experiments 25
3.4.1 Objective performance evaluation 28
3.4.2 Subjective performance evaluation 29
3.5 Summary 29
4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31
4.1 Introduction 31
4.2 Background 33
4.2.1 Mutual Information 33
4.2.2 Mutual Information Neural Estimator 34
4.2.3 Global Style Token 34
4.3 Style Token end-to-end speech synthesis using MINE 35
4.4 Experiments 36
4.5 Summary 38
5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45
5.1 Introduction 45
5.2 Background 48
5.3 Memory Attention 49
5.4 Experiments 52
5.4.1 Experiments on Single Speaker Speech Synthesis 53
5.4.2 Experiments on Emotional Speech Synthesis 56
5.5 Summary 59
6 Selective Multi-attention for Style-adaptive End-to-End Speech Synthesis 63
6.1 Introduction 63
6.2 Background 65
6.3 Selective multi-attention model 66
6.4 Experiments 67
6.4.1 Multi-speaker speech synthesis experiments 68
6.4.2 Experiments on Emotional Speech Synthesis 73
6.5 Summary 77
7 Conclusions 79
Bibliography 83
Abstract (in Korean) 93
Acknowledgements 95
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization
The performance of most speaker diarization systems with x-vector embeddings
is both vulnerable to noisy environments and lacks domain robustness. Earlier
work on speaker diarization using generative adversarial network (GAN) with an
encoder network (ClusterGAN) to project input x-vectors into a latent space has
shown promising performance on meeting data. In this paper, we extend the
ClusterGAN network to improve diarization robustness and enable rapid
generalization across various challenging domains. To this end, we fetch the
pre-trained encoder from the ClusterGAN and fine-tune it by using prototypical
loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments
are conducted on CALLHOME telephonic conversations, AMI meeting data, DIHARD II
(dev set) which includes challenging multi-domain corpus, and two
child-clinician interaction corpora (ADOS, BOSCC) related to the autism
spectrum disorder domain. Extensive analyses of the experimental data are done
to investigate the effectiveness of the proposed ClusterGAN and MCGAN
embeddings over x-vectors. The results show that the proposed embeddings with
normalized maximum eigengap spectral clustering (NME-SC) back-end consistently
outperform Kaldi state-of-the-art z-vector diarization system. Finally, we
employ embedding fusion with x-vectors to provide further improvement in
diarization performance. We achieve a relative diarization error rate (DER)
improvement of 6.67% to 53.93% on the aforementioned datasets using the
proposed fused embeddings over x-vectors. Besides, the MCGAN embeddings provide
better performance in the number of speakers estimation and short speech
segment diarization as compared to x-vectors and ClusterGAN in telephonic data.Comment: Submitted to IEEE/ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE
PROCESSIN
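The prototypical loss used to fine-tune the encoder can be sketched generically: class prototypes are support-set means, and queries are classified by softmax over negative squared distances to those prototypes. The episode construction, embedding dimensions, and encoder details below are placeholders, not the paper's configuration.

```python
import numpy as np

def prototypical_loss(support, support_labels, query, query_labels):
    """Prototypical loss over embedding vectors (a generic sketch; the paper's
    episode sampling and encoder fine-tuning details are not reproduced).

    Each class prototype is the mean of that class's support embeddings; a query
    is scored by softmax over negative squared distances to all prototypes, and
    the loss is the mean negative log-probability of the true class.
    """
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from each query to each prototype.
    d2 = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    z = logits - logits.max(axis=1, keepdims=True)      # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, query_labels)        # map labels to columns
    return -log_probs[np.arange(len(query)), idx].mean()
```

In the meta-learning setup, each episode samples a few speakers as classes, treats their embeddings as support and query sets, and backpropagates this loss into the encoder so that unseen-speaker embeddings cluster tightly.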
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
- …