Improvement of Text Dependent Speaker Identification System Using Neuro-Genetic Hybrid Algorithm in Office Environmental Conditions
In this paper, an improved strategy for an automated text-dependent speaker identification system in noisy environments is proposed. The identification process incorporates a Neuro-Genetic hybrid algorithm with cepstral-based features. A Wiener filter is used to remove background noise from the source utterances. Speech pre-processing techniques such as start-end point detection, pre-emphasis filtering, frame blocking and windowing are applied to the utterances. RCC, MFCC, ΔMFCC, ΔΔMFCC, LPC and LPCC are used to extract the features, and the feature sets are varied to optimize identification performance. After feature extraction, the Neuro-Genetic hybrid algorithm is used for learning and identification. On the VALID speech database, the highest speaker identification rates achieved in the closed-set text-dependent task are 100% under studio conditions and 82.33% under office environmental conditions.
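As a rough illustration of the cepstral front end described above, the following sketch computes MFCCs with first- and second-order deltas (ΔMFCC, ΔΔMFCC). It assumes the librosa library and a hypothetical input file "utterance.wav"; the RCC/LPC features and the Neuro-Genetic classifier itself are not shown.

```python
# Sketch: cepstral feature extraction for speaker identification.
# Assumes librosa is installed; "utterance.wav" is a hypothetical input file.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# Pre-emphasis filtering (boosts high frequencies before analysis).
y = np.append(y[0], y[1:] - 0.97 * y[:-1])

# Frame blocking and windowing happen inside the STFT used by the MFCC routine.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms / 10 ms frames

# First- and second-order dynamic features (ΔMFCC, ΔΔMFCC).
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, d1, d2])  # (39, n_frames) feature matrix
```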
Pronunciation Variation Analysis and CycleGAN-Based Feedback Generation for CAPT
Thesis (Ph.D.) -- Seoul National University Graduate School: Interdisciplinary Program in Cognitive Science, College of Humanities, February 2020. Advisor: Minhwa Chung.
Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which may pose a difficulty in incorporating such knowledge into an automatic system. Moreover, most existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods entail limitations since they necessarily depend on finding the right features for the task and on the accuracy of their extraction.
This thesis presents a new approach to corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates of accentedness are analyzed and combined with a deep neural network approach, so that feature engineering effort is minimized while the linguistically important factors for the corrective feedback generation task are maintained. Investigations of non-native Korean speech characteristics, in contrast with those of native speakers, and of their correlation with accentedness judgements show that both segmental and prosodic variations are important factors in a Korean CAPT system.
The present thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers from 27 mother tongue backgrounds. The features are learnt automatically in an unsupervised way in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to the native speech distribution. In order to inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types that were defined in the first half of the thesis. The proposed approach generates a corrected version of the speech using the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.
Abstract in Korean (translated): As interest in Korean as a foreign language has risen, the number of Korean learners has grown sharply, and research on computer-assisted pronunciation training (CAPT) applications that apply spoken language processing technology is being actively pursued. Nevertheless, existing Korean speaking-education systems make insufficient use of the linguistic characteristics of non-native Korean, and recent language processing technology has not been applied to them either. Possible reasons are that the phenomena of Korean produced by non-native speakers have not been analyzed sufficiently, and that even where related research exists, more advanced work is needed before it can be reflected in an automated system. Moreover, CAPT technology in general depends on feature extraction based on signal processing, prosodic analysis, and natural language processing techniques, so much time and effort is required to find suitable features and extract them accurately. This suggests that this process, too, has much room for improvement through recent deep learning-based language processing technology.
This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for the development of a CAPT system. The substitution variation patterns of non-native learners were contrasted with those of native Korean speakers to identify the major variations, and a correlation analysis was conducted to determine how strongly each variation affects communication. As a result, it was confirmed that feedback generation should give priority to confusions of the three-way consonant contrast in final position and to errors involving suprasegmentals.
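The correlation step described above could look like the following sketch, which relates per-speaker error counts to human accentedness ratings. The arrays are hypothetical placeholders (the thesis's corpus data are not reproduced here), and scipy is assumed.

```python
# Sketch: Pearson correlation between per-speaker error counts and
# human accentedness ratings (hypothetical data; assumes scipy).
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-speaker counts of one variation type and mean
# accentedness scores assigned by human raters (1 = native-like, 5 = strong).
error_counts = np.array([3, 7, 1, 9, 4, 6, 2, 8])
accentedness = np.array([2.1, 3.8, 1.5, 4.4, 2.6, 3.5, 1.9, 4.1])

r, p = pearsonr(error_counts, accentedness)
print(f"r = {r:.2f}, p = {p:.3f}")  # a strong r flags the error type as a priority
```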
Automatically generating corrective feedback is one of the key tasks of a CAPT system. This study regards the task as a problem of style transfer between utterances and proposes modeling it in a cycle-consistent generative adversarial network (CycleGAN) architecture. The generator of the GAN learns a mapping from the distribution of non-native utterances to the distribution of native utterances, and a cycle consistency loss is used so that the overall structure of each utterance is preserved while over-correction is prevented. Because there is no separate feature extraction step and the necessary features are learned in an unsupervised manner within the CycleGAN framework, the method is easy to extend to other languages.
The priorities among the major variations revealed by the linguistic analysis are modeled in an Auxiliary Classifier CycleGAN architecture. This method grafts linguistic knowledge onto the existing CycleGAN so that, while generating the feedback speech, the model also classifies which type of error the feedback addresses. Its significance lies in the fact that domain knowledge is retained, and remains controllable, up to the corrective feedback generation stage.
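As a rough, non-authoritative sketch of how an auxiliary classifier CycleGAN objective can combine adversarial, cycle consistency, and error-type classification terms: the tiny networks, spectrogram shapes, and loss weightings below are illustrative assumptions, not the thesis implementation.

```python
# Illustrative auxiliary-classifier CycleGAN generator objective (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGen(nn.Module):          # placeholder generator over (B, 1, F, T)
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1, 1, 3, padding=1)
    def forward(self, x):
        return self.net(x)

class TinyDisc(nn.Module):         # discriminator + auxiliary error-type head
    def __init__(self, n_classes=4):
        super().__init__()
        self.feat = nn.Conv2d(1, 8, 3, padding=1)
        self.adv = nn.Conv2d(8, 1, 1)                  # real/fake map
        self.cls = nn.Linear(8, n_classes)             # error-type logits
    def forward(self, x):
        h = F.relu(self.feat(x))
        return self.adv(h), self.cls(h.mean(dim=(2, 3)))

def generator_loss(G_ab, G_ba, D_b, x_nonnative, error_type,
                   lam_cyc=10.0, lam_cls=1.0):
    fake_native = G_ab(x_nonnative)
    validity, cls_logits = D_b(fake_native)
    # Adversarial term (LSGAN form): fool the native-domain discriminator.
    adv = F.mse_loss(validity, torch.ones_like(validity))
    # Cycle consistency: reconstructing the learner's original utterance
    # preserves overall structure and discourages over-correction.
    cyc = F.l1_loss(G_ba(fake_native), x_nonnative)
    # Auxiliary classifier: the feedback must still identify the linguistic
    # error type defined by the analysis above.
    cls = F.cross_entropy(cls_logits, error_type)
    return adv + lam_cyc * cyc + lam_cls * cls

# Dummy usage: a batch of 2 spectrograms (1 channel, 80 mel bins, 100 frames).
G_ab, G_ba, D_b = TinyGen(), TinyGen(), TinyDisc()
x = torch.randn(2, 1, 80, 100)
labels = torch.tensor([0, 2])      # hypothetical error-type indices
generator_loss(G_ab, G_ba, D_b, x, labels).backward()
```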
To evaluate the proposed method, a feedback generation model was trained on 65,100 meaningful word utterances by 217 learners with 27 different mother tongues, and a perceptual evaluation of whether and how much the output improved was conducted. With the proposed method, a learner's utterance can be converted into corrected pronunciation while the learner's own voice is preserved, and a relative improvement of 16.67% over the conventional Pitch-Synchronous Overlap-and-Add (PSOLA) method was confirmed.
Chapter 1. Introduction 1
1.1. Motivation 1
1.1.1. An Overview of CAPT Systems 3
1.1.2. Survey of existing Korean CAPT Systems 5
1.2. Problem Statement 7
1.3. Thesis Structure 7
Chapter 2. Pronunciation Analysis of Korean Produced by Chinese 9
2.1. Comparison between Korean and Chinese 11
2.1.1. Phonetic and Syllable Structure Comparisons 11
2.1.2. Phonological Comparisons 14
2.2. Related Works 16
2.3. Proposed Analysis Method 19
2.3.1. Corpus 19
2.3.2. Transcribers and Agreement Rates 22
2.4. Salient Pronunciation Variations 22
2.4.1. Segmental Variation Patterns 22
2.4.1.1. Discussions 25
2.4.2. Phonological Variation Patterns 26
2.4.2.1. Discussions 27
2.5. Summary 29
Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation 30
3.1. Related Works 31
3.1.1. Criteria used in L2 Speech 31
3.1.2. Criteria used in L2 Korean Speech 32
3.2. Proposed Human Evaluation Method 36
3.2.1. Reading Prompt Design 36
3.2.2. Evaluation Criteria Design 37
3.2.3. Raters and Agreement Rates 40
3.3. Linguistic Factors Affecting L2 Korean Accentedness 41
3.3.1. Pearson's Correlation Analysis 41
3.3.2. Discussions 42
3.3.3. Implications for Automatic Feedback Generation 44
3.4. Summary 45
Chapter 4. Corrective Feedback Generation for CAPT 46
4.1. Related Works 46
4.1.1. Prosody Transplantation 47
4.1.2. Recent Speech Conversion Methods 49
4.1.3. Evaluation of Corrective Feedback 50
4.2. Proposed Method: Corrective Feedback as a Style Transfer 51
4.2.1. Speech Analysis at Spectral Domain 53
4.2.2. Self-imitative Learning 55
4.2.3. An Analogy: CAPT System and GAN Architecture 57
4.3. Generative Adversarial Networks 59
4.3.1. Conditional GAN 61
4.3.2. CycleGAN 62
4.4. Experiment 63
4.4.1. Corpus 64
4.4.2. Baseline Implementation 65
4.4.3. Adversarial Training Implementation 65
4.4.4. Spectrogram-to-Spectrogram Training 66
4.5. Results and Evaluation 69
4.5.1. Spectrogram Generation Results 69
4.5.2. Perceptual Evaluation 70
4.5.3. Discussions 72
4.6. Summary 74
Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation 75
5.1. Linguistic Class Selection 75
5.2. Auxiliary Classifier CycleGAN Design 77
5.3. Experiment and Results 80
5.3.1. Corpus 80
5.3.2. Feature Annotations 81
5.3.3. Experiment Setup 81
5.3.4. Results 82
5.4. Summary 84
Chapter 6. Conclusion 86
6.1. Thesis Results 86
6.2. Thesis Contributions 88
6.3. Recommendations for Future Work 89
Bibliography 91
Appendix 107
Abstract in Korean 117
Acknowledgments 120
Automatic Detection and Assessment of Dysarthric Speech Using Prosodic Information
Thesis (M.A.) -- Seoul National University Graduate School: Department of Linguistics, College of Humanities, August 2020. Advisor: Minhwa Chung.
Abstract in Korean (translated): Speech impairment is one of the earliest symptoms of neurological or degenerative disease. Dysarthria appears in a variety of patient groups, including those with Parkinson's disease, cerebral palsy, amyotrophic lateral sclerosis, and multiple sclerosis. Caused by damage to the nerves controlling the articulators, dysarthria has imprecise articulation as its main characteristic and is also reported to affect prosody. Previous studies have used prosody-based measures to distinguish dysarthric from non-disordered speech. In clinical settings, prosody-based analysis of dysarthric speech can help diagnose dysarthria or prepare a treatment appropriate to the pattern of impairment. It is therefore necessary to look closely not only at how dysarthria affects prosody but also at the prosodic characteristics of dysarthric speech itself: specifically, in which respects prosody is affected, and how prosodic impairment differs with the severity of the disorder. This thesis examines various aspects of prosody, including pitch, voice quality, speech rate, and rhythm, and uses them for the detection and assessment of dysarthria. The extracted prosodic features were optimized through several feature selection algorithms and used as input to machine learning-based classifiers, whose performance was evaluated with accuracy, precision, recall, and F1-score. The thesis also analyzes the usefulness of prosodic information across severity levels (mild, moderate, severe). Finally, because collecting disordered speech is difficult, cross-language classifiers were used: Korean and English disordered speech served as the training set, while only each target language was used for testing. The experimental results suggest three things. First, using prosodic information helps detect and assess dysarthria: compared with using MFCCs alone, adding prosodic information helped on both the Korean and English datasets. Second, prosodic information is especially useful for assessment; for English, relative accuracy improvements of 1.82% for detection and 20.6% for assessment were observed, while for Korean no improvement was seen for detection but a relative improvement of 13.6% appeared for assessment. Third, cross-language classifiers outperform single-language classifiers, showing a relative accuracy gain of 4.12%. This suggests that certain prosodic impairments are cross-linguistic and that training sets short of data can be supplemented with data from other languages.
One of the earliest cues for neurological or degenerative disorders is speech impairment. Individuals with Parkinson's disease, cerebral palsy, amyotrophic lateral sclerosis, and multiple sclerosis, among others, are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles, which eventually leads to severe misarticulation. However, impairments in the suprasegmental domain are also present, and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosody-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to determine not only how the prosody of speech is affected by dysarthria, but also which aspects of prosody are more affected and how prosodic impairments change with the severity of dysarthria.
In the current study, several prosodic features related to pitch, voice quality, rhythm, and speech rate are used for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features is optimal for accurate detection. After selecting an optimal set of prosodic features, we use them as input to machine learning-based classifiers and assess the performance using the evaluation metrics accuracy, precision, recall, and F1-score (a sketch of such a pipeline follows this entry's table of contents). Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g. mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers, where both Korean and English data are used for training but only one language is used for testing. Results suggest that, in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both the Korean and English datasets. In particular, large improvements were seen when assessing different severity levels: for English, relative accuracy improvements of 1.82% for detection and 20.6% for assessment were seen, while the Korean dataset saw no improvement for detection but a relative improvement of 13.6% for assessment. The cross-language experiments showed a relative improvement of up to 4.12% in comparison to using only a single language during training. Certain prosodic impairments, such as those of pitch and duration, may be language independent; therefore, when training sets for individual languages are limited, they may be supplemented with data from other languages.
1. Introduction 1
1.1. Dysarthria 1
1.2. Impaired Speech Detection 3
1.3. Research Goals & Outline 6
2. Background Research 8
2.1. Prosodic Impairments 8
2.1.1. English 8
2.1.2. Korean 10
2.2. Machine Learning Approaches 12
3. Database 18
3.1. English-TORGO 20
3.2. Korean-QoLT 21
4. Methods 23
4.1. Prosodic Features 23
4.1.1. Pitch 23
4.1.2. Voice Quality 26
4.1.3. Speech Rate 29
4.1.4. Rhythm 30
4.2. Feature Selection 34
4.3. Classification Models 38
4.3.1. Random Forest 38
4.3.2. Support Vector Machine 40
4.3.3. Feed-Forward Neural Network 42
4.4. Mel-Frequency Cepstral Coefficients 43
5. Experiment 46
5.1. Model Parameters 47
5.2. Training Procedure 48
5.2.1. Dysarthria Detection 48
5.2.2. Severity Assessment 50
5.2.3. Cross-Language 51
6. Results 52
6.1. TORGO 52
6.1.1. Dysarthria Detection 52
6.1.2. Severity Assessment 56
6.2. QoLT 57
6.2.1. Dysarthria Detection 57
6.2.2. Severity Assessment 58
6.3. Cross-Language 59
7. Discussion 62
7.1. Linguistic Implications 62
7.2. Clinical Applications 65
8. Conclusion 67
References 69
Appendix 76
Abstract in Korean 79
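The detection pipeline referenced above could look like the following minimal sketch. The feature matrix, labels, and hyperparameters are placeholders (the thesis's actual prosodic feature extraction is not reproduced), and scikit-learn is assumed.

```python
# Sketch: feature selection + Random Forest dysarthria detection
# (hypothetical data; assumes scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# X: one row per utterance of prosodic measures (pitch statistics, voice
# quality, speech rate, rhythm metrics); y: 1 = dysarthric, 0 = healthy.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))          # placeholder feature matrix
y = rng.integers(0, 2, size=200)        # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Keep the k prosodic features most associated with the labels.
selector = SelectKBest(f_classif, k=15).fit(X_tr, y_tr)
X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("f1", f1_score)]:
    print(name, round(fn(y_te, pred), 3))
```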
Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey
Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements. Over the years, a considerable amount of research in the field of VSR has evaluated system performance with different algorithms and datasets, resulting in significant progress in developing effective VSR models and creating new opportunities for further research. This survey provides an in-depth examination of the evolution of speaker-independent VSR systems from 1990 to 2023, with a particular emphasis on the transition from speaker-dependent to speaker-independent systems, thoroughly analyzing each published work and comparing them on various parameters. We also provide a comprehensive overview of the datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence, and we highlight the need to develop end-to-end pipelines for speaker-independent VSR. A pictorial representation offers a clear and concise overview of the techniques used in speaker-independent VSR, aiding the comprehension and analysis of the various methodologies. The survey also discusses the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, this comprehensive review surveys the current state of the art in speaker-independent VSR and highlights potential areas for future research.
Modeling DNN as human learner
In previous experiments, human listeners demonstrated the ability to adapt to unheard, ambiguous phonemes after relatively short initial exposures. At the same time, previous work in the speech community has shown that pre-trained deep neural network (DNN) based ASR systems, like humans, can adapt to unseen, ambiguous phonemes after retuning their parameters on a relatively small set. In the first part of this thesis, the time course of phoneme category adaptation in a DNN is investigated in more detail. By retuning the DNNs on increasing numbers of tokens containing ambiguous sounds and comparing classification accuracy on the ambiguous phonemes in a held-out test set across the time course, we found that DNNs, like human listeners, also demonstrate fast adaptation: the accuracy curves were step-like in almost all cases, showing very little further adaptation after only one (out of ten) training bins had been seen.

However, unlike the experimental setup mentioned above, in a typical lexically guided perceptual learning experiment listeners are trained with individual words instead of individual phones; to truly model such a scenario, we require a model that can take the context of a whole utterance into account. Traditional speech recognition systems accomplish this through hidden Markov models (HMMs) and weighted finite-state transducer (WFST) decoding. In recent years, bidirectional long short-term memory (Bi-LSTM) networks trained under the connectionist temporal classification (CTC) criterion have also attracted much attention. In the second part of this thesis, the previous experiments on ambiguous phoneme recognition were carried out again on a new Bi-LSTM model, with phonetic transcriptions of words ending in ambiguous phonemes used as training targets instead of isolated single-phoneme sounds. We found that, despite the vastly different architecture, the new model showed highly similar behavior in terms of classification rate over the time course of incremental retuning. This indicates that ambiguous phonemes in a continuous context can also be quickly adapted to by neural network-based models.

In the last part of this thesis, the pre-trained Dutch Bi-LSTM from the previous part was treated as a Dutch second-language learner and asked to transcribe English utterances in a self-adaptation scheme. In other words, we used the Dutch model to generate phonetic transcriptions directly and retuned the model on the transcriptions it generated, although ground-truth transcriptions were used to choose a subset of all self-labeled transcriptions. Self-adaptation is of interest as a model of human second-language learning, but it also has great practical engineering value; for example, it could be used to adapt speech recognition to a low-resource language. We investigated two ways to improve the adaptation scheme: first, multi-task learning with articulatory feature detection both during training on Dutch and during self-labeled adaptation; and second, letting the model adapt to isolated short words before feeding it longer utterances.
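As a rough sketch of the Bi-LSTM/CTC setup described above: the dimensions, phone vocabulary, and data below are illustrative assumptions (not the thesis's Dutch model), and PyTorch is assumed.

```python
# Sketch: bidirectional LSTM trained with the CTC criterion over phone targets
# (illustrative shapes and dummy data; assumes PyTorch).
import torch
import torch.nn as nn

n_feats, n_phones = 40, 46        # e.g. 45 phone labels + CTC blank (index 0)

class BiLSTMCTC(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, 256, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 256, n_phones)

    def forward(self, x):                    # x: (batch, time, n_feats)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)   # (batch, time, n_phones)

model = BiLSTMCTC()
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(8, 200, n_feats)           # dummy batch of feature frames
targets = torch.randint(1, n_phones, (8, 30))  # dummy phone transcriptions
in_lens = torch.full((8,), 200)
tgt_lens = torch.full((8,), 30)

log_probs = model(feats).transpose(0, 1)       # CTCLoss expects (time, batch, C)
loss = ctc(log_probs, targets, in_lens, tgt_lens)
loss.backward()  # retuning on ambiguous tokens would reuse this same step
```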
A Novel Robust Mel-Energy Based Voice Activity Detector for Nonstationary Noise and Its Application for Speech Waveform Compression
Voice activity detection (VAD) is crucial in all kinds of speech applications. However, almost all existing VAD algorithms suffer from the nonstationarity of both speech and noise. To combat this difficulty, we propose a new voice activity detector based on Mel-energy features and an adaptive threshold related to signal-to-noise ratio (SNR) estimates. In this thesis, we first justify the robustness of a Bayes classifier using Mel-energy features over one using Fourier spectral features in various noise environments. Then, we design an algorithm using a dynamic Mel-energy estimator and an adaptive threshold that depends on the SNR estimates. In addition, a realignment scheme is incorporated to correct sparse and spurious noise estimates. Numerous simulations are carried out to evaluate the performance of our proposed VAD method, with comparisons against two existing representative schemes, namely the VAD using a likelihood ratio test with Fourier spectral energy features and the one based on enhanced time-frequency parameters. Three types of noise, namely white noise (stationary), babble noise (nonstationary), and vehicular noise (nonstationary), were artificially added for our experiments. Our proposed VAD algorithm significantly outperforms the existing methods, as illustrated by the corresponding receiver operating characteristic (ROC) curves. Finally, we demonstrate one of the major applications of our new robust VAD scheme, namely speech waveform compression, and quantify its effectiveness in terms of compression efficiency.
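A toy sketch of the general idea of Mel-energy features with an SNR-adaptive threshold follows. The smoothing constants, threshold rule, and input file are illustrative guesses rather than the thesis's algorithm, and librosa is assumed.

```python
# Toy sketch: Mel-energy VAD with an SNR-adaptive threshold
# (illustrative constants; assumes librosa; not the thesis's exact algorithm).
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=8000)  # hypothetical input file

# Per-frame log Mel-band energy, summed across bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=256,
                                     hop_length=80, n_mels=23)
frame_energy = np.log(mel.sum(axis=0) + 1e-10)

# Track the noise floor with a slow recursive estimate, derive a frame-level
# SNR estimate, and raise the decision margin as the SNR drops.
noise = frame_energy[:10].mean()       # bootstrap from assumed leading silence
decisions = []
for e in frame_energy:
    snr = e - noise
    threshold = noise + max(2.0, 6.0 - 0.5 * snr)   # adaptive margin
    speech = e > threshold
    decisions.append(speech)
    if not speech:                     # update noise estimate on silence only
        noise = 0.98 * noise + 0.02 * e

decisions = np.array(decisions)        # True = voice-active frame
```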
Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering
While recent research advances in speaker diarization mostly focus on
improving the quality of diarization results, there is also an increasing
interest in improving the efficiency of diarization systems. In this paper, we
demonstrate that a multi-stage clustering strategy that uses different
clustering algorithms for input of different lengths can address multi-faceted
challenges of on-device speaker diarization applications. Specifically, a
fallback clusterer is used to handle short-form inputs; a main clusterer is
used to handle medium-length inputs; and a pre-clusterer is used to compress
long-form inputs before they are processed by the main clusterer. Both the main
clusterer and the pre-clusterer can be configured with an upper bound of the
computational complexity to adapt to devices with different resource
constraints. This multi-stage clustering strategy is critical for streaming
on-device speaker diarization systems, where the budgets of CPU, memory and
battery are tight.
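A simplified sketch of routing by input length in a multi-stage clusterer follows. The thresholds, the compression step, and the clustering algorithm choices are illustrative assumptions, not the paper's configuration; numpy and scikit-learn are assumed.

```python
# Sketch: multi-stage clustering of speaker embeddings by input length
# (illustrative thresholds; assumes numpy and scikit-learn).
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

FALLBACK_MAX = 4      # short-form: too few segments to cluster reliably
MAIN_MAX = 512        # upper bound on what the main clusterer may see

def diarize_labels(embeddings: np.ndarray) -> np.ndarray:
    n = len(embeddings)

    # Fallback clusterer: assume a single speaker for very short inputs.
    if n <= FALLBACK_MAX:
        return np.zeros(n, dtype=int)

    # Pre-clusterer: compress long-form input to MAIN_MAX centroids so the
    # main clusterer's cost stays bounded regardless of audio length.
    if n > MAIN_MAX:
        pre = KMeans(n_clusters=MAIN_MAX, n_init=3).fit(embeddings)
        reps, assign = pre.cluster_centers_, pre.labels_
    else:
        reps, assign = embeddings, np.arange(n)

    # Main clusterer: agglomerative clustering with a distance threshold,
    # so the number of speakers need not be known in advance.
    main = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
    rep_labels = main.fit_predict(reps)
    return rep_labels[assign]          # map compressed labels back to segments
```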
A Review of Accent-Based Automatic Speech Recognition Models for E-Learning Environment
The adoption of electronic learning (e-learning) as a method of disseminating knowledge in the global educational system is growing at a rapid rate, and has created a shift in knowledge acquisition from conventional classrooms and tutors to distributed e-learning techniques that enable access to various learning resources much more conveniently and flexibly. However, notwithstanding the adaptive advantages of the learner-centric content of e-learning programmes, the distributed e-learning environment has adopted only a few international languages as the languages of communication among its participants, despite the various accents (mother-language influence) among these participants. Adjusting to and accommodating these various accents has brought about the introduction of accent-based automatic speech recognition into e-learning to resolve the effects of accent differences. This paper reviews over 50 research papers to determine the progress made in the design and implementation of accent-based automatic speech recognition models for e-learning between 2001 and 2021. The analysis shows that 50% of the reviewed models adopted the English language, 46.50% adopted major Chinese and Indian languages, and 3.50% adopted the Swedish language as the mode of communication. The majority of the ASR models are therefore centred on European, American, and Asian accents, while excluding the accent peculiarities associated with less technologically resourced continents.