204 research outputs found

    Employing Emotion Cues to Verify Speakers in Emotional Talking Environments

    Full text link
    Usually, people talk neutrally in environments where there are no abnormal talking conditions such as stress and emotion. Other emotional conditions that might affect people talking tone like happiness, anger, and sadness. Such emotions are directly affected by the patient health status. In neutral talking environments, speakers can be easily verified, however, in emotional talking environments, speakers cannot be easily verified as in neutral talking ones. Consequently, speaker verification systems do not perform well in emotional talking environments as they do in neutral talking environments. In this work, a two-stage approach has been employed and evaluated to improve speaker verification performance in emotional talking environments. This approach employs speaker emotion cues (text-independent and emotion-dependent speaker verification problem) based on both Hidden Markov Models (HMMs) and Suprasegmental Hidden Markov Models (SPHMMs) as classifiers. The approach is comprised of two cascaded stages that combines and integrates emotion recognizer and speaker recognizer into one recognizer. The architecture has been tested on two different and separate emotional speech databases: our collected database and Emotional Prosody Speech and Transcripts database. The results of this work show that the proposed approach gives promising results with a significant improvement over previous studies and other approaches such as emotion-independent speaker verification approach and emotion-dependent speaker verification approach based completely on HMMs.Comment: Journal of Intelligent Systems, Special Issue on Intelligent Healthcare Systems, De Gruyter, 201

    Hybrid Method for Digits Recognition using Fixed-Frame Scores and Derived Pitch

    Get PDF
    This paper presents a procedure of frame normalization based on the traditional dynamic time warping (DTW) using the LPC coefficients. The redefined method is called as the DTW frame-fixing method (DTW-FF), it works by normalizing the word frames of the input against the reference frames. The enthusiasm to this study is due to neural network limitation that entails a fix number of input nodes for when processing multiple inputs in parallel. Due to this problem, this research is initiated to reduce the amount of computation and complexity in a neural network by reducing the number of inputs into the network. In this study, dynamic warping process is used, in which local distance scores of the warping path are fixed and collected so that their scores are of equal number of frames. Also studied in this paper is the consideration of pitch as a contributing feature to the speech recognition. Results showed a good performance and improvement when using pitch along with DTW-FF feature. The convergence rate between using the steepest gradient descent is also compared to another method namely conjugate gradient method. Convergence rate is also improved when conjugate gradient method is introduced in the back-propagation algorithm

    Significance of Vowel Onset Point Information for Speaker Verification

    Get PDF
    This work demonstrates the significance of information about vowel onset points (VOPs) for speaker verification. VOP is defined as the instant at which the onset of vowel takes place. Vowel-like regions can be identified using VOPs. By production, vowel-like regions have impulse-like excitation and therefore impulse-response of vocal tract system is better manifested in them, and are relatively high signal to noise ratio (SNR) regions. Speaker information extracted from such regions may therefore be more discriminative. Due to this better speaker modeling and reliable testing may be possible using the features extracted from vowel-like regions. It is demonstrated in this work that for clean and matched conditions, relatively less number of frames from vowel-like regions are sufficient for speaker modeling and testing. Alternatively, for degraded and mismatched conditions, vowel-like regions provide better performanc

    Limited Data Speaker Verification: Fusion of Features

    Get PDF
    The present work demonstrates experimental evaluation of speaker verification for different speech feature extraction techniques with the constraints of limited data (less than 15 seconds). The state-of-the-art speaker verification techniques provide good performance for sufficient data (greater than 1 minutes). It is a challenging task to develop techniques which perform well for speaker verification under limited data condition. In this work different features like Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Delta (4), Delta-Delta (44), Linear Prediction Residual (LPR) and Linear Prediction Residual Phase (LPRP) are considered. The performance of individual features is studied and for better verification performance, combination of these features is attempted. A comparative study is made between Gaussian mixture model (GMM) and GMM-universal background model (GMM-UBM) through experimental evaluation. The experiments are conducted using NIST-2003 database. The experimental results show that, the combination of features provides better performance compared to the individual features. Further GMM-UBM modeling gives reduced equal error rate (EER) as compared to GMM

    Nasality in automatic speaker verification

    Get PDF

    CAPT를 위한 발음 변이 분석 및 CycleGAN 기반 피드백 생성

    Get PDF
    학위논문(박사)--서울대학교 대학원 :인문대학 협동과정 인지과학전공,2020. 2. 정민화.Despite the growing popularity in learning Korean as a foreign language and the rapid development in language learning applications, the existing computer-assisted pronunciation training (CAPT) systems in Korean do not utilize linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which may pose a difficulty in combining such knowledge in an automatic system. Moreover, most of the existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods entail limitations since they necessarily depend on finding the right features for the task and the extraction accuracies. This thesis presents a new approach for corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates with accentedness are analyzed and combined with a deep neural network approach, so that feature engineering efforts are minimized while maintaining the linguistically important factors for the corrective feedback generation task. Investigations on non-native Korean speech characteristics in contrast with those of native speakers, and their correlation with accentedness judgement show that both segmental and prosodic variations are important factors in a Korean CAPT system. The present thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers of 27 mother tongue backgrounds. The features are automatically learnt in an unsupervised way in an auxiliary classifier CycleGAN setting, in which the generator learns to map a foreign accented speech to native speech distributions. In order to inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types that were defined in the first half of the thesis. The proposed approach generates a corrected version the speech using the learners own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.외국어로서의 한국어 교육에 대한 관심이 고조되어 한국어 학습자의 수가 크게 증가하고 있으며, 음성언어처리 기술을 적용한 컴퓨터 기반 발음 교육(Computer-Assisted Pronunciation Training; CAPT) 어플리케이션에 대한 연구 또한 적극적으로 이루어지고 있다. 그럼에도 불구하고 현존하는 한국어 말하기 교육 시스템은 외국인의 한국어에 대한 언어학적 특징을 충분히 활용하지 않고 있으며, 최신 언어처리 기술 또한 적용되지 않고 있는 실정이다. 가능한 원인으로써는 외국인 발화 한국어 현상에 대한 분석이 충분하게 이루어지지 않았다는 점, 그리고 관련 연구가 있어도 이를 자동화된 시스템에 반영하기에는 고도화된 연구가 필요하다는 점이 있다. 뿐만 아니라 CAPT 기술 전반적으로는 신호처리, 운율 분석, 자연어처리 기법과 같은 특징 추출에 의존하고 있어서 적합한 특징을 찾고 이를 정확하게 추출하는 데에 많은 시간과 노력이 필요한 실정이다. 이는 최신 딥러닝 기반 언어처리 기술을 활용함으로써 이 과정 또한 발전의 여지가 많다는 바를 시사한다. 따라서 본 연구는 먼저 CAPT 시스템 개발에 있어 발음 변이 양상과 언어학적 상관관계를 분석하였다. 외국인 화자들의 낭독체 변이 양상과 한국어 원어민 화자들의 낭독체 변이 양상을 대조하고 주요한 변이를 확인한 후, 상관관계 분석을 통하여 의사소통에 영향을 미치는 중요도를 파악하였다. 그 결과, 종성 삭제와 3중 대립의 혼동, 초분절 관련 오류가 발생할 경우 피드백 생성에 우선적으로 반영하는 것이 필요하다는 것이 확인되었다. 교정된 피드백을 자동으로 생성하는 것은 CAPT 시스템의 중요한 과제 중 하나이다. 본 연구는 이 과제가 발화의 스타일 변화의 문제로 해석이 가능하다고 보았으며, 생성적 적대 신경망 (Cycle-consistent Generative Adversarial Network; CycleGAN) 구조에서 모델링하는 것을 제안하였다. GAN 네트워크의 생성모델은 비원어민 발화의 분포와 원어민 발화 분포의 매핑을 학습하며, Cycle consistency 손실함수를 사용함으로써 발화간 전반적인 구조를 유지함과 동시에 과도한 교정을 방지하였다. 별도의 특징 추출 과정이 없이 필요한 특징들이 CycleGAN 프레임워크에서 무감독 방법으로 스스로 학습되는 방법으로, 언어 확장이 용이한 방법이다. 언어학적 분석에서 드러난 주요한 변이들 간의 우선순위는 Auxiliary Classifier CycleGAN 구조에서 모델링하는 것을 제안하였다. 이 방법은 기존의 CycleGAN에 지식을 접목시켜 피드백 음성을 생성함과 동시에 해당 피드백이 어떤 유형의 오류인지 분류하는 문제를 수행한다. 이는 도메인 지식이 교정 피드백 생성 단계까지 유지되고 통제가 가능하다는 장점이 있다는 데에 그 의의가 있다. 본 연구에서 제안한 방법을 평가하기 위해서 27개의 모국어를 갖는 217명의 유의미 어휘 발화 65,100개로 피드백 자동 생성 모델을 훈련하고, 개선 여부 및 정도에 대한 지각 평가를 수행하였다. 제안된 방법을 사용하였을 때 학습자 본인의 목소리를 유지한 채 교정된 발음으로 변환하는 것이 가능하며, 전통적인 방법인 음높이 동기식 중첩가산 (Pitch-Synchronous Overlap-and-Add) 알고리즘을 사용하는 방법에 비해 상대 개선률 16.67%이 확인되었다.Chapter 1. Introduction 1 1.1. Motivation 1 1.1.1. An Overview of CAPT Systems 3 1.1.2. Survey of existing Korean CAPT Systems 5 1.2. Problem Statement 7 1.3. Thesis Structure 7 Chapter 2. Pronunciation Analysis of Korean Produced by Chinese 9 2.1. Comparison between Korean and Chinese 11 2.1.1. Phonetic and Syllable Structure Comparisons 11 2.1.2. Phonological Comparisons 14 2.2. Related Works 16 2.3. Proposed Analysis Method 19 2.3.1. Corpus 19 2.3.2. Transcribers and Agreement Rates 22 2.4. Salient Pronunciation Variations 22 2.4.1. Segmental Variation Patterns 22 2.4.1.1. Discussions 25 2.4.2. Phonological Variation Patterns 26 2.4.1.2. Discussions 27 2.5. Summary 29 Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation 30 3.1. Related Works 31 3.1.1. Criteria used in L2 Speech 31 3.1.2. Criteria used in L2 Korean Speech 32 3.2. Proposed Human Evaluation Method 36 3.2.1. Reading Prompt Design 36 3.2.2. Evaluation Criteria Design 37 3.2.3. Raters and Agreement Rates 40 3.3. Linguistic Factors Affecting L2 Korean Accentedness 41 3.3.1. Pearsons Correlation Analysis 41 3.3.2. Discussions 42 3.3.3. Implications for Automatic Feedback Generation 44 3.4. Summary 45 Chapter 4. Corrective Feedback Generation for CAPT 46 4.1. Related Works 46 4.1.1. Prosody Transplantation 47 4.1.2. Recent Speech Conversion Methods 49 4.1.3. Evaluation of Corrective Feedback 50 4.2. Proposed Method: Corrective Feedback as a Style Transfer 51 4.2.1. Speech Analysis at Spectral Domain 53 4.2.2. Self-imitative Learning 55 4.2.3. An Analogy: CAPT System and GAN Architecture 57 4.3. Generative Adversarial Networks 59 4.3.1. Conditional GAN 61 4.3.2. CycleGAN 62 4.4. Experiment 63 4.4.1. Corpus 64 4.4.2. Baseline Implementation 65 4.4.3. Adversarial Training Implementation 65 4.4.4. Spectrogram-to-Spectrogram Training 66 4.5. Results and Evaluation 69 4.5.1. Spectrogram Generation Results 69 4.5.2. Perceptual Evaluation 70 4.5.3. Discussions 72 4.6. Summary 74 Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation 75 5.1. Linguistic Class Selection 75 5.2. Auxiliary Classifier CycleGAN Design 77 5.3. Experiment and Results 80 5.3.1. Corpus 80 5.3.2. Feature Annotations 81 5.3.3. Experiment Setup 81 5.3.4. Results 82 5.4. Summary 84 Chapter 6. Conclusion 86 6.1. Thesis Results 86 6.2. Thesis Contributions 88 6.3. Recommendations for Future Work 89 Bibliography 91 Appendix 107 Abstract in Korean 117 Acknowledgments 120Docto

    Discriminative features for GMM and i-vector based speaker diarization

    Get PDF
    Speaker diarization has received several research attentions over the last decade. Among the different domains of speaker diarization, diarization in meeting domain is the most challenging one. It usually contains spontaneous speech and is, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect the performance of speaker diarization systems are the techniques employed to perform both speaker segmentation and speaker clustering. In this thesis, we have proposed the use of jitter and shimmer long-term voice-quality features both for Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with the state-of-the-art short-term cepstral and long-term speech ones. The long-term features consist of prosody and Glottal-to-Noise excitation ratio (GNE) descriptors. Firstly, the voice-quality, prosodic and GNE features are stacked in the same feature vector. Then, they are fused with cepstral coefficients at the score likelihood level both for the proposed Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. For the proposed GMM based speaker diarization system, independent HMM models are estimated from the short-term and long-term speech feature sets. The fusion of the short-term descriptors with the long-term ones in speaker segmentation is carried out by linearly weighting the log-likelihood scores of Viterbi decoding. In the case of speaker clustering, the fusion of the short-term cepstral features with the long-term ones is carried out by linearly fusing the Bayesian Information Criterion (BIC) scores corresponding to these feature sets. For the proposed i-vector based speaker diarization system, the speaker segmentation is carried out exactly the same as in the previously mentioned GMM based speaker diarization system. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two set of i-vectors are extracted from the speaker segmentation hypothesis. Whilst the first i-vector is extracted from short-term cepstral features, the second one is extracted from the voice quality, prosody and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as speaker clustering distance. We have also proposed the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that delta dynamic features capture the transitional characteristics of the speech signal which contain speaker specific information. This information is not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and long-term speech features (i.e., voice-quality, prosody and GNE) both for GMM and i-vector based speaker diarization systems. The experiments have been carried out on Augmented Multi-party Interaction (AMI) meeting corpus. The experimental results show that the use of voice-quality, prosody, GNE and delta dynamic features improve the performance of both GMM and i-vector based speaker diarization systems.La diarización del altavoz ha recibido varias atenciones de investigación durante la última década. Entre los diferentes dominios de la diarización del hablante, la diarización en el dominio del encuentro es la más difícil. Normalmente contiene habla espontánea y, por ejemplo, es susceptible de reverberación. La selección apropiada de las características del habla es uno de los factores que afectan el rendimiento de los sistemas de diarización de los altavoces. Los Coeficientes Cepstral de Frecuencia Mel (MFCC) son las características de habla de corto plazo más utilizadas en la diarización de los altavoces. Otros factores que afectan el rendimiento de los sistemas de diarización del altavoz son las técnicas empleadas para realizar tanto la segmentación del altavoz como el agrupamiento de altavoces. En esta tesis, hemos propuesto el uso de jitter y shimmer características de calidad de voz a largo plazo tanto para GMM y i-vector basada en sistemas de diarización de altavoces. Las características de calidad de voz se utilizan junto con el estado de la técnica a corto plazo cepstral y de larga duración de habla. Las características a largo plazo consisten en la prosodia y los descriptores de relación de excitación Glottal-a-Ruido (GNE). En primer lugar, las características de calidad de voz, prosódica y GNE se apilan en el mismo vector de características. A continuación, se fusionan con coeficientes cepstrales en el nivel de verosimilitud de puntajes tanto para los sistemas de diarización de altavoces basados ¿¿en el modelo Gaussian Mixture Modeling (GMM) como en los sistemas basados ¿¿en i-vector. . Para el sistema de diarización de altavoces basado en GMM propuesto, se calculan modelos HMM independientes a partir de cada conjunto de características. En la segmentación de los altavoces, la fusión de los descriptores a corto plazo con los de largo plazo se lleva a cabo mediante la ponderación lineal de las puntuaciones log-probabilidad de decodificación Viterbi. En la agrupación de altavoces, la fusión de las características cepstrales a corto plazo con las de largo plazo se lleva a cabo mediante la fusión lineal de las puntuaciones Bayesian Information Criterion (BIC) correspondientes a estos conjuntos de características. Para el sistema de diarización de altavoces basado en un vector i, la fusión de características se realiza exactamente igual a la del sistema basado en GMM antes mencionado. Sin embargo, la técnica de agrupación de altavoces se basa en el paradigma de análisis de factores recientemente introducido. Dos conjuntos de i-vectores se extraen de la hipótesis de segmentación de altavoz. Mientras que el primer vector i se extrae de características espectrales a corto plazo, el segundo se extrae de los descriptores de calidad de voz apilados, prosódicos y GNE. A continuación, las puntuaciones de coseno-distancia y Probabilistic Linear Discriminant Analysis (PLDA) entre i-vectores se ponderan linealmente para obtener una puntuación de similitud fundida. Finalmente, la puntuación fusionada se utiliza como distancia de agrupación de altavoces. También hemos propuesto el uso de características dinámicas delta para la agrupación de locutores. La motivación para el uso de deltas en la agrupación es que las características dinámicas delta capturan las características de transición de la señal de voz que contienen información específica del locutor. Esta información no es capturada por los coeficientes cepstrales estáticos. Las características delta se usan junto con los coeficientes cepstrales estáticos a corto plazo y las características de voz a largo plazo (es decir, calidad de voz, prosodia y GNE) tanto para sistemas de diarización de altavoces basados en GMM como en sistemas i-vector. Los resultados experimentales sobre AMI muestran que el uso de calidad vocal, prosódica, GNE y dinámicas delta mejoran el rendimiento de los sistemas de diarización de altavoces basados en GMM e i-vector.Postprint (published version
    corecore