
    Speech Recognition in noisy environment using Deep Learning Neural Network

    Recent research in the field of automatic speaker recognition has shown that methods based on deep neural networks provide better performance than other statistical classifiers. On the other hand, these methods usually require tuning a significant number of parameters. The goal of this thesis is to show that selecting appropriate parameter values can significantly improve the speaker recognition performance of methods based on deep neural networks. The reported study introduces an approach to automatic speaker recognition based on deep neural networks and the stochastic gradient descent algorithm. It focuses in particular on three parameters of the stochastic gradient descent algorithm: the learning rate, and the hidden and input layer dropout rates. Additional attention was devoted to speaker recognition under noisy conditions. Two experiments were therefore conducted in the scope of this thesis. The first experiment was intended to demonstrate that optimizing the observed parameters of the stochastic gradient descent algorithm can improve speaker recognition performance in the absence of noise. This experiment was conducted in two phases. In the first phase, the recognition rate was observed while the hidden layer dropout rate and the learning rate were varied and the input layer dropout rate was held constant. In the second phase, the recognition rate was observed while the input layer dropout rate and the learning rate were varied and the hidden layer dropout rate was held constant. The second experiment was intended to show that optimizing the observed parameters of the stochastic gradient descent algorithm can improve speaker recognition performance even under noisy conditions; to this end, different noise levels were artificially applied to the original speech signal.
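
    To make the described parameter sweep concrete, here is a minimal Python sketch (not the thesis code) of varying the SGD learning rate and the input- and hidden-layer dropout rates of a small feed-forward classifier; the network shape, the value grids, and the synthetic data are illustrative assumptions.

    # Minimal sketch, assuming a small feed-forward speaker classifier;
    # synthetic data stands in for real speech features.
    import itertools
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    N_SPEAKERS, N_FEATS = 10, 39               # assumption: 39-dim features
    X = torch.randn(2000, N_FEATS)             # stand-in for speech features
    y = torch.randint(0, N_SPEAKERS, (2000,))  # stand-in speaker labels

    def make_model(p_in, p_hid):
        # Dropout on the input layer and on the hidden layer, mirroring
        # the two dropout rates varied in the experiments.
        return nn.Sequential(
            nn.Dropout(p_in),
            nn.Linear(N_FEATS, 128), nn.ReLU(),
            nn.Dropout(p_hid),
            nn.Linear(128, N_SPEAKERS),
        )

    best = None
    for lr, p_in, p_hid in itertools.product([0.01, 0.1], [0.0, 0.2], [0.2, 0.5]):
        model = make_model(p_in, p_hid)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(30):                    # short full-batch loop for the sketch
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            # Training accuracy only, for illustration; a real study would
            # evaluate on held-out recordings.
            acc = (model(X).argmax(1) == y).float().mean().item()
        if best is None or acc > best[0]:
            best = (acc, lr, p_in, p_hid)

    print("best accuracy %.3f at lr=%g, input dropout=%g, hidden dropout=%g" % best)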

    Speaker Recognition Systems: A Tutorial

    Abstract. This paper gives an overview of speaker recognition systems. Speaker recognition is the task of automatically recognizing who is speaking by identifying an unknown speaker among several reference speakers using speaker-specific information contained in speech waves. The different classifications of speaker recognition and the speech processing techniques required for performing the recognition task are discussed. The basic modules of a speaker recognition system are outlined and discussed; some of the techniques required to implement each module are discussed in detail and others are mentioned, and the methods are compared with one another. Finally, the paper concludes with a few research trends in speaker recognition for some years to come.
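
    As a concrete illustration of the basic modules such tutorials outline, the following hedged Python sketch implements a toy identification pipeline: one Gaussian mixture model per reference speaker and maximum-likelihood scoring. The feature dimensions and synthetic data are assumptions, not a reference implementation.

    # Minimal sketch of the module chain: features -> per-speaker models
    # -> identification by maximum likelihood.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Enrolment: pretend each reference speaker yields 500 feature frames
    # (stand-ins for real cepstral coefficients).
    speakers = {s: rng.normal(loc=s, scale=1.0, size=(500, 13)) for s in range(3)}

    # Modelling module: fit one GMM per reference speaker.
    models = {s: GaussianMixture(n_components=4, random_state=0).fit(f)
              for s, f in speakers.items()}

    # Identification module: score an unknown utterance against every model
    # and pick the speaker whose model gives the highest average log-likelihood.
    unknown = rng.normal(loc=1, scale=1.0, size=(200, 13))  # frames from speaker 1
    scores = {s: m.score(unknown) for s, m in models.items()}
    print("identified speaker:", max(scores, key=scores.get))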

    Speaker gender recognition system

    Abstract. Automatic gender recognition from speech is one of the fundamental mechanisms in human-machine interaction. Typical application areas of this technology range from gender-targeted advertising to gender-specific IoT (Internet of Things) applications. It can also be used to narrow down the scope of investigations in crime scenarios. There are many possible methods of recognizing the gender of a speaker. In machine learning applications, the first step is to acquire the natural human voice and convert it into a machine-understandable signal. Useful voice features are then extracted and labelled with gender information so that a machine can be trained on them. After that, new input voice can be captured and processed, and the machine is able to extract its features by pattern modelling. In this thesis, a real-time speaker gender recognition system was designed within the Matlab environment. This system can automatically identify the gender of a speaker by voice. The implementation uses voice processing and feature extraction techniques to deal with input speech coming from a microphone or a recorded speech file. The relevant features are extracted, and a machine learning classification method (the Naïve Bayes classifier) is used to distinguish gender from those features; the recognition result with gender information is then displayed. The speaker gender recognition system was evaluated in an experiment with 40 participants (half male and half female) in a fairly small room. The experiment recorded 400 speech samples by speakers from 16 countries in 17 languages. These 400 speech samples were tested by the gender recognition system and showed considerably good performance, with only 29 recognition errors (92.75% accuracy). In comparison, most previous speaker gender recognition systems obtained accuracies of no more than 90%, and only one reached 100% accuracy, with a very limited set of testers. We therefore conclude that the performance of the speaker gender recognition system designed in this thesis is reliable.
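
    The following minimal Python sketch illustrates the classification step described above with a Gaussian Naïve Bayes classifier; the thesis system was built in Matlab, and the two voice features used here (mean pitch in Hz and a generic spectral measure) are illustrative assumptions rather than the thesis feature set.

    # Minimal sketch, assuming labelled training features per gender.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    # Synthetic training data: male voices cluster near 120 Hz mean pitch,
    # female voices near 210 Hz (typical ranges, used only as stand-ins).
    male = np.column_stack([rng.normal(120, 20, 200), rng.normal(0.4, 0.1, 200)])
    female = np.column_stack([rng.normal(210, 25, 200), rng.normal(0.6, 0.1, 200)])
    X = np.vstack([male, female])
    y = np.array(["male"] * 200 + ["female"] * 200)

    clf = GaussianNB().fit(X, y)
    # New input voice: extract the same features, then classify.
    print(clf.predict([[135.0, 0.45], [220.0, 0.62]]))  # -> ['male' 'female']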

    Optimization of data-driven filterbank for automatic speaker verification

    Most speech processing applications use triangular filters spaced on the mel scale for feature extraction. In this paper, we propose a new data-driven filter design method which optimizes filter parameters from given speech data. First, we introduce a frame-selection based approach for developing a speech-signal-based frequency warping scale. Then, we propose a new method for computing the filter frequency responses using principal component analysis (PCA). The main advantage of the proposed method over recently introduced deep learning based methods is that it requires a very limited amount of unlabeled speech data. We demonstrate that the proposed filterbank has more speaker-discriminative power than the commonly used mel filterbank as well as an existing data-driven filterbank. We conduct automatic speaker verification (ASV) experiments on different corpora using various classifier back-ends, and show that the acoustic features created with the proposed filterbank are better than existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases. In experiments with VoxCeleb1 and the popular i-vector back-end, we observe a 9.75% relative improvement in equal error rate (EER) over MFCCs; the relative improvement is 4.43% with the recently introduced x-vector system. We obtain further improvement by fusing the proposed method with the standard MFCC-based approach. Comment: Published in Digital Signal Processing journal (Elsevier).
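
    To illustrate the PCA step, here is a hedged Python sketch that derives filter frequency responses from the principal components of log power spectra; the paper's frame-selection step and learned warping scale are omitted, and all data and dimensions are assumptions, not the authors' exact procedure.

    # Hedged sketch: PCA components of log power spectra used as
    # data-driven filter shapes, then MFCC-style cepstra.
    import numpy as np
    from sklearn.decomposition import PCA
    from scipy.fft import dct

    rng = np.random.default_rng(0)
    n_frames, n_fft_bins, n_filters = 1000, 257, 20
    # Stand-in for power spectra of selected speech frames (|FFT|^2 per frame).
    power_spectra = rng.gamma(shape=2.0, scale=1.0, size=(n_frames, n_fft_bins))

    # PCA on the log spectra; each component acts as one filter's response.
    pca = PCA(n_components=n_filters).fit(np.log(power_spectra))
    filterbank = np.abs(pca.components_)           # shape: (n_filters, n_fft_bins)

    # Cepstral features: filterbank energies -> log -> DCT, as with MFCCs.
    energies = power_spectra @ filterbank.T
    features = dct(np.log(energies), type=2, axis=1, norm="ortho")
    print(features.shape)                           # (1000, 20)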

    Subband spectral features for speaker recognition.

    Tam Yuk Yin. Thesis (M.Phil.), Chinese University of Hong Kong, 2004. Includes bibliographical references; abstracts in English and Chinese. Contents:
    Chapter 1: Introduction (Biometrics for User Authentication; Voice-based User Authentication; Motivation and Focus of This Work; Thesis Outline)
    Chapter 2: Fundamentals of Automatic Speaker Recognition (Speech Production; Features of Speaker's Voice in Speech Signal; Basics of Speaker Recognition; Existing Approaches of Speaker Recognition: Feature Extraction, Mel-Frequency Cepstral Coefficients (MFCC), Speaker Modeling, Gaussian Mixture Model (GMM), Speaker Identification (SID))
    Chapter 3: Data Collection and Baseline System (Data Collection; Baseline System: Experimental Set-up, Results and Analysis)
    Chapter 4: Subband Spectral Envelope Features (Spectral Envelope Features; Subband Spectral Envelope Features; Feature Extraction Procedures; SID Experiments: Experimental Set-up, Results and Analysis)
    Chapter 5: Fusion of Subband Features (Model Level Fusion; Feature Level Fusion; Discussion)
    Chapter 6: Utterance-Level SID with Text-Dependent Weights (Motivation; Utterance-Level SID; Baseline System; Text-Dependent Weights; Text-Dependent Feature Weights; Text-Dependent Weights Applied in Score Combination and Subband Features; Discussion)
    Chapter 7: Conclusions and Suggested Future Work
    Appendix 1: Speech Content for Data Collection
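
    As a rough illustration of the subband idea named in the outline above, the following Python sketch splits a frame's log spectrum into contiguous subbands and extracts a few cepstral coefficients per subband, which could then be fused at the feature or model level; the band layout and coefficient counts are assumptions, not the thesis configuration.

    # Minimal sketch, assuming contiguous equal-width subbands.
    import numpy as np
    from scipy.fft import dct

    rng = np.random.default_rng(0)
    log_spectrum = np.log(rng.gamma(2.0, 1.0, size=257))  # stand-in log spectrum

    n_subbands, n_ceps = 4, 5
    bands = np.array_split(log_spectrum, n_subbands)      # contiguous subbands
    # Per-subband cepstra: DCT of the log spectrum restricted to each band.
    subband_features = np.concatenate(
        [dct(b, type=2, norm="ortho")[:n_ceps] for b in bands]
    )
    print(subband_features.shape)                          # (20,) = 4 bands x 5 ceps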

    Arabic Isolated Word Speaker Dependent Recognition System

    In this thesis we design a new Arabic isolated-word, speaker-dependent recognition system based on a combination of several feature extraction and classification techniques, where the system combines the methods' outputs using a voting rule. The system is implemented with a graphical user interface under Matlab on a G62 Core i3 / 2.26 GHz laptop. The dataset used in this system includes 40 Arabic words recorded in a calm environment with 5 different speakers using the laptop microphone. Each speaker read each word 8 times; 5 repetitions are used for training and the remaining 3 in the test phase. First, in the preprocessing step, we use an endpoint detection technique based on energy and zero-crossing rates to identify the start and the end of each word and remove silences, and then a discrete wavelet transform to remove noise from the signal. To accelerate the system and reduce execution time, the system first recognizes the speaker and loads only the reference model of that user. We compared 5 different methods, namely pairwise Euclidean distance with Mel-frequency cepstral coefficients (MFCC), Dynamic Time Warping (DTW) with formant features, Gaussian Mixture Model (GMM) with MFCC, MFCC+DTW, and Itakura distance with Linear Predictive Coding (LPC) features, and obtained recognition rates of 85.23%, 57%, 87%, 90%, and 83%, respectively. To improve the accuracy of the system, we tested several combinations of these 5 methods. We found that the best combination is MFCC | Euclidean + Formant | DTW + MFCC | DTW + LPC | Itakura, with an accuracy of 94.39% but a large computation time of 2.9 seconds. To reduce the computation time of this hybrid, we compared several sub-combinations of it and found that the best trade-off between performance and computation time is obtained by first combining MFCC | Euclidean + LPC | Itakura and, only when the two methods do not agree, adding Formant | DTW + MFCC | DTW to the combination; the average computation time is thereby halved to 1.56 seconds and the system accuracy improves to 94.56%. Finally, the proposed system is competitive with previous research.
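
    The endpoint detection step described above can be sketched as follows in Python: a frame is kept as speech when its short-time energy is high, or when its zero-crossing rate is high and its energy is above a small floor (to catch weak unvoiced sounds); all thresholds and frame sizes are illustrative assumptions, not the thesis settings.

    # Minimal sketch of energy + zero-crossing-rate endpoint detection.
    import numpy as np

    def detect_endpoints(signal, frame_len=400, energy_thr=0.01, zcr_thr=0.15):
        n = len(signal) // frame_len
        frames = signal[: n * frame_len].reshape(n, frame_len)
        energy = (frames ** 2).mean(axis=1)                  # short-time energy
        zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
        # Speech if energetic, or if high ZCR with energy above a small floor
        # (the floor keeps low-level background noise from triggering).
        speech = (energy > energy_thr) | ((zcr > zcr_thr) & (energy > 0.1 * energy_thr))
        idx = np.flatnonzero(speech)
        if idx.size == 0:
            return None
        return idx[0] * frame_len, (idx[-1] + 1) * frame_len  # sample indices

    rng = np.random.default_rng(0)
    sig = np.concatenate([rng.normal(0, 0.005, 4000),         # leading silence
                          rng.normal(0, 0.3, 8000),           # "word"
                          rng.normal(0, 0.005, 4000)])        # trailing silence
    print(detect_endpoints(sig))                              # ~ (4000, 12000)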

    Open-set Speaker Identification

    This study is motivated by the growing need for effective extraction of intelligence and evidence from audio recordings in the fight against crime, a need made ever more apparent by the recent expansion of criminal and terrorist organisations. The main focus is to enhance the open-set speaker identification process within speaker identification systems, which are affected by noisy audio data obtained under uncontrolled environments such as in the street, in restaurants, or in other places of business. Consequently, two investigations are initially carried out: the effects of environmental noise on the accuracy of open-set speaker recognition, thoroughly covering conditions relevant to the considered application areas, such as variable training data length, background noise, and real-world noise; and the effects of short and varied duration reference data in open-set speaker recognition. The investigations led to a novel method termed "vowel boosting" to enhance the reliability of speaker identification when operating with speech data of varied duration under uncontrolled conditions. Vowels naturally contain more speaker-specific information; by emphasising this natural phenomenon in speech data, better identification performance is enabled. The traditional state-of-the-art GMM-UBM and i-vector systems are used to evaluate "vowel boosting". The proposed approach boosts the impact of the vowels on the speaker scores, which improves the recognition accuracy for the specific case of open-set identification with short and varied durations of speech material.
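
    A hedged Python sketch of the vowel-boosting idea: when per-frame scores are accumulated into an utterance-level speaker score, frames flagged as vowels receive a larger weight; the weight value and the vowel flags are assumptions for illustration, not the thesis's actual detector or weighting scheme.

    # Minimal sketch: weighted accumulation of frame-level log-likelihoods,
    # with vowel frames counting more toward the utterance score.
    import numpy as np

    def boosted_score(frame_loglikes, is_vowel, vowel_weight=2.0):
        """Weighted mean of per-frame log-likelihoods; vowel frames count more."""
        w = np.where(is_vowel, vowel_weight, 1.0)
        return float(np.sum(w * frame_loglikes) / np.sum(w))

    rng = np.random.default_rng(0)
    loglikes = rng.normal(-45.0, 3.0, size=300)   # per-frame scores vs one model
    vowels = rng.random(300) < 0.4                # stand-in vowel/non-vowel flags
    print(boosted_score(loglikes, vowels))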