
    Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

    In this paper, we explore the encoding/pooling layer and the loss function in an end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer aggregates the variable-length input sequence into an utterance-level representation. Beyond basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. As for the loss function for open-set speaker verification, center loss and angular softmax loss are introduced into the end-to-end system to obtain more discriminative speaker embeddings. Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layers and loss functions.
    Comment: Accepted for Speaker Odyssey 201
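    The self-attentive pooling this abstract describes weights each frame before averaging, instead of weighting all frames equally. A minimal NumPy sketch of the idea (the parameter names `w`, `b`, `u` and the single-head form are illustrative assumptions, not the paper's exact formulation):

    ```python
    import numpy as np

    def self_attentive_pooling(h, w, b, u):
        """Aggregate frame-level features h (T x D) into one utterance-level
        vector via a learned attention score per frame.
        w (D x A), b (A,), u (A,) are hypothetical attention parameters."""
        e = np.tanh(h @ w + b) @ u       # one scalar score per frame, shape (T,)
        a = np.exp(e - e.max())
        a = a / a.sum()                  # softmax attention weights over frames
        return a @ h                     # weighted average, shape (D,)

    rng = np.random.default_rng(0)
    T, D, A = 50, 8, 4                   # frames, feature dim, attention dim
    h = rng.normal(size=(T, D))
    pooled = self_attentive_pooling(h, rng.normal(size=(D, A)),
                                    rng.normal(size=A), rng.normal(size=A))
    print(pooled.shape)                  # one fixed-size vector per utterance
    ```

    Temporal average pooling is the special case where every frame receives the same weight 1/T.
    
    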

    Evaluation of preprocessors for neural network speaker verification


    Phoneme Based Speaker Verification System Based on Two Stage Self-Organizing Map Design

    Speaker verification is a pattern recognition task that authenticates a person by his or her voice. This thesis deals with a relatively new classification technique, the self-organizing map (SOM). As an unsupervised-learning artificial neural network, the self-organizing map is rarely used as the final classification step in pattern recognition tasks due to its relatively low accuracy. A two-stage self-organizing map design has been implemented in this thesis and showed improved results over the conventional single-stage design. For speech feature extraction, this thesis does not introduce any new technique: a well-studied method, linear prediction analysis (LPA), is used. Coefficients derived from linear predictive analysis are extracted from the segmented raw speech signal to train and test the front-stage self-organizing map. Unlike other multistage or hierarchical self-organizing map designs, this thesis uses the residual vectors generated by the front-stage self-organizing map to train and test the second-stage self-organizing map. The results showed that by breaking the classification task into two or more levels of finer resolution, an improvement of more than 5% can be obtained. Moreover, the computation time is also greatly reduced.
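    The linear prediction front-end this thesis builds on fits each speech frame with a short-term predictor and keeps both the coefficients and the prediction residual. A simplified NumPy sketch of that step, solving the Yule-Walker equations from the frame's autocorrelation (an illustration of LP analysis in general, not the thesis's own code):

    ```python
    import numpy as np

    def lpc_and_residual(frame, order=10):
        """Estimate LPC coefficients for one frame and return the
        prediction residual e[n] = s[n] - sum_k a_k s[n-k]."""
        # autocorrelation at lags 0..len(frame)-1
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        # Toeplitz normal equations R a = r[1..order]
        R = np.array([[r[abs(i - j)] for j in range(order)]
                      for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])
        # predicted signal: convolution with [0, a1, ..., ap]
        pred = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
        return a, frame - pred

    t = np.arange(240)
    frame = np.sin(0.3 * t) + 0.05 * np.random.default_rng(1).normal(size=240)
    a, res = lpc_and_residual(frame)
    # for a strongly predictable signal, residual energy drops sharply
    print(res.var() < frame.var())
    ```

    In the two-stage design, such residual vectors (from the front-stage SOM rather than from LP analysis directly) become the training data for the second-stage map.
    
    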

    Speaker Identification Using Wavelet Packet Transform and Feed Forward Neural Network

    It has been known for a long time that speakers can be identified from their voices. In this work we introduce a speaker identification system using the wavelet packet transform, a form of wavelet analysis, for feature extraction, and a neural network for classification. The system is applied to ten speakers. Instead of framing the signal, the wavelet packet transform is applied to the whole signal, which reduces the computation time. The speech signal is decomposed into 24 sub-bands according to the Mel frequency scale; the log energy is then taken for each band, and finally the discrete cosine transform is applied to these values. The resulting coefficients are taken as features for identifying a speaker among many. For classification, a feed-forward multi-layer perceptron trained by backpropagation is used to learn and classify the speakers' feature vectors, with a separate network constructed for each speaker of interest. Training and testing used isolated words in three cases, viz. one-, two-, and three-syllable words, recorded from lab colleagues using a low-cost microphone.
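    The last two steps of this front-end (log energy per sub-band, then a DCT) can be sketched compactly. The 24 band energies here are just placeholder numbers standing in for the Mel-spaced wavelet packet tree outputs; the DCT-II is written out explicitly so the block needs only NumPy:

    ```python
    import numpy as np

    def band_log_energy_dct(band_energies, n_coeffs=12):
        """Log energy per sub-band followed by a DCT-II over the bands,
        yielding cepstrum-like features (as in the abstract's pipeline).
        n_coeffs=12 is an assumed truncation, not stated in the abstract."""
        loge = np.log(np.asarray(band_energies) + 1e-12)  # avoid log(0)
        n = len(loge)
        k = np.arange(n_coeffs)[:, None]
        m = np.arange(n)[None, :]
        basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))  # DCT-II basis
        return basis @ loge

    # 24 stand-in band energies, one per Mel-scale sub-band
    feats = band_log_energy_dct(np.linspace(1.0, 5.0, 24))
    print(feats.shape)   # a compact feature vector per utterance
    ```

    The DCT decorrelates the log energies, which is why the same step appears in classical MFCC extraction.
    
    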

    Speaker Identification Using the Orthogonal Coiflet Wavelet

    The wavelet transform has become a popular tool in signal processing, for signals such as images and speech. It is well suited to analyzing non-stationary signals (such as speech) and is good at localizing a signal in both frequency and time. This study uses an orthogonal wavelet, the order-4 Coiflet, at decomposition levels 10 and 15; pattern matching is performed with a multi-layer perceptron neural network. The experimental results show that the wavelet transform is a reliable feature extraction method for speaker identification, competitive with other tools, and that the Coiflet wavelet achieves a recognition rate of 84%.

    Speaker Identification Using a Combination of Different Parameters as Feature Inputs to an Artificial Neural Network Classifier

    This paper presents a technique using artificial neural networks (ANNs) for speaker identification that achieves a better success rate than other techniques. It uses both power spectral densities (PSDs) and linear prediction coefficients (LPCs) as feature inputs to a self-organizing feature map to achieve better identification performance. Results for speaker identification with different methods are presented and compared.

    Robust speaker identification using artificial neural networks

    This research focuses on recognizing speakers from their speech samples. Numerous text-dependent and text-independent speaker recognition algorithms have been developed to date. In this thesis, we concentrate on recognizing the speaker from fixed text, i.e., text-dependent recognition; the possibility of extending the method to variable text, i.e., text-independent recognition, is also analyzed. Different feature extraction algorithms are employed, and their performance with artificial neural networks as the classifier on a fixed training set is analyzed. We find a way to combine these individual feature extraction algorithms by incorporating their interdependence. The efficiency of these algorithms is determined after the input speech is classified using the backpropagation algorithm; a special case of backpropagation that improves the efficiency of the classification is also discussed.

    Efficient Approaches for Voice Change and Voice Conversion Systems

    In this thesis, the study and design of voice change and voice conversion systems are presented. A voice change system manipulates a speaker's voice so that it is no longer perceived as being spoken by that speaker, while a voice conversion system modifies a speaker's voice so that it is perceived as being spoken by a target speaker. The thesis has two parts. The first develops a low-latency, low-complexity voice change system (comprising frequency/pitch scale modification and formant scale modification algorithms) that could run on the smartphones of 2012, with their very limited computational capability. Although some low-complexity voice change algorithms had been proposed and studied, real-time implementations were rare. According to the experimental results, the proposed voice change system achieves the same quality as the baseline approach while requiring much less computation and satisfying the real-time requirement; it has been implemented in C and released as a commercial software application. The second part investigates a novel low-complexity voice conversion system (from a source speaker A to a target speaker B) that improves perceptual quality and speaker identity without introducing large processing latencies. The proposed scheme directly manipulates the spectrum using an effective, physically motivated method, Continuous Frequency Warping and Magnitude Scaling (CFWMS), to guarantee high perceptual naturalness and quality. In addition, a trajectory limitation strategy prevents frame-by-frame discontinuities to further enhance speech quality. Experimental results show that the proposed method outperforms conventional baseline solutions in both objective and subjective tests.
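    The core of frequency warping is resampling a magnitude spectrum so that energy at frequency f moves to warp*f. A toy NumPy illustration of that single idea, using linear interpolation (a simplified stand-in, not the thesis's CFWMS algorithm, which also scales magnitudes and constrains trajectories):

    ```python
    import numpy as np

    def warp_spectrum(mag, warp):
        """Resample a magnitude spectrum so bin f maps to warp*f:
        output bin i reads the input at i/warp."""
        n = len(mag)
        src = np.arange(n) / warp
        return np.interp(src, np.arange(n), mag, right=0.0)

    # spectrum of a pure tone, then warp frequencies upward by 20%
    mag = np.abs(np.fft.rfft(np.sin(2 * np.pi * 0.1 * np.arange(256))))
    warped = warp_spectrum(mag, 1.2)
    print(np.argmax(mag), np.argmax(warped))   # the spectral peak moves up
    ```

    A full converter would apply such a warp (and magnitude scaling) frame by frame and resynthesize, with the trajectory limitation smoothing the warp parameters across frames.
    
    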

    Identification of Age Voiceprint Using Machine Learning Algorithms

    The voice is considered a biometric trait, since information can be extracted from the speech signal that identifies the person speaking in a given recording. Fingerprints, iris, DNA, or speech can all be used in biometric systems, with speech being the most intuitive, basic, and easy-to-capture characteristic. Speech-based services are widely used in the banking and mobile sectors, although these services do not employ voice recognition to identify customers; as a result, the possibility of using them under a false identity always exists. To reduce the risk of fraudulent identification, voice-based recognition systems must be designed. In this research, Mel-frequency cepstral coefficient (MFCC) features were extracted from the collected voice samples to train five machine learning algorithms: decision tree, random forest (RF), support vector machine (SVM), k-nearest neighbors (k-NN), and multi-layer perceptron (MLP). Accuracy, precision, recall, specificity, and F1 score were used as classification performance metrics to compare these algorithms. According to the study's findings, the MLP approach achieved a high classification accuracy of 91%, while RF appears to perform better on the other metrics. This demonstrates how such classification algorithms can support voice-based biometric systems.
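    The five metrics this study compares all derive from the confusion-matrix counts. A small self-contained sketch for the binary case (the study applies these to multi-class speaker data; the toy labels below are illustrative):

    ```python
    import numpy as np

    def binary_metrics(y_true, y_pred):
        """Accuracy, precision, recall, specificity, and F1 score
        from true/predicted binary labels."""
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        acc = (tp + tn) / len(y_true)
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)              # recall = sensitivity
        spec = tn / (tn + fp)             # specificity
        f1 = 2 * prec * rec / (prec + rec)
        return acc, prec, rec, spec, f1

    y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
    print(binary_metrics(y_true, y_pred))  # each metric is 0.75 here
    ```

    Reporting specificity alongside recall matters for verification-style tasks, where rejecting impostors (true negatives) is as important as accepting genuine speakers.
    
    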