Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
In this paper, we explore the encoding/pooling layer and loss function in the
end-to-end speaker and language recognition system. First, a unified and
interpretable end-to-end system for both speaker and language recognition is
developed. It accepts variable-length input and produces an utterance level
result. In the end-to-end system, the encoding layer plays a role in
aggregating the variable-length input sequence into an utterance level
representation. Besides the basic temporal average pooling, we introduce a
self-attentive pooling layer and a learnable dictionary encoding layer to get
the utterance level representation. For the loss function in open-set speaker verification, center loss and angular softmax loss are introduced into the end-to-end system to obtain more discriminative speaker embeddings. Experimental results on the Voxceleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layers and loss functions.
Comment: Accepted for Speaker Odyssey 201
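The self-attentive pooling idea can be sketched in a few lines: each frame-level vector gets a scalar attention score, the scores are softmax-normalized over time, and the utterance-level representation is the weighted sum of the frames. This is a minimal illustration, not the paper's implementation; scoring by a plain dot product with a vector `v` is a simplifying assumption (learned score functions in practice usually include a hidden layer and nonlinearity).

```python
import math

def self_attentive_pooling(frames, v):
    """Aggregate variable-length frame vectors into one utterance vector.

    frames: list of equal-length feature vectors (one per frame)
    v:      attention parameter vector of the same dimension -- in a real
            system this is learned; here it is simply given.
    """
    # Scalar score per frame: dot product with the attention vector.
    scores = [sum(f_i * v_i for f_i, v_i in zip(f, v)) for f in frames]
    # Softmax over frames (subtract the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of frames -> fixed-size utterance representation.
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
```

When all scores are equal, the weights collapse to 1/T and this reduces to the basic temporal average pooling the paper uses as its baseline.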
Phoneme Based Speaker Verification System Based on Two Stage Self-Organizing Map Design
Speaker verification is a pattern recognition task that authenticates a person by his or her voice. This thesis deals with a relatively new classification technique: the self-organizing map (SOM). The self-organizing map, an unsupervised-learning artificial neural network, is rarely used as the final classification step in pattern recognition tasks due to its relatively low accuracy. A two-stage self-organizing map design has been implemented in this thesis and shows improved results over the conventional single-stage design. For speech feature extraction, this thesis does not introduce any new technique: the well-studied linear prediction analysis (LPA) is used. Linear-prediction-derived coefficients are extracted from segmented raw speech signals to train and test the front-stage self-organizing map. Unlike other multistage or hierarchical self-organizing map designs, this thesis uses the residual vectors generated by the front-stage self-organizing map to train and test the second-stage self-organizing map. The results show that by breaking the classification task into two levels of finer resolution, an improvement of more than 5% can be obtained. Moreover, the computation time is also greatly reduced.
Biologically inspired speaker verification
Speaker verification is an active research problem that has been addressed using a variety of classification techniques. In general, however, methods inspired by the human auditory system tend to show better verification performance than others. In this thesis, three biologically inspired speaker verification algorithms are presented.
Speaker Identification Using Wavelet Packet Transform and Feed Forward Neural Network
It has been known for a long time that speakers can be identified from their voices. In this work we introduce a speaker identification system using the wavelet packet transform, a form of wavelet analysis, for feature extraction and a neural network for classification. The system is applied to ten speakers.
Instead of framing the signal, the wavelet packet transform is applied to the whole signal, which reduces the calculation time. The speech signal is decomposed into 24 sub-bands according to the Mel frequency scale. Then, for each of these bands, the log energy is taken. Finally, the discrete cosine transform is applied to these band energies, and the resulting coefficients are taken as features for identifying a speaker among many speakers.
For the classification task, a feed-forward multi-layer perceptron trained by backpropagation is used to learn and classify the speaker feature vectors. We propose to construct a single neural network for each speaker of interest.
Training and testing of isolated words in three cases, viz. one-, two-, and three-syllable words, were carried out by recording these words from lab colleagues using a low-cost microphone.
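The tail of the feature pipeline above (log energy per sub-band, then a discrete cosine transform across the bands) can be sketched as follows; the per-band energies are assumed to be precomputed from the 24 wavelet-packet Mel-scale sub-bands:

```python
import math

def band_features(band_energies, n_coeffs):
    """Log energy per sub-band followed by a DCT-II across the bands.

    band_energies: one energy value per sub-band (24 in this system)
    n_coeffs:      number of cepstral-like coefficients to keep
    """
    # Log compression of each band energy (epsilon guards against log(0)).
    logs = [math.log(e + 1e-12) for e in band_energies]
    n = len(logs)
    # DCT-II across the band axis decorrelates the log energies.
    return [
        sum(logs[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(n_coeffs)
    ]
```

A flat energy profile yields only a nonzero zeroth coefficient; the higher coefficients capture the shape of the spectral envelope across the bands.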
Speaker Identification Using the Orthogonal Coiflet Wavelet
The wavelet transform is becoming a popular tool in signal processing, for example for images and speech. It is well suited to analyzing non-stationary signals (such as speech) and is good at localizing a signal in both frequency and time. This study uses an orthogonal wavelet, the order-4 Coiflet, with decomposition levels of 10 and 15; pattern matching is performed with a multi-layer perceptron neural network. The experimental results show that the wavelet transform is a reliable feature extraction method for speaker identification, able to compete with other tools, and that the Coiflet wavelet achieves a recognition rate of 84%.
Speaker Identification Using a Combination of Different Parameters as Feature Inputs to an Artificial Neural Network Classifier
This paper presents a technique using artificial neural networks (ANNs) for speaker identification that achieves a better success rate than other techniques. The approach uses both power spectral densities (PSDs) and linear prediction coefficients (LPCs) as feature inputs to a self-organizing feature map to achieve better identification performance. Results for speaker identification with different methods are presented and compared.
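Of the two feature types, the LPC computation is compact enough to sketch. The following is the textbook autocorrelation method with the Levinson-Durbin recursion, shown for illustration rather than taken from the paper:

```python
def lpc(signal, order):
    """Linear prediction coefficients via autocorrelation + Levinson-Durbin."""
    n = len(signal)
    # Autocorrelation lags r[0..order].
    r = [sum(signal[t] * signal[t + k] for t in range(n - k)) for k in range(order + 1)]
    a = [0.0] * (order + 1)   # a[0] is implicitly 1 in the predictor
    e = r[0]                  # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1 - k * k)
    return a[1:]  # prediction coefficients a_1..a_order
```

For a signal that decays geometrically as x[t] = 0.5 x[t-1], a first-order fit recovers a coefficient close to 0.5, which is a quick sanity check for the recursion.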
Robust speaker identification using artificial neural networks
This research focuses on recognizing speakers from their speech samples. Numerous text-dependent and text-independent algorithms have been developed to recognize a speaker from his or her speech. In this thesis, we concentrate on recognizing the speaker from fixed text, i.e., text-dependent recognition. The possibility of extending the method to variable text, i.e., text-independent recognition, is also analyzed. Different feature extraction algorithms are employed, and their performance with artificial neural networks as the classifier on a fixed training set is analyzed. We find a way to combine these individual feature extraction algorithms by incorporating their interdependence. The efficiency of these algorithms is determined after the input speech is classified using the backpropagation algorithm of artificial neural networks. A special case of the backpropagation algorithm that improves classification efficiency is also discussed.
Efficient Approaches for Voice Change and Voice Conversion Systems
In this thesis, the study and design of voice change and voice conversion systems are presented. A voice change system manipulates a speaker's voice so that it is perceived as not being spoken by that speaker, while a voice conversion system modifies a speaker's voice so that it is perceived as being spoken by a target speaker.
This thesis comprises two main parts. The first part develops a low-latency, low-complexity voice change system (including frequency/pitch scale modification and formant scale modification algorithms) that could be executed on the smartphones of 2012, which had very limited computational capability. Although some low-complexity voice change algorithms have been proposed and studied, real-time implementations are very rare. According to the experimental results, the proposed voice change system achieves the same quality as the baseline approach while requiring much less computation and meeting the real-time requirement. Moreover, the proposed system has been implemented in the C language and released as a commercial software application. The second part of this thesis investigates a novel low-complexity voice conversion system (i.e., from a source speaker A to a target speaker B) that improves perceptual quality and speaker identity without introducing large processing latencies. The proposed scheme directly manipulates the spectrum using an effective and physically motivated method, Continuous Frequency Warping and Magnitude Scaling (CFWMS), to guarantee high perceptual naturalness and quality. In addition, a trajectory limitation strategy is proposed to prevent frame-by-frame discontinuities and further enhance the speech quality. The experimental results show that the proposed method outperforms the conventional baseline solutions in both objective and subjective tests.
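The abstract does not detail CFWMS, so the sketch below only illustrates the general notion of warping a magnitude spectrum along the frequency axis; the warp function passed in is purely a hypothetical example, not the thesis method:

```python
def warp_spectrum(mags, warp):
    """Resample a magnitude spectrum along a warped frequency axis.

    mags: magnitudes at uniformly spaced frequency bins
    warp: function mapping a normalized target frequency in [0, 1]
          to the normalized source frequency to read from
    """
    n = len(mags)
    out = []
    for k in range(n):
        # Source position (in bins) for this output bin.
        pos = warp(k / (n - 1)) * (n - 1)
        i = min(int(pos), n - 2)
        frac = pos - i
        # Linear interpolation between neighboring bins.
        out.append((1 - frac) * mags[i] + frac * mags[i + 1])
    return out
```

Passing the identity warp returns the spectrum unchanged, while a warp such as `lambda f: 0.5 * f` stretches the lower half of the spectrum across the whole axis, the kind of manipulation that shifts perceived formant positions.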
Identification of Age Voiceprint Using Machine Learning Algorithms
The voice is considered a biometric trait, since information can be extracted from the speech signal that identifies the person speaking in a specific recording. Fingerprints, iris, DNA, or speech can be used in biometric systems, with speech being the most intuitive, basic, and easy-to-capture characteristic. Speech-based services are widely used in the banking and mobile sectors, although these services do not employ voice recognition to identify consumers. As a result, there is always the possibility of using these services under a false identity. To reduce the possibility of fraudulent identification, voice-based recognition systems must be designed. In this research, Mel Frequency Cepstral Coefficient (MFCC) features were extracted from the gathered voice samples to train five different machine learning algorithms, namely, the decision tree, random forest (RF), support vector machine (SVM), k-nearest neighbors (k-NN), and multi-layer perceptron (MLP). Accuracy, precision, recall, specificity, and F1 score were used as classification performance metrics to compare these algorithms. According to the findings of the study, the MLP approach achieved a high classification accuracy of 91%, while RF performed better on the other metrics. These findings demonstrate how such classification algorithms may assist voice-based biometric systems.
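As an illustration of the comparison pipeline (not the study's code), a minimal k-NN classifier plus the accuracy metric over precomputed MFCC-like feature vectors might look like this; the feature vectors and labels are assumed inputs:

```python
def knn_predict(train_x, train_y, x, k=3):
    """Predict a label by majority vote among the k nearest training vectors."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # Rank training examples by distance to the query vector.
    ranked = sorted(range(len(train_x)), key=lambda i: sqdist(train_x[i], x))
    votes = [train_y[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, one of the metrics used to compare classifiers."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The study's other metrics (precision, recall, specificity, F1) follow the same pattern of counting agreement between predicted and true labels, broken down per class.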