
    Adversarial Network Bottleneck Features for Noise Robust Speaker Verification

    In this paper, we propose a noise-robust bottleneck feature representation generated by an adversarial network (AN). The AN consists of two cascade-connected networks, an encoding network (EN) and a discriminative network (DN). Mel-frequency cepstral coefficients (MFCCs) of clean and noisy speech are used as input to the EN, and the output of the EN is used as the noise-robust feature. The EN and DN are trained in turn: when training the DN, noise types are used as the training labels; when training the EN, all labels are set to the same value, i.e., the clean-speech label, which aims to make the AN features invariant to noise and thus achieve noise robustness. We evaluate the performance of the proposed feature on a Gaussian Mixture Model-Universal Background Model based speaker verification system, and compare it with MFCC features of speech enhanced by short-time spectral amplitude minimum mean square error (STSA-MMSE) and deep neural network-based speech enhancement (DNN-SE) methods. Experimental results on the RSR2015 database show that the proposed AN bottleneck feature (AN-BN) dramatically outperforms the STSA-MMSE and DNN-SE based MFCCs for different noise types and signal-to-noise ratios. Furthermore, the AN-BN feature also improves speaker verification performance under the clean condition.
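The alternating label scheme described above can be sketched as follows. This is a hedged illustration, not the authors' code: the constant `CLEAN` and the function name are assumptions; only the labeling logic comes from the abstract (true noise-type labels for the DN phase, all labels forced to the clean class for the EN phase).

```python
# Hedged sketch of the alternating adversarial label scheme: the
# discriminative network (DN) is trained against the true noise-type
# labels, while the encoding network (EN) is trained with every label
# forced to the clean-speech class, pushing its bottleneck output to be
# noise-invariant. CLEAN = 0 is an assumed index, not from the paper.

CLEAN = 0  # assumed index of the clean-speech class

def training_labels(noise_types, phase):
    """Return the label list used in one training phase.

    noise_types : per-utterance noise-type indices (CLEAN = clean speech)
    phase       : "DN" trains the discriminator, "EN" trains the encoder
    """
    if phase == "DN":
        return list(noise_types)           # discriminate real noise types
    if phase == "EN":
        return [CLEAN] * len(noise_types)  # force the clean label everywhere
    raise ValueError("phase must be 'DN' or 'EN'")
```

For example, `training_labels([0, 1, 2], "EN")` returns `[0, 0, 0]`, so the encoder is rewarded only when the discriminator can no longer tell the noise types apart.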

    Improving the Robustness of i-Vectors with Model-Compensated First-Order Statistics

    Speaker recognition systems have achieved significant improvements over the last decade, largely due to the performance of i-vectors. Despite these achievements, mismatch between training and test data considerably affects recognition performance. In this paper, a solution is offered to increase robustness against additive noise by inserting model compensation techniques into the i-vector extraction scheme. For stationary noises, model compensation techniques produce highly robust systems. Parallel Model Compensation and Vector Taylor Series are considered state-of-the-art model compensation techniques. By applying these methods to the first-order statistics, a noisy total variability space training is obtained, which reduces the mismatch caused by additive noise. All other parts of the conventional i-vector scheme remain unchanged, such as total variability matrix training, i-vector dimensionality reduction, and i-vector scoring. The proposed method was tested with four different noise types at several signal-to-noise ratios (SNRs) from -6 dB to 18 dB in 6 dB steps. Large reductions in equal error rate were achieved with both methods, even at the lowest SNR levels. On average, the proposed approach produced more than a 50% relative reduction in equal error rate.
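The first-order statistics the compensation is applied to can be sketched as below. This is an assumed simplification, not the paper's exact PMC/VTS mathematics: the mean shift in `compensate_means` is a crude stand-in for the actual compensation, and only the Baum-Welch statistic formula is standard.

```python
import numpy as np

# Hedged sketch: first-order Baum-Welch statistics are computed around
# UBM component means, and those means can be noise-compensated before
# the statistics feed total-variability training. compensate_means is a
# deliberately crude stand-in for Parallel Model Compensation / Vector
# Taylor Series, which the paper actually uses.

def first_order_stats(feats, posteriors, means):
    """F_c = sum_t gamma_t(c) * (x_t - mu_c), per UBM component c.

    feats      : (T, D) frame features
    posteriors : (T, C) per-frame component posteriors gamma_t(c)
    means      : (C, D) UBM component means mu_c
    """
    centered = feats[:, None, :] - means[None, :, :]      # (T, C, D)
    return np.einsum('tc,tcd->cd', posteriors, centered)  # (C, D)

def compensate_means(means, noise_mean):
    """Toy additive-noise compensation: shift each component mean."""
    return means + noise_mean[None, :]
```

Because only the statistics change, everything downstream (total variability matrix training, i-vector extraction, scoring) is untouched, matching the abstract's claim.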

    Robust Speaker Recognition Based on Latent Variable Models

    Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions. Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure for supervectors is presented, by which qualitative insight into the information being captured is obtained. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. A subset of the entries of the dictionary is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy. An alternative way to handle undesired variability in supervector representations is to first project them into a lower-dimensional space and then to model them in the reduced subspace. This low-dimensional projection is known as an "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance.
These approaches lack closed-form solutions and are therefore hard to analyze. Moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably with those of competing approaches. Also, our approach has closed-form solutions and scales gracefully to large datasets. Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions.
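The "transform, then linear-Gaussian model" recipe can be illustrated with length normalization, a widely used i-vector Gaussianizing transformation. Whether this matches either of the thesis' two transformations is an assumption; the code only demonstrates the general idea of mapping i-vectors onto the unit sphere before fitting a simple Gaussian back-end.

```python
import numpy as np

# Hedged sketch: length normalization centers each i-vector and scales
# it to unit Euclidean norm, after which a plain linear-Gaussian model
# (e.g. factor analysis / Gaussian PLDA) fits the data far better than
# it fits raw, heavy-tailed i-vectors. This stands in for, and may not
# equal, the thesis' exact transformations.

def length_normalize(ivectors, mean=None):
    """Center i-vectors and project each onto the unit sphere."""
    if mean is None:
        mean = ivectors.mean(axis=0)  # estimate the center from the data
    centered = ivectors - mean
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.maximum(norms, 1e-12)  # guard against zero norm
```

In deployment the centering mean would be estimated on the training set and reused at test time, so enrollment and test i-vectors pass through the identical map.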

    Pairwise Discriminative Speaker Verification in the I-Vector Space

    This work presents a new and efficient approach to discriminative speaker verification in the i-vector space. We illustrate the development of a linear discriminative classifier that is trained to discriminate between the hypothesis that a pair of feature vectors in a trial belong to the same speaker or to different speakers. This approach is an alternative to the usual discriminative setup that discriminates between a speaker and all the other speakers. We use a discriminative classifier based on a Support Vector Machine (SVM) that is trained to estimate the parameters of a symmetric quadratic function approximating a log-likelihood ratio score, without explicit modeling of the i-vector distributions as in the generative Probabilistic Linear Discriminant Analysis (PLDA) models. Training these models is feasible because it is not necessary to expand the i-vector pairs, which would be expensive or even impossible for medium-sized training sets. The results of experiments performed on the tel-tel extended core condition of the NIST 2010 Speaker Recognition Evaluation are competitive with the ones obtained by generative models, in terms of normalized Detection Cost Function and Equal Error Rate. Moreover, we show that it is possible to train a gender-independent discriminative model that achieves state-of-the-art accuracy, comparable to that of a gender-dependent system, saving memory and execution time both in training and in testing.
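A symmetric quadratic scoring function of the kind the SVM is said to learn can be sketched as below. The parameter names (`Lam`, `Gam`, `c`, `k`) and the requirement that both matrices be symmetric are assumptions used for illustration; the abstract only states that the learned function is a symmetric quadratic approximation of a log-likelihood ratio.

```python
import numpy as np

# Hedged sketch of a symmetric quadratic pairwise score: with Lam and
# Gam symmetric matrices, s(e, t) == s(t, e), so the two i-vectors of a
# trial can be swapped without changing the score, as a log-likelihood
# ratio for "same speaker vs. different speakers" requires.

def pairwise_score(e, t, Lam, Gam, c, k):
    """s(e, t) = e'Lam e + t'Lam t + 2 e'Gam t + (e + t)'c + k."""
    return (e @ Lam @ e + t @ Lam @ t + 2.0 * (e @ Gam @ t)
            + (e + t) @ c + k)
```

Because the score is a fixed quadratic form, it can be evaluated directly on each trial pair; the expensive explicit expansion of i-vector pairs the abstract mentions is never materialized.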

    Open-set Speaker Identification

    This study is motivated by the growing need for effective extraction of intelligence and evidence from audio recordings in the fight against crime, a need made ever more apparent by the recent expansion of criminal and terrorist organisations. The main focus is to enhance the open-set speaker identification process within speaker identification systems, which are affected by noisy audio data obtained in uncontrolled environments such as the street, restaurants, or other places of business. Consequently, two investigations are initially carried out: the effects of environmental noise on the accuracy of open-set speaker recognition, thoroughly covering relevant conditions in the considered application areas, such as variable training data length, background noise, and real-world noise; and the effects of short and varied-duration reference data in open-set speaker recognition. The investigations led to a novel method termed "vowel boosting" to enhance the reliability of speaker identification when operating with varied-duration speech data under uncontrolled conditions. Vowels naturally contain more speaker-specific information; by emphasising this natural phenomenon in the speech data, better identification performance is enabled. The traditional state-of-the-art GMM-UBMs and i-vectors are used to evaluate "vowel boosting". The proposed approach boosts the impact of the vowels on the speaker scores, which improves recognition accuracy for the specific case of open-set identification with short and varied-duration speech material.
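The core of "vowel boosting" can be sketched as a re-weighted utterance score. The weighting scheme below is an assumed illustration, not the thesis' exact formula: the abstract says only that the impact of vowel frames on the speaker scores is boosted.

```python
# Hedged sketch of the "vowel boosting" idea: per-frame scores (e.g.
# GMM-UBM frame log-likelihood ratios) are averaged with a larger
# weight on frames labelled as vowels, so the frames carrying more
# speaker-specific information dominate the utterance-level score.
# The boost factor and the weighted-mean form are assumptions.

def boosted_score(frame_scores, is_vowel, boost=2.0):
    """Weighted mean of per-frame scores, up-weighting vowel frames."""
    weights = [boost if v else 1.0 for v in is_vowel]
    total = sum(w * s for w, s in zip(weights, frame_scores))
    return total / sum(weights)
```

With short utterances the handful of vowel frames available then contributes proportionally more, which is where the abstract reports the gains.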

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

    Scoring Heterogeneous Speaker Vectors Using Nonlinear Transformations and Tied PLDA Models

    Most current state-of-the-art text-independent speaker recognition systems are based on i-vectors and on probabilistic linear discriminant analysis (PLDA). PLDA assumes that the i-vectors of a trial are homogeneous, i.e., that they have been extracted by the same system. In other words, the enrollment and test i-vectors belong to the same class. However, it is sometimes important to score trials including "heterogeneous" i-vectors, for instance, enrollment i-vectors extracted by an old system and test i-vectors extracted by a newer, more accurate system. In this paper, we introduce a PLDA model that is able to score heterogeneous i-vectors independent of their extraction approach, dimensions, and any other characteristics that make a set of i-vectors of the same speaker belong to different classes. The new model, which will be referred to as nonlinear tied-PLDA (NL-Tied-PLDA), is obtained by a generalization of our recently proposed nonlinear PLDA approach, which jointly estimates the PLDA parameters and the parameters of a nonlinear transformation of the i-vectors. The generalization consists of estimating a class-dependent nonlinear transformation of the i-vectors, with the constraint that the transformed i-vectors of the same speaker share the same speaker factor. The resulting model is flexible and accurate, as assessed by the results of a set of experiments performed on the extended core NIST SRE 2012 evaluation. In particular, NL-Tied-PLDA provides better results on heterogeneous trials with respect to the corresponding homogeneous trials scored by the old system, and, in some configurations, it also reaches the accuracy of the new system. Similar results were obtained on the female extended core NIST SRE 2010 telephone condition.
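The tied-model idea can be sketched as class-dependent maps into a shared speaker space. This is an assumed simplification: the paper's transformations are nonlinear and jointly estimated with the PLDA parameters, whereas the sketch uses affine maps and plain cosine scoring purely to illustrate how heterogeneous i-vectors become comparable once mapped into one space.

```python
import numpy as np

# Hedged sketch of the tied-model idea: each i-vector class (e.g. old
# extractor vs. new extractor, possibly with different dimensions) gets
# its own map into a shared space, constrained so the mapped vectors of
# one speaker line up; heterogeneous trials are then scored with a
# single back-end. Affine maps and cosine scoring are stand-ins for
# the paper's jointly trained nonlinear transforms and PLDA.

def to_shared_space(ivec, W, b):
    """Class-dependent affine map into the shared speaker space."""
    return W @ ivec + b

def cosine_score(x, y):
    """Simple shared-space trial score (stand-in for PLDA scoring)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Each class keeps its own `(W, b)`, so an enrollment i-vector from the old system and a test i-vector from the new one land in the same space before scoring.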