6 research outputs found

    Increasing the robustness of i-vectors with model-compensated first-order statistics (Model kompanzasyonlu birinci derece istatistikleri ile i-vektörlerin gürbüzlüğünün artırılması)

    Speaker recognition systems have improved significantly over the last decade, largely thanks to the performance of i-vectors. Despite these achievements, mismatch between training and test data still degrades recognition performance considerably. This paper proposes increasing robustness against additive noise by inserting model compensation techniques into the i-vector extraction scheme. For stationary noise, model compensation techniques produce highly robust systems. Parallel Model Compensation and Vector Taylor Series are considered state-of-the-art model compensation techniques. Applying these methods to the first-order statistics aims at training a noisy total variability space, which reduces the mismatch caused by additive noise. All other parts of the conventional i-vector scheme remain unchanged, such as total variability matrix training, i-vector dimensionality reduction, and i-vector scoring. The proposed method was tested with four noise types at signal-to-noise ratios (SNR) from -6 dB to 18 dB in 6 dB steps. Both methods achieved large reductions in equal error rate, even at the lowest SNR levels. On average, the proposed approach produced more than 50% relative reduction in equal error rate.
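The evaluation protocol in the abstract above (additive noise at SNRs from -6 dB to 18 dB in 6 dB steps, reported as relative reduction in equal error rate) can be illustrated with a minimal sketch. This is not the authors' code: the function names `mix_at_snr` and `relative_eer_reduction` are hypothetical, and the sketch assumes the noise recording is at least as long as the speech signal.

```python
import numpy as np

# SNR grid from the abstract: -6 dB to 18 dB in 6 dB steps.
SNR_GRID_DB = list(range(-6, 19, 6))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the resulting SNR equals `snr_db`.

    Assumption: `noise` has at least as many samples as `speech`.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def relative_eer_reduction(eer_baseline, eer_proposed):
    """Relative reduction in equal error rate, in percent."""
    return 100.0 * (eer_baseline - eer_proposed) / eer_baseline
```

Under this definition, a drop in EER from 10% to 4% would count as a 60% relative reduction; the numbers are purely illustrative and are not results from the paper.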

    A study of speech distortion conditions in real scenarios for speech processing applications

    The growing demand for robust speech processing applications able to operate in adverse scenarios calls for new evaluation protocols and datasets beyond artificial laboratory conditions. The characteristics of real data for a given scenario are rarely discussed in the literature. As a result, methods are often tested based on the authors' expertise and not always in scenarios with actual practical value. This paper aims to open this discussion by identifying some of the main problems with the data simulation or collection procedures used so far and summarizing the important characteristics of real scenarios to be taken into account, including the properties of reverberation, noise and the Lombard effect. Finally, we provide some preliminary guidelines for designing experimental setups and speech recognition results for proposal validation.

    Robust Speaker Recognition Using MAP Estimation of Additive Noise in i-vectors Space


    Robust text independent closed set speaker identification systems and their evaluation

    PhD thesis. This thesis focuses upon text independent closed set speaker identification. The contributions relate to evaluation studies in the presence of various types of noise and handset effects. Extensive evaluations are performed on four databases. The first contribution is in the context of the use of the Gaussian Mixture Model-Universal Background Model (GMM-UBM) with original speech recordings from only the TIMIT database. Four main simulations for Speaker Identification Accuracy (SIA) are presented including different fusion strategies: late fusion (score based), early fusion (feature based) and early-late fusion (combination of feature and score based), late fusion using concatenated static and dynamic features (features with temporal derivatives such as first order derivative delta and second order derivative delta-delta features, namely acceleration features), and finally fusion of statistically independent normalized scores. The second contribution is again based on the GMM-UBM approach. Comprehensive evaluations of the effect of Additive White Gaussian Noise (AWGN) and Non-Stationary Noise (NSN) (with and without a G.712 type handset) upon identification performance are undertaken. In particular, three NSN types with varying Signal to Noise Ratios (SNRs) were tested, corresponding to street traffic, a bus interior and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; 120 speakers were selected from each database to yield 3,600 speech utterances. The third contribution is based on the use of the I-vector; four combinations of I-vectors with 100 and 200 dimensions were employed. Then, various fusion techniques using maximum, mean, weighted sum and cumulative fusion with the same I-vector dimension were used to improve the SIA. Similarly, both interleaving and concatenated I-vector fusion were exploited to produce 200 and 400 I-vector dimensions. The system was evaluated with four different databases using 120 speakers from each database. The TIMIT, SITW and NIST 2008 databases were evaluated for various types of NSN, namely street-traffic NSN, bus-interior NSN and crowd talking NSN; the G.712 type handset at 16 kHz was also applied. As recommendations from the study in terms of the GMM-UBM approach, mean fusion is found to yield the overall best performance in terms of the SIA with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings. However, in the I-vector approach the best SIA was obtained from the weighted sum and the concatenated fusion. Sponsors: Ministry of Higher Education and Scientific Research (MoHESR), the Iraqi Cultural Attaché, Al-Mustansiriya University, and Al-Mustansiriya University College of Engineering in Iraq, for supporting the PhD scholarship.
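The late (score-level) fusion rules and the two I-vector fusion schemes named in this abstract can be sketched as follows. This is an illustrative reading, not the thesis implementation: the function names are hypothetical, the scores are assumed to be already normalized, and "interleaving" is assumed here to mean alternating the elements of two equal-length I-vectors, which the abstract does not spell out.

```python
import numpy as np


def fuse_scores(scores, method="mean", weights=None):
    """Fuse per-system scores of shape (n_systems, n_speakers) into one row.

    Assumption: scores are already normalized to a comparable range.
    """
    scores = np.asarray(scores, dtype=float)
    if method == "mean":
        return scores.mean(axis=0)
    if method == "max":
        return scores.max(axis=0)
    if method == "weighted_sum":
        if weights is None:
            raise ValueError("weighted_sum fusion needs one weight per system")
        return np.asarray(weights, dtype=float) @ scores
    raise ValueError(f"unknown fusion method: {method}")


def concatenate_ivectors(a, b):
    """Stack two I-vectors end to end (e.g. 100 + 100 -> 200 dimensions)."""
    return np.concatenate([np.asarray(a), np.asarray(b)])


def interleave_ivectors(a, b):
    """Alternate elements of two equal-length I-vectors: a0, b0, a1, b1, ..."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("interleaving assumes equal-length I-vectors")
    out = np.empty(a.size + b.size, dtype=np.result_type(a, b))
    out[0::2] = a
    out[1::2] = b
    return out
```

In a closed-set identification setting, the identified speaker would then be the index of the largest fused score, for example `int(np.argmax(fuse_scores(scores, "mean")))`.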

    Robust speaker recognition in presence of non-trivial environmental noise (toward greater biometric security)

    The aim of this thesis is to investigate speaker recognition in the presence of environmental noise, and to develop a robust speaker recognition method. Recently, speaker recognition has been the object of considerable research due to its wide use in various areas. Despite major developments in this field, there are still many limitations and challenges. Environmental noises and their variations are high on the list of challenges, since it is impossible to provide a noise-free environment. A novel approach is proposed to address the issue of performance degradation in environmental noise. This approach is based on estimating the signal-to-noise ratio (SNR) and detecting the ambient noise in the recognition signal in order to re-train the reference model for the claimed speaker and to generate a new adapted noisy model that decreases the noise mismatch with the recognition utterances. This approach is termed "Training on the fly" for robustness of speaker recognition in noisy environments. To detect the noise in the recognition signal, two different techniques are proposed. The first technique generates an emulated noise from the estimated power spectrum of the original noise, using a 1/3 octave band filter bank and a white noise signal. This emulated noise becomes close enough to the original noise contained in the input signal (recognition signal). The second technique extracts the noise from the input signal using a speech enhancement algorithm based on spectral subtraction. The training on the fly approach (using both techniques) has been examined using two feature approaches and two different kinds of artificial clean and noisy speech databases collected in different environments. Furthermore, the speech samples were text independent. The training on the fly approach yields a significant improvement in performance when compared with conventional speaker recognition (based on clean reference models). Moreover, training on the fly based on noise extraction showed the best results for all types of noisy data.
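The second noise-detection technique mentioned in this abstract relies on spectral subtraction. A minimal sketch of basic magnitude spectral subtraction is given below; it is not the thesis algorithm. The function name, the non-overlapping rectangular frames, the spectral floor, and the assumption that the first few frames contain noise only are all simplifications introduced here for illustration.

```python
import numpy as np


def spectral_subtraction(x, noise_frames=5, frame_len=256, floor=0.01):
    """Basic magnitude spectral subtraction on non-overlapping frames.

    Assumptions: the first `noise_frames` frames contain noise only, and
    any trailing samples that do not fill a whole frame are dropped.
    Returns the enhanced signal and the estimated noise magnitude spectrum.
    """
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)

    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Average magnitude spectrum of the leading (assumed noise-only) frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the noise estimate, keeping a small floor to avoid negatives.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    enhanced = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return enhanced.ravel(), noise_mag
```

In a training-on-the-fly setting as described in the abstract, a noise estimate of this kind would presumably be used to adapt the claimed speaker's reference model; that adaptation step is not shown here.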