Increasing the robustness of i-vectors with model-compensated first-order statistics (Model kompanzasyonlu birinci derece istatistikleri ile i-vektörlerin gürbüzlüğünün artırılması)
Speaker recognition systems have achieved significant improvements over the last decade, especially due to
the performance of i-vectors. Despite these achievements, mismatch between training and test data
still affects recognition performance considerably. In this paper, a solution is offered to increase
robustness against additive noise by inserting model compensation techniques into the i-vector
extraction scheme. For stationary noise, model compensation techniques produce highly robust
systems. Parallel Model Compensation and Vector Taylor Series are considered state-of-the-art
model compensation techniques. These methods are applied to the first-order statistics with the aim of
training a noisy total variability space, which reduces the mismatch caused by additive noise. All other
parts of the conventional i-vector scheme remain unchanged, such as total variability matrix training,
i-vector dimensionality reduction, and i-vector scoring. The proposed method was tested with four
different noise types at signal-to-noise ratios (SNRs) from -6 dB to 18 dB in 6 dB steps. Large
reductions in equal error rate were achieved with both methods, even at the lowest SNR levels. On
average, the proposed approach produced more than 50% relative reduction in equal error rate.
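To make the evaluation protocol concrete, the following is a minimal sketch, not the authors' code, of how a noise signal can be scaled to a target SNR before additive mixing and how a relative reduction in equal error rate is computed. Only the SNR grid (-6 dB to 18 dB in 6 dB steps) comes from the abstract; the function names and the EER values in the docstring example are illustrative assumptions.

```python
import math

# SNR grid stated in the abstract: -6 dB to 18 dB in 6 dB steps.
SNR_GRID_DB = list(range(-6, 19, 6))


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain g such that speech + g * noise has the requested SNR in dB."""
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    return math.sqrt(target_noise_power / mean_power(noise))


def mix_at_snr(speech, noise, snr_db):
    """Additively mix equal-length speech and noise at the requested SNR."""
    g = noise_gain_for_snr(speech, noise, snr_db)
    return [s + g * n for s, n in zip(speech, noise)]


def relative_reduction(baseline_eer, proposed_eer):
    """Relative EER reduction in percent.

    Hypothetical example: a drop from 20% to 8% EER is a 60% relative reduction.
    """
    return 100.0 * (baseline_eer - proposed_eer) / baseline_eer
```

Under this definition, a reported relative reduction of more than 50% means the proposed system's equal error rate is less than half the baseline's at the same noise condition.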
Robust text independent closed set speaker identification systems and their evaluation
PhD Thesis
This thesis focuses upon text independent closed set speaker
identification. The contributions relate to evaluation studies in the
presence of various types of noise and handset effects. Extensive
evaluations are performed on four databases.
The first contribution is in the context of the use of the Gaussian
Mixture Model-Universal Background Model (GMM-UBM) with
original speech recordings from only the TIMIT database. Four main
simulations for Speaker Identification Accuracy (SIA) are presented
including different fusion strategies: late fusion (score based), early
fusion (feature based) and early-late fusion (combination of feature and
score based), late fusion using concatenated static and dynamic
features (features with temporal derivatives such as first order
derivative delta and second order derivative delta-delta features,
namely acceleration features), and finally fusion of statistically
independent normalized scores.
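The dynamic features mentioned above can be illustrated with a short sketch. It uses the common regression formula for delta coefficients and obtains delta-delta features by applying the same operator to the deltas. The window size of 2, the edge handling (repeating the first and last frames), and the function names are assumptions, since the abstract does not state how the derivatives were computed.

```python
def delta(frames, n_window=2):
    """Regression-based delta of a list of feature frames (lists of floats).

    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with frames outside the utterance replaced by the nearest edge frame.
    """
    num_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, n_window + 1))

    def frame_at(t):
        # Clamp the index so that edges reuse the first or last frame.
        return frames[min(max(t, 0), num_frames - 1)]

    deltas = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = sum(
                n * (frame_at(t + n)[k] - frame_at(t - n)[k])
                for n in range(1, n_window + 1)
            )
            row.append(acc / denom)
        deltas.append(row)
    return deltas


def add_dynamic_features(static):
    """Concatenate static, delta, and delta-delta features frame by frame."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With this layout, each output frame is three times as long as the static frame, which is the concatenated static-plus-dynamic representation that the fusion experiments refer to.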
The second contribution is again based on the GMM-UBM
approach. Comprehensive evaluations of the effect of Additive White
Gaussian Noise (AWGN), and Non-Stationary Noise (NSN) (with and
without a G.712 type handset) upon identification performance are
undertaken. In particular, three NSN types with varying Signal to
Noise Ratios (SNRs) were tested corresponding to: street traffic, a bus
interior and a crowded talking environment. The performance
evaluation also considered the effect of late fusion techniques based on
score fusion, namely mean, maximum, and linear weighted sum fusion.
The databases employed were: TIMIT, SITW, and NIST 2008; and 120
speakers were selected from each database to yield 3,600 speech
utterances.
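The three late-fusion rules named above can be sketched as follows. This is an illustration rather than the thesis implementation: the score layout (one list of per-speaker scores for each subsystem), the min-max normalization step, and the function names are assumptions.

```python
def min_max_normalize(scores):
    """Map scores to [0, 1] so subsystems with different ranges are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def mean_fusion(score_lists):
    """Average the scores of all subsystems for each enrolled speaker."""
    return [sum(col) / len(col) for col in zip(*score_lists)]


def max_fusion(score_lists):
    """Keep the highest subsystem score for each enrolled speaker."""
    return [max(col) for col in zip(*score_lists)]


def weighted_sum_fusion(score_lists, weights):
    """Linear weighted sum of subsystem scores; weights are assumed to sum to 1."""
    return [sum(w * s for w, s in zip(weights, col)) for col in zip(*score_lists)]


def identify(fused_scores):
    """Closed-set decision: index of the speaker with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

In a closed-set setting, each subsystem scores the test utterance against every enrolled speaker, the per-speaker scores are fused with one of the rules, and `identify` returns the best-scoring speaker. How the weights for the linear weighted sum were chosen is not given in the abstract.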
The third contribution is based on the use of the I-vector; four
combinations of I-vectors with 100 and 200 dimensions were employed.
Then, various fusion techniques using maximum, mean, weighted sum
and cumulative fusion with the same I-vector dimension were used to
improve the SIA. Similarly, both interleaving and concatenated I-vector
fusion were exploited to produce 200 and 400 I-vector dimensions. The
system was evaluated with four different databases using 120 speakers
from each database. TIMIT, SITW and NIST 2008 databases were
evaluated for various types of NSN, namely street-traffic NSN,
bus-interior NSN and crowd talking NSN; and the G.712 type handset
at 16 kHz was also applied.
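As an illustration of how two I-vectors of equal dimension could be combined into one of twice the dimension, the sketch below shows concatenation and interleaving. The element ordering used in the thesis is not stated in the abstract, so alternating elements from the two vectors is an assumption, as are the function names.

```python
def concatenate_ivectors(a, b):
    """Append b after a: [a0, a1, ..., b0, b1, ...]."""
    return list(a) + list(b)


def interleave_ivectors(a, b):
    """Alternate elements: [a0, b0, a1, b1, ...]; requires equal lengths."""
    if len(a) != len(b):
        raise ValueError("I-vectors must have the same dimension")
    fused = []
    for x, y in zip(a, b):
        fused.extend((x, y))
    return fused
```

Either function turns two 100-dimensional vectors into a 200-dimensional one and two 200-dimensional vectors into a 400-dimensional one, matching the dimensions quoted above.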
As recommendations from the study in terms of the GMM-UBM
approach, mean fusion is found to yield overall best performance in terms
of the SIA with noisy speech, whereas linear weighted sum fusion is
overall best for original database recordings. However, in the I-vector
approach the best SIA was obtained from the weighted sum and the
concatenated fusion.
Ministry of Higher Education
and Scientific Research (MoHESR), and the Iraqi Cultural Attaché,
Al-Mustansiriya University, Al-Mustansiriya University College of
Engineering in Iraq for supporting my PhD scholarship