Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification
Frame alignments can be computed by different methods in GMM-based speaker
verification. By incorporating a phonetic Gaussian mixture model (PGMM), we are
able to compare the performance using alignments extracted from the deep neural
networks (DNN) and the conventional hidden Markov model (HMM) in digit-prompted
speaker verification. Based on the different characteristics of these two
alignments, we present a novel content verification method to improve the
system security without much computational overhead. Our experiments on the
RSR2015 Part-3 digit-prompted task show that the DNN-based alignment performs
on par with the HMM alignment. The results also demonstrate the effectiveness
of the proposed Kullback-Leibler (KL) divergence based scoring to reject speech
with incorrect pass-phrases.
Comment: accepted by APSIPA ASC 201
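The abstract does not give the exact form of the KL divergence based scoring, so the following is only a minimal sketch of the general idea: compare two discrete distributions (here assumed to be per-digit occupancy distributions derived from two alignments) and reject the utterance when they diverge too much. The function names, the smoothing constant `eps`, and the threshold value are hypothetical and not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """Return KL(p || q) for two discrete distributions of equal length.

    Both inputs are renormalized to sum to one; q is floored at eps
    (an assumed smoothing constant) to avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi = pi / p_sum
        qi = max(qi / q_sum, eps)
        if pi > 0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / qi)
    return total


def accept_content(p, q, threshold=0.5):
    """Accept the pass-phrase when the divergence is at most the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on development data.
    """
    return kl_divergence(p, q) <= threshold
```

Under these assumptions, identical distributions give a divergence of zero and are accepted, while strongly mismatched ones exceed the threshold and are rejected.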
Speaker recognition using frequency filtered spectral energies
The spectral parameters that result from filtering the
frequency sequence of log mel-scaled filter-bank energies
with a simple first or second order FIR filter have proved
to be an efficient speech representation in terms of both
speech recognition rate and computational load. Recently,
the authors have shown that this frequency filtering can
approximately equalize the cepstrum variance, enhancing
the oscillations of the spectral envelope curve that are
most effective for discrimination between speakers. Even
better speaker identification results than with mel-cepstrum
have been obtained on the TIMIT database,
especially when white noise was added. On the other
hand, the hybridization of both linear prediction and
filter-bank spectral analysis using either cepstral
transformation or the alternative frequency filtering has
been explored for speaker verification. The combination
of hybrid spectral analysis and frequency filtering, which
had been shown to outperform the conventional
techniques in clean and noisy word recognition, has yielded
good text-dependent speaker verification results on the
new speaker-oriented telephone-line POLYCOST
database.
Peer Reviewed. Postprint (published version)
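As a rough illustration of frequency filtering, the sketch below applies an FIR filter along the frequency index of one frame of log mel-scaled filter-bank energies. The abstract only says "a simple first or second order FIR filter"; the specific second-order filter used here, H(z) = z - z^{-1}, and the zero padding at the band edges are assumptions, and `frequency_filter` is a hypothetical name.

```python
def frequency_filter(log_energies):
    """Filter one frame of log filter-bank energies along the frequency axis.

    Implements out[k] = e[k + 1] - e[k - 1], i.e. the assumed second-order
    filter H(z) = z - z^{-1}, with zeros assumed beyond both band edges.
    The output has the same length as the input.
    """
    n = len(log_energies)
    padded = [0.0] + list(log_energies) + [0.0]
    # padded[k + 2] is e[k + 1] and padded[k] is e[k - 1] in the original indexing
    return [padded[k + 2] - padded[k] for k in range(n)]
```

For example, under these assumptions `frequency_filter([1.0, 2.0, 4.0, 7.0])` returns `[2.0, 3.0, 5.0, -4.0]`. Unlike the cepstral transform, the filtered parameters stay in the frequency domain, one value per band.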
Phoneme and Sub-Phoneme T-Normalization for Text-Dependent Speaker Recognition
Test normalization (T-Norm) is a score normalization technique that is regularly and successfully applied in the context of text-independent speaker recognition. It is less frequently applied, however, to text-dependent or text-prompted speaker recognition, mainly because its improvement in this context is more modest. In this paper we present a novel way to improve the performance of T-Norm for text-dependent systems. It consists of applying score T-Normalization at the phoneme or sub-phoneme level instead of at the sentence level. Experiments on the YOHO corpus show that, while standard sentence-level T-Norm does not improve the equal error rate (EER), phoneme and sub-phoneme level T-Norm produce relative EER reductions of 18.9% and 20.1%, respectively, on a state-of-the-art HMM-based text-dependent speaker recognition system. Results are even better for working points with low false acceptance rates.
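A minimal sketch of the difference between sentence-level and phoneme-level T-Norm follows. The standard T-Norm formula normalizes a target score by the mean and standard deviation of cohort (impostor model) scores on the same test speech. How the paper combines the per-phoneme normalized scores is not stated in the abstract; the plain average used here is an assumption, and both function names are hypothetical.

```python
import statistics


def t_norm(target_score, cohort_scores):
    """Standard T-Norm: (score - cohort mean) / cohort standard deviation."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    if sigma == 0:
        raise ValueError("cohort scores have zero variance")
    return (target_score - mu) / sigma


def phoneme_level_t_norm(segments):
    """Normalize each phoneme segment separately, then average.

    segments: list of (target_score, cohort_scores) pairs, one per phoneme.
    Averaging the normalized scores is an assumed combination rule.
    """
    return statistics.mean([t_norm(score, cohort) for score, cohort in segments])
```

Sentence-level T-Norm would call `t_norm` once on the whole-utterance score, whereas the phoneme-level variant lets each phoneme use its own cohort statistics before the scores are pooled.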
Speaker recognition by means of restricted Boltzmann machine adaptation
Restricted Boltzmann Machines (RBMs) have shown success in speaker recognition. In this paper, RBMs are investigated in a framework comprising universal model training and model adaptation. Taking advantage of the RBM unsupervised learning algorithm, a global model is trained on all available background data. This general speaker-independent model, referred to as URBM, is further adapted to the data of a specific speaker to build a speaker-dependent model. To show its effectiveness, we have applied this framework to two different tasks. It has been used to discriminatively model target and impostor spectral features for classification. It has also been used to produce a vector-based representation for speakers. This vector-based representation, similar to the i-vector, can be further used for speaker recognition with either cosine scoring or Probabilistic Linear Discriminant Analysis (PLDA). The evaluation is performed on the core test condition of the NIST SRE 2006 database.
Peer Reviewed. Postprint (author's final draft)
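Cosine scoring, mentioned above as one way to compare vector-based speaker representations, can be sketched as follows. This is only the generic cosine similarity between an enrollment vector and a test vector; any preprocessing the paper applies to the vectors beforehand is not described in the abstract, and `cosine_score` is a hypothetical name.

```python
import math


def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between two speaker vectors of equal length.

    Returns a value in [-1, 1]; higher means the vectors point in more
    similar directions, which is read as stronger same-speaker evidence.
    """
    if len(enroll_vec) != len(test_vec):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(enroll_vec, test_vec))
    norm_e = math.sqrt(sum(x * x for x in enroll_vec))
    norm_t = math.sqrt(sum(y * y for y in test_vec))
    if norm_e == 0 or norm_t == 0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / (norm_e * norm_t)
```

A verification decision would then compare this score against a threshold tuned on development data. PLDA scoring, the alternative named in the abstract, requires a trained model and is not sketched here.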
- …