Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification
Frame alignments can be computed by different methods in GMM-based speaker
verification. By incorporating a phonetic Gaussian mixture model (PGMM), we are
able to compare the performance using alignments extracted from the deep neural
networks (DNN) and the conventional hidden Markov model (HMM) in digit-prompted
speaker verification. Based on the different characteristics of these two
alignments, we present a novel content verification method to improve the
system security without much computational overhead. Our experiments on the
RSR2015 Part-3 digit-prompted task show that the DNN-based alignment performs
on par with the HMM alignment. The results also demonstrate the effectiveness
of the proposed Kullback-Leibler (KL) divergence based scoring to reject speech
with incorrect pass-phrases. Comment: accepted by APSIPA ASC 201
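The abstract does not spell out how the KL divergence score is computed. The following is only a minimal sketch, assuming the score compares two discrete distributions, for example the share of frames each alignment assigns to each phonetic class. The function name `kl_divergence`, the smoothing constant `eps`, and the example occupancy vectors are hypothetical and not taken from the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions.

    Assumption: p and q are non-negative vectors of the same length,
    e.g. per-class frame counts from two different alignments.
    """
    p = np.asarray(p, dtype=float) + eps  # smooth to avoid log(0)
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()                       # normalise to probabilities
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical per-class frame counts from two alignments of one utterance.
counts_a = [30, 25, 45]   # e.g. from the prompted-text alignment
counts_b = [28, 27, 45]   # similar occupancy: small divergence
counts_c = [70, 5, 25]    # very different occupancy: larger divergence

score_match = kl_divergence(counts_a, counts_b)
score_mismatch = kl_divergence(counts_a, counts_c)
```

Under this assumption, a larger score suggests that the two alignments disagree, which is the kind of signal a content check could threshold to reject an incorrect pass-phrase.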
Phonetic aware techniques for Speaker Verification
The goal of this thesis is to improve current state-of-the-art techniques in speaker verification
(SV), typically based on "identity-vectors" (i-vectors) and deep neural networks (DNNs), by exploiting diverse (phonetic) information extracted using various techniques such as automatic
speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by a "universal background model". The speaker-specific subspace
depends on the speaker's voice characteristics, but also on the verbalised text of a speaker. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis
technique to obtain a low-dimensional speaker-specific representation. Furthermore, the DNN output is also employed in a conventional i-vector framework to model phonetic information
embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics.
More specifically, the techniques proposed in this thesis are applied to various SV tasks,
namely, text-independent and text-dependent SV. For the text-independent SV task, several ASR
systems are developed and applied to compute phonetic posterior probabilities, which are subsequently
exploited to enhance the speaker-specific information included in i-vectors. These approaches
are then extended to the text-dependent SV task, exploiting temporal information in a principled
way, i.e., by applying dynamic time warping to speaker-informative vectors.
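As an illustration of dynamic time warping over sequences of vectors, here is a minimal sketch. It assumes each utterance is a sequence of fixed-dimensional vectors and uses cosine distance as the local cost. The local cost, the step pattern, and the length normalisation are assumptions chosen for simplicity, not details confirmed by the thesis abstract.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between rows of a (n x d) and b (m x d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def dtw_distance(a, b):
    """Classic DTW cost between two vector sequences, normalised by n + m."""
    cost = cosine_distance_matrix(a, b)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost table
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Allowed steps: insertion, deletion, match.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)

rng = np.random.default_rng(0)
enrol = rng.normal(size=(20, 8))        # hypothetical enrolment sequence
slow = np.repeat(enrol, 2, axis=0)      # same content, spoken twice as slowly
other = rng.normal(size=(25, 8))        # unrelated sequence

d_same = dtw_distance(enrol, slow)
d_other = dtw_distance(enrol, other)
```

Because the warping path can absorb differences in speaking rate, the time-stretched copy scores close to zero while the unrelated sequence does not.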
Finally, as opposed to training a DNN with phonetic information, a DNN is trained in an end-to-end
fashion to directly discriminate speakers. The baseline end-to-end SV approach consists of
mapping a variable-length speech segment to a fixed-dimensional speaker vector by estimating
the mean of the hidden representations in the DNN. We improve upon this technique by
computing a distance function between two utterances which takes into account common
phonetic units. The whole network is optimized by employing a triplet-loss objective function.
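The baseline idea described above, mean pooling followed by a triplet objective, can be sketched as follows. This is only an illustration in NumPy: the hidden representations are stand-in random arrays rather than DNN outputs, and the cosine distance and the `margin` value are assumptions. It does not implement the phonetic-unit-aware distance proposed in the thesis.

```python
import numpy as np

def mean_pool(frames):
    """Map a (num_frames x dim) array to one fixed-dimensional vector."""
    return frames.mean(axis=0)

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the positive closer than the negative."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
# Stand-in hidden representations for segments of different lengths.
short_segment = rng.normal(size=(50, 16))
long_segment = rng.normal(size=(300, 16))
vec_short = mean_pool(short_segment)
vec_long = mean_pool(long_segment)

# Toy triplet: same-speaker vector matches the anchor, impostor points away.
anchor = np.array([1.0, 0.0])
same_speaker = np.array([1.0, 0.0])
impostor = np.array([-1.0, 0.0])
loss_good = triplet_loss(anchor, same_speaker, impostor)
loss_bad = triplet_loss(anchor, impostor, same_speaker)
```

Both segments map to vectors of the same dimension regardless of their length, and the loss is zero only when the same-speaker vector is closer to the anchor than the impostor by at least the margin.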
The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010
and RSR2015. Significant improvements are observed over the baseline systems on both the
text-dependent and text-independent SV tasks by applying phonetic knowledge.