Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification
Frame alignments can be computed by different methods in GMM-based speaker
verification. By incorporating a phonetic Gaussian mixture model (PGMM), we are
able to compare the performance using alignments extracted from the deep neural
networks (DNN) and the conventional hidden Markov model (HMM) in digit-prompted
speaker verification. Based on the different characteristics of these two
alignments, we present a novel content verification method to improve the
system security without much computational overhead. Our experiments on the
RSR2015 Part-3 digit-prompted task show that the DNN-based alignment performs
on par with the HMM alignment. The results also demonstrate the effectiveness
of the proposed Kullback-Leibler (KL) divergence based scoring to reject speech
with incorrect pass-phrases.
Comment: accepted by APSIPA ASC 201
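The abstract does not spell out how the KL divergence score is computed, so the following is only a generic sketch of the idea: compare two discrete distributions (for example, per-digit frame occupancies from two alignments) and reject the utterance when they diverge too much. The function names, the smoothing constant `eps`, and the `threshold` value are assumptions made for illustration, not details from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """Return KL(p || q) for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Smooth and renormalize so zero entries cannot cause log(0) or division by zero.
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )


def accept_content(p, q, threshold=0.5):
    """Hypothetical decision rule: accept when the divergence is small."""
    return kl_divergence(p, q) <= threshold
```

In practice the threshold would be tuned on held-out data; the value here is a placeholder.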
Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification
In this paper, a novel cross-device text-independent speaker verification
architecture is proposed. The majority of the state-of-the-art deep architectures
that are used for speaker verification tasks consider Mel-frequency cepstral
coefficients. In contrast, our proposed Siamese convolutional neural network
architecture uses Mel-frequency spectrogram coefficients to benefit from the
dependency of the adjacent spectro-temporal features. Moreover, although
spectro-temporal features have proved to be highly reliable in speaker
verification models, they only represent some aspects of short-term acoustic
level traits of the speaker's voice. However, the human voice consists of
several linguistic levels, such as acoustics, lexicon, prosody, and phonetics,
that can be utilized in speaker verification models. To compensate for these
inherent shortcomings in spectro-temporal features, we propose to enhance the
proposed Siamese convolutional neural network architecture by deploying a
multilayer perceptron network to incorporate the prosodic, jitter, and shimmer
features. The proposed end-to-end verification architecture performs feature
extraction and verification simultaneously. This proposed architecture displays
significant improvement over classical signal processing approaches and deep
algorithms for forensic cross-device speaker verification.
Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory,
Applications, and Systems (BTAS 2018)
A Double Joint Bayesian Approach for J-Vector Based Text-dependent Speaker Verification
The j-vector has proven very effective in text-dependent speaker
verification with short-duration speech. However, the current state-of-the-art
back-end classifiers, e.g., the joint Bayesian model, cannot make full use of such
deep features. In this paper, we generalize the standard joint Bayesian
approach to model the multi-faceted information in the j-vector explicitly and
jointly. In our generalization, the j-vector is modeled as the output of
a generative Double Joint Bayesian (DoJoBa) model, which contains several kinds
of latent variables. With DoJoBa, we are able to explicitly build a model that
can combine multiple heterogeneous sources of information from the j-vectors. In
the verification step, we calculate a likelihood describing whether the two
j-vectors have consistent labels. On the public RSR2015 data corpus,
the experimental results show that our approach achieves 0.02% EER and
0.02% EER for the impostor-wrong and impostor-correct cases, respectively.
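For readers unfamiliar with the metric reported above, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it from two lists of trial scores by scanning candidate thresholds; it is a simple illustration, not the evaluation code used by the authors, and the function and argument names are assumptions.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by scanning every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With a finite set of trials the two rates rarely cross exactly, so the midpoint at the closest threshold is used as an approximation.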
Speaker verification using sequence discriminant support vector machines
This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels. Score-space kernels generalize Fisher kernels and are based on underlying generative models such as Gaussian mixture models (GMMs). This approach provides direct discrimination between whole sequences, in contrast with the frame-level approaches at the heart of most current systems. The resultant SVMs operate in a very high-dimensional space, since the dimensionality is related to the number of parameters in the underlying generative model. To address problems that arise in the resultant optimization, we introduce a technique called spherical normalization that preconditions the Hessian matrix. We have performed speaker verification experiments using the PolyVar database. The SVM system presented here achieves a 34% relative reduction in error rate compared to a GMM likelihood ratio system.
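To illustrate how a generative model can turn a variable-length sequence into a single fixed-length vector, here is a minimal sketch of a Fisher score: the gradient of the sequence log-likelihood with respect to the component means of a one-dimensional GMM. This is only the simplest case that score-space kernels generalize, not the paper's full method, and it omits spherical normalization. The function name and the one-dimensional setting are assumptions chosen for brevity.

```python
import math


def fisher_score_means(frames, weights, means, variances):
    """Gradient of the sequence log-likelihood w.r.t. each GMM mean (1-D GMM)."""
    score = [0.0] * len(means)
    for x in frames:
        # Weighted Gaussian density of this frame under each component.
        dens = [
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        ]
        total = sum(dens)
        for k, (m, v) in enumerate(zip(means, variances)):
            # Component posterior times the derivative of the log-density.
            score[k] += (dens[k] / total) * (x - m) / v
    return score
```

The output length depends only on the number of mixture components, so sequences of different lengths map to vectors of the same size and can be compared directly by a kernel.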