    Joint Bayesian Gaussian discriminant analysis for speaker verification

    State-of-the-art i-vector-based speaker verification relies on variants of Probabilistic Linear Discriminant Analysis (PLDA) for discriminant analysis. We are mainly motivated by the recent joint Bayesian (JB) method, which was originally proposed for discriminant analysis in face verification. We apply JB to speaker verification and make three contributions beyond the original JB. 1) In contrast to the EM iterations with approximated statistics in the original JB, EM iterations with exact statistics are employed and give better performance. 2) We propose simultaneous diagonalization (SD) of the within-class and between-class covariance matrices to achieve efficient testing, which has a broader application scope than the SVD-based efficient testing method in the original JB. 3) We scrutinize similarities and differences between various Gaussian PLDAs and JB, complementing the previous analysis that compared JB only with Prince-Elder PLDA. Extensive experiments on NIST SRE10 core condition 5 empirically validate the superiority of JB, with a faster convergence rate and a 9-13% EER reduction compared with state-of-the-art PLDA. Comment: accepted by ICASSP201
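The simultaneous diagonalization in contribution 2) can be sketched with a generalized symmetric eigenproblem: one transform whitens the within-class covariance and diagonalizes the between-class covariance at the same time. The sketch below is a minimal NumPy/SciPy illustration on synthetic matrices, not the authors' implementation; `simultaneous_diagonalization` and the toy covariances are hypothetical names for this example.

```python
import numpy as np
from scipy.linalg import eigh

def simultaneous_diagonalization(S_w, S_b):
    """Find W with W.T @ S_w @ W = I and W.T @ S_b @ W diagonal.

    scipy.linalg.eigh(A, B) solves the generalized problem A v = lam B v
    and returns eigenvectors normalized so that V.T @ B @ V = I, which is
    exactly the simultaneous-diagonalization condition.
    """
    eigvals, W = eigh(S_b, S_w)
    return W, eigvals

# Synthetic SPD covariances for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
S_w = np.cov(X, rowvar=False) + 0.1 * np.eye(5)   # stand-in within-class cov
M = rng.standard_normal((50, 5))
S_b = np.cov(M, rowvar=False) + 0.1 * np.eye(5)   # stand-in between-class cov

W, lam = simultaneous_diagonalization(S_w, S_b)
assert np.allclose(W.T @ S_w @ W, np.eye(5), atol=1e-6)    # whitened
assert np.allclose(W.T @ S_b @ W, np.diag(lam), atol=1e-6) # diagonalized
```

After this transform, both covariances are diagonal in the new basis, so per-trial scoring reduces to elementwise operations, which is the source of the efficiency gain the abstract describes.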

    Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

    In this paper, we explore the encoding/pooling layer and loss function in an end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer aggregates the variable-length input sequence into an utterance-level representation. Besides basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. As for the loss function for open-set speaker verification, center loss and angular softmax loss are introduced in the end-to-end system to obtain more discriminative speaker embeddings. Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end system can be significantly improved by the proposed encoding layers and loss functions. Comment: Accepted for Speaker Odyssey 201
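The self-attentive pooling idea can be sketched as a softmax-weighted average over frames, where the weights are produced by a small learned scorer. The NumPy sketch below is a simplified illustration of the general mechanism, not the paper's exact layer; `W_att` and `v_att` stand in for hypothetical attention parameters that would normally be trained.

```python
import numpy as np

def self_attentive_pooling(H, W_att, v_att):
    """Aggregate a variable-length frame sequence H (T x D) into one
    utterance-level vector using learned attention weights.

    W_att (D x A) and v_att (A,) are hypothetical attention parameters;
    the weights are a softmax over frames, so they sum to 1.
    """
    scores = np.tanh(H @ W_att) @ v_att            # one score per frame, (T,)
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return alpha @ H                               # weighted average, (D,)

rng = np.random.default_rng(0)
H = rng.standard_normal((50, 8))                   # 50 frames, 8-dim features
e = self_attentive_pooling(H, rng.standard_normal((8, 4)),
                           rng.standard_normal(4))
assert e.shape == (8,)
```

With uniform attention weights this reduces to the basic temporal average pooling the abstract mentions; the learned scorer lets the network emphasize more informative frames.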

    Tackling Age-Invariant Face Recognition with Non-Linear PLDA and Pairwise SVM

    Face recognition approaches, especially those based on deep learning models, are becoming increasingly attractive for missing person identification, due to their effectiveness and the relative simplicity of obtaining information available for comparison. However, these methods still suffer from large accuracy drops when they have to tackle cross-age recognition, which is the most common condition to face in this specific task. To address these challenges, in this paper we investigate the contribution of different generative and discriminative models that extend the Probabilistic Linear Discriminant Analysis (PLDA) approach. These models aim at disentangling identity from other facial variations (including those due to age effects). As such, they can improve the age-invariance characteristics of state-of-the-art deep facial embeddings. In this work, we experiment with a standard PLDA, a non-linear version of PLDA, and the Pairwise Support Vector Machine (PSVM), and introduce a non-linear version of PSVM (NL-PSVM) as a novelty. We thoroughly analyze the proposed models' performance when addressing cross-age recognition in a large and challenging experimental dataset containing around 2.5 million images of 790,000 individuals. Results on this testbed confirm the challenges in age-invariant face recognition, showing significant differences in the effects of aging across embedding models, genders, age ranges, and age gaps. Our experiments also show the effectiveness of both PLDA and its proposed extensions in reducing the age sensitivity of the facial features, especially when there are significant age differences (more than ten years) between the compared images or when age-related facial changes are more pronounced, such as during the transition from childhood to adolescence or from adolescence to adulthood. Further experiments on three standard cross-age benchmarks (MORPH2, CACD-VS and FG-NET) confirm the proposed models' effectiveness.
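The PLDA scoring that underlies these extensions can be sketched with the two-covariance formulation: a verification trial is scored as a log-likelihood ratio between the "same identity" hypothesis (the pair shares a latent identity variable, coupling the joint covariance) and the "different identity" hypothesis (independent embeddings). The sketch below is a generic Gaussian PLDA illustration on synthetic data, not the paper's standard or non-linear models; all variable names here are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def plda_llr(x1, x2, B, W):
    """Log-likelihood ratio under a two-covariance Gaussian PLDA model.

    B = between-class (identity) covariance, W = within-class covariance,
    embeddings assumed zero-mean. Same-identity pairs share the latent
    identity variable, which couples the joint covariance off-diagonally.
    """
    d = len(x1)
    pair = np.concatenate([x1, x2])
    same = np.block([[B + W, B], [B, B + W]])        # coupled by identity
    diff = np.block([[B + W, np.zeros((d, d))],
                     [np.zeros((d, d)), B + W]])     # independent
    return (mvn.logpdf(pair, np.zeros(2 * d), same)
            - mvn.logpdf(pair, np.zeros(2 * d), diff))

rng = np.random.default_rng(0)
B, W = np.eye(3), 0.5 * np.eye(3)
ident = rng.standard_normal(3)                       # shared identity offset
x1 = ident + 0.1 * rng.standard_normal(3)
x2 = ident + 0.1 * rng.standard_normal(3)
s_same = plda_llr(x1, x2, B, W)
s_diff = plda_llr(rng.standard_normal(3), rng.standard_normal(3), B, W)
```

A positive score favors the same-identity hypothesis; the paper's non-linear variants and the PSVM replace or extend this Gaussian scoring to better handle age-induced variation.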

    Graph Neural Network Backend for Speaker Recognition

    Currently, most speaker recognition backends, such as cosine scoring, linear discriminant analysis (LDA), or probabilistic linear discriminant analysis (PLDA), make decisions by calculating the similarity or distance between enrollment and test embeddings that have already been extracted from neural networks. However, each embedding and its neighboring embeddings form a different local structure in the low-dimensional space, which may be helpful for recognition but is often ignored. To take advantage of this structure, we propose a graph neural network (GNN) backend that mines latent relationships among embeddings for classification. We treat all embeddings as nodes on a graph, with edges computed by a similarity function such as cosine, LDA+cosine, or LDA+PLDA. We study different graph settings and explore variants of GNNs to find better message-passing and aggregation schemes for the recognition task. Experimental results on the NIST SRE14 i-vector challenge, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate that our proposed GNN backends significantly outperform current mainstream methods.
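The graph construction and message-passing idea can be sketched in two steps: build a k-nearest-neighbour graph over the embeddings using cosine similarity as the edge weight, then let each node aggregate its neighbours' representations. The NumPy sketch below is a simplified, weight-free illustration of the mechanism, not the paper's GNN; the functions and the k=2 setting are hypothetical choices for this example.

```python
import numpy as np

def cosine_graph(E, k=2):
    """Build a k-nearest-neighbour graph over embeddings E (N x D),
    with cosine similarity as the edge weight."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = E @ E.T                                  # pairwise cosine similarity
    np.fill_diagonal(S, -np.inf)                 # exclude self-edges
    A = np.zeros_like(S)
    for i in range(len(S)):
        nn = np.argsort(S[i])[-k:]               # k most similar neighbours
        A[i, nn] = S[i, nn]
    return np.maximum(A, A.T)                    # symmetrize the adjacency

def message_pass(E, A):
    """One round of mean aggregation over neighbours: a GCN-style layer
    without learned weights, for illustration only."""
    deg = A.sum(axis=1, keepdims=True) + 1e-9
    return 0.5 * E + 0.5 * (A @ E) / deg         # mix self and neighbour info

rng = np.random.default_rng(0)
E = rng.standard_normal((6, 4))                  # six synthetic embeddings
A = cosine_graph(E)
E2 = message_pass(E, A)
assert E2.shape == E.shape
```

In an actual GNN backend, the aggregation step would include trained transformation weights and nonlinearities, and the refined node representations (or edge predictions) would feed the verification decision.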