Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
In this paper, we explore the encoding/pooling layer and loss function in the
end-to-end speaker and language recognition system. First, a unified and
interpretable end-to-end system for both speaker and language recognition is
developed. It accepts variable-length input and produces an utterance level
result. In the end-to-end system, the encoding layer plays a role in
aggregating the variable-length input sequence into an utterance level
representation. Besides the basic temporal average pooling, we introduce a
self-attentive pooling layer and a learnable dictionary encoding layer to get
the utterance level representation. In terms of loss function for open-set
speaker verification, to obtain more discriminative speaker embeddings, center loss
and angular softmax loss are introduced in the end-to-end system. Experimental
results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the
end-to-end learning system can be significantly improved by the proposed
encoding layers and loss functions.

Comment: Accepted for Speaker Odyssey 2018
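The self-attentive pooling idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attention parameter `w` stands in for the small attention network a trained system would learn, and plain Python lists stand in for tensors.

```python
import math

def self_attentive_pooling(frames, w):
    """Collapse a variable-length sequence of frame vectors into one
    fixed-size utterance-level vector via attention weights.

    frames: list of T frame feature vectors (each a list of D floats)
    w:      attention parameter vector of length D (hypothetical; stands
            in for the learned attention network)
    """
    # score each frame, then softmax over time (numerically stable)
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in frames]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    alpha = [e / z for e in exp]
    # attention-weighted average of frames -> utterance representation
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alpha, frames)) for d in range(dim)]

# toy usage: 3 frames of 2-dim features pool to a single 2-dim vector
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = self_attentive_pooling(frames, w=[1.0, 1.0])
print(len(pooled))  # 2, regardless of the number of input frames
```

With uniform attention (equal scores) this reduces to the basic temporal average pooling the abstract mentions as the baseline.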
Multimodal Fusion of Polynomial Classifiers for Automatic Person Recognition
With the prevalence of the information age, privacy and personalization are at the forefront of today's society. As such, biometrics are viewed as essential components of current and evolving technological systems. Consumers demand unobtrusive and noninvasive approaches. In our previous work, we have demonstrated a speaker verification system that meets these criteria. However, there are additional constraints for fielded systems. The required recognition transactions are often performed in adverse environments and across diverse populations, necessitating robust solutions.
There are two significant problem areas in current generation speaker verification systems. The first is the difficulty in acquiring clean audio signals (in all environments) without encumbering the user with a head-mounted close-talking microphone. Second, unimodal biometric systems do not work with a significant percentage of the population. To combat these issues, multimodal techniques are being investigated to improve system robustness to environmental conditions, as well as improve overall accuracy across the population.
We propose a multimodal approach that builds on our current state-of-the-art speaker verification technology. In order to maintain the transparent nature of the speech interface, we focus on optical sensing technology to provide the additional modality, giving us an audio-visual person recognition system. For the audio domain, we use our existing speaker verification system. For the visual domain, we focus on lip motion. This is chosen, rather than static face or iris recognition, because it provides dynamic information about the individual. In addition, the lip dynamics can aid speech recognition to provide liveness testing.
The visual processing method makes use of both color and edge information, combined within a Markov random field (MRF) framework, to localize the lips. Geometric features are extracted and input to a polynomial classifier for the person recognition process. A late integration approach, based on a probabilistic model, is employed to combine the two modalities. The system is tested on the XM2VTS database combined with AWGN (in the audio domain) over a range of signal-to-noise ratios
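The late-integration step can be sketched as a weighted combination of per-modality scores. The log-likelihood-ratio framing, the equal default weight, and the function names here are illustrative assumptions, not the paper's exact probabilistic model:

```python
def fuse_scores(audio_llr: float, visual_llr: float,
                audio_weight: float = 0.5) -> float:
    """Late integration of two modality scores.

    audio_llr / visual_llr: scores from the speaker-verification and
    lip-motion classifiers, assumed on a log-likelihood-ratio scale
    (hypothetical). Under an independence assumption log-likelihoods
    add, so a weighted sum is a common fusion rule; audio_weight would
    be tuned, e.g. lowered as the audio SNR drops.
    """
    return audio_weight * audio_llr + (1.0 - audio_weight) * visual_llr

def accept(audio_llr: float, visual_llr: float,
           threshold: float = 0.0) -> bool:
    """Accept the identity claim when the fused score clears a threshold."""
    return fuse_scores(audio_llr, visual_llr) > threshold

# in noisy audio, a strong visual score can rescue a weak audio score
print(accept(-0.5, 2.0))  # True: fused score 0.75 > 0.0
```

This is why the abstract tests fusion across a range of signal-to-noise ratios: the benefit of the visual channel grows as the audio channel degrades.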
Testing for Homogeneity with Kernel Fisher Discriminant Analysis
We propose to investigate test statistics for testing homogeneity in
reproducing kernel Hilbert spaces. Their asymptotic distributions under the
null hypothesis are derived, and consistency against fixed and local
alternatives is assessed. Finally, experimental evidence of the performance
of the proposed approach on both artificial data and a speaker verification
task is provided.
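As a rough illustration of kernel two-sample (homogeneity) testing, here is a sketch of the biased maximum mean discrepancy statistic with an RBF kernel on scalar samples. This is a related but simpler statistic than the kernel Fisher discriminant one: KFDA additionally normalizes by the within-class covariance operator in the RKHS, which is omitted here.

```python
import math

def rbf(x: float, y: float, gamma: float = 1.0) -> float:
    """RBF kernel on scalars; gamma is an illustrative bandwidth choice."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between samples xs and ys.

    Near zero when both samples come from the same distribution;
    large when the distributions differ.
    """
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

same = mmd2([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
diff = mmd2([0.0, 0.5, 1.0], [4.0, 4.5, 5.0])
print(same < diff)  # True: separated samples give a larger statistic
```

A homogeneity test then compares the statistic against its null distribution (derived asymptotically in the paper, or approximated by permutation).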
Graph Neural Network Backend for Speaker Recognition
Currently, most speaker recognition backends, such as cosine, linear
discriminant analysis (LDA), or probabilistic linear discriminant analysis
(PLDA), make decisions by calculating similarity or distance between enrollment
and test embeddings which are already extracted from neural networks. However,
for each embedding, the local structure formed by itself and its neighboring
embeddings in the low-dimensional space is different, which may be helpful for
recognition but is often ignored. To take advantage of it, we propose a graph
neural network (GNN) backend to mine latent relationships among embeddings for
classification. We treat all the embeddings as nodes on a graph, and compute
their edges with a similarity function such as cosine, LDA+cosine, or
LDA+PLDA. We study different graph settings and explore variants of GNN to
find better message-passing and aggregation schemes for the recognition task.
Experimental results on the NIST SRE14 i-vector challenge and the VoxCeleb1-O,
VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate that our proposed GNN
backends significantly outperform current mainstream methods.
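The graph construction and message passing described above can be sketched as follows. The cosine-similarity threshold and the single mean-aggregation round are illustrative stand-ins for the learned GNN layers, not the paper's architecture:

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def build_edges(embeddings, threshold: float = 0.9):
    """Treat each embedding as a node; connect pairs whose cosine
    similarity exceeds a threshold (threshold value is a hypothetical
    choice; the paper also considers LDA+cosine and LDA+PLDA edges)."""
    edges = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) > threshold:
                edges.append((i, j))
    return edges

def mean_aggregate(embeddings, edges):
    """One round of mean message passing: each node averages its own
    embedding with those of its graph neighbors."""
    neigh = {i: [i] for i in range(len(embeddings))}
    for i, j in edges:
        neigh[i].append(j)
        neigh[j].append(i)
    dim = len(embeddings[0])
    return [[sum(embeddings[k][d] for k in neigh[i]) / len(neigh[i])
             for d in range(dim)] for i in range(len(embeddings))]

# toy usage: two near-duplicate embeddings get linked, the third stays apart
embs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(build_edges(embs))  # [(0, 1)]
```

After aggregation, each node's representation reflects its neighborhood, which is the local structure the abstract argues plain cosine or PLDA scoring ignores.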