Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding
In this letter, we propose a vocal tract length (VTL) perturbation method for
text-dependent speaker verification (TD-SV), in which a set of TD-SV systems
are trained, one for each VTL factor, and score-level fusion is applied to make
a final decision. Next, we explore the bottleneck (BN) feature extracted by
training deep neural networks with a self-supervised objective, autoregressive
predictive coding (APC), for TD-SV and compare it with the well-studied
speaker-discriminant BN feature. The proposed VTL method is then applied to APC
and speaker-discriminant BN features. In the end, we combine the VTL
perturbation systems trained on MFCC and the two BN features in the score
domain. Experiments are performed on the RedDots challenge 2016 database of
TD-SV using short utterances with Gaussian mixture model-universal background
model and i-vector techniques. Results show the proposed methods significantly
outperform the baselines.
Comment: Copyright (c) 2021 IEEE.
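As a rough illustration of the score-level fusion described in the abstract above, the sketch below averages the verification scores produced by several systems, one per VTL factor. The equal-weight average, the function name `fuse_scores`, and the example scores are all assumptions made for illustration; the abstract does not state which fusion rule the authors use.

```python
import numpy as np

def fuse_scores(scores_per_system, weights=None):
    """Fuse trial scores from several systems by a weighted average.

    scores_per_system: array-like of shape (n_systems, n_trials), one row
    per system (e.g. one TD-SV system per VTL factor).
    weights: optional per-system weights; equal weights are assumed if omitted.
    """
    scores = np.asarray(scores_per_system, dtype=float)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over systems gives one fused score per trial.
    return weights @ scores

# Hypothetical scores from three systems on three trials.
scores = np.array([
    [1.2, -0.5, 0.3],
    [0.8, -0.1, 0.5],
    [1.0, -0.3, 0.1],
])
fused = fuse_scores(scores)
```

The fused score for each trial is then compared against a decision threshold in the same way as a single-system score.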
Additive Margin SincNet for Speaker Recognition
Speaker recognition is a challenging task with essential applications such as
authentication, automation, and security. SincNet is a recent deep-learning
model that has produced promising results on this task.
When training deep learning systems, the loss function is essential to network
performance. The Softmax loss function is widely used in deep
learning, but it is not the best choice for all kinds of problems. For
distance-based problems, a newer Softmax-based loss function called Additive
Margin Softmax (AM-Softmax) has proven to be a better choice than the
traditional Softmax. AM-Softmax introduces a margin of separation between
the classes that forces samples from the same class to be closer to each
other and also maximizes the distance between classes. In this paper, we
propose a new approach for speaker recognition systems called AM-SincNet, which
is based on SincNet but uses an improved AM-Softmax layer. The proposed
method is evaluated on the TIMIT dataset and obtains an improvement of
approximately 40% in Frame Error Rate compared to SincNet.
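To make the margin idea in the abstract above concrete, here is a minimal NumPy sketch of the standard AM-Softmax loss: embeddings and class weights are length-normalized, the margin is subtracted from the target-class cosine, and the result is scaled before a cross-entropy. It is not the improved AM-Softmax layer the paper refers to, and the function name, the `scale` and `margin` defaults, and the random inputs are assumptions for illustration only.

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """Standard AM-Softmax loss (illustrative sketch).

    embeddings: (n_samples, dim); class_weights: (dim, n_classes);
    labels: integer class indices of shape (n_samples,).
    scale and margin defaults are assumed values, not taken from the paper.
    """
    # Normalize so the dot product is the cosine of the angle to each class.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = x @ w
    rows = np.arange(len(labels))
    logits = scale * cos
    # Subtract the additive margin from the target-class cosine only.
    logits[rows, labels] = scale * (cos[rows, labels] - margin)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

# Hypothetical data: 4 samples, 8-dim embeddings, 3 classes.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 3))
labels = np.array([0, 1, 2, 1])
loss_with_margin = am_softmax_loss(emb, W, labels)
loss_no_margin = am_softmax_loss(emb, W, labels, margin=0.0)
```

With the margin set to zero the function reduces to a scaled, normalized Softmax loss. A positive margin makes the target class harder to satisfy, which is what pushes same-class samples together during training.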