Additive Margin SincNet for Speaker Recognition
Speaker Recognition is a challenging task with essential applications such as
authentication, automation, and security. SincNet is a new deep-learning-based
model that has produced promising results on this task.
To train deep learning systems, the loss function is essential to the network
performance. The Softmax loss function is a widely used function in deep
learning methods, but it is not the best choice for all kinds of problems. For
distance-based problems, a new Softmax-based loss function called Additive
Margin Softmax (AM-Softmax) is proving to be a better choice than the
traditional Softmax. The AM-Softmax introduces a margin of separation between
the classes that forces the samples from the same class to be closer to each
other and also maximizes the distance between classes. In this paper, we
propose a new approach for speaker recognition systems called AM-SincNet, which
is based on the SincNet but uses an improved AM-Softmax layer. The proposed
method is evaluated on the TIMIT dataset and obtains an improvement of
approximately 40% in Frame Error Rate compared to SincNet.
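The abstract describes the AM-Softmax margin but does not give its formula; a minimal NumPy sketch of the standard AM-Softmax loss follows. The function name and the scale/margin values (s=30, m=0.35) are illustrative defaults, not taken from the paper.

```python
import numpy as np

def am_softmax_loss(features, weights, labels, s=30.0, m=0.35):
    """Additive Margin Softmax (AM-Softmax) cross-entropy loss.

    features: (N, D) embeddings; weights: (D, C) class weight vectors;
    labels: (N,) integer class ids; s: scale; m: additive margin.
    """
    # L2-normalize embeddings and class weights so logits are cosines
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = f @ w                               # (N, C) cosine similarities
    # subtract the margin m from the target-class cosine only
    cos_margin = cos.copy()
    cos_margin[np.arange(len(labels)), labels] -= m
    logits = s * cos_margin
    # numerically stable log-softmax, then mean negative log-likelihood
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Because m is subtracted only from the target-class logit, the loss for a given batch is always at least as large as with plain Softmax on the same cosines, which is what pushes same-class samples closer together during training.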
Full-info Training for Deep Speaker Feature Learning
In recent studies, it has been shown that speaker patterns can be learned from
very short speech segments (e.g., 0.3 seconds) by a carefully designed
convolutional and time-delay deep neural network (CT-DNN) model. By enforcing the
model to discriminate the speakers in the training data, frame-level speaker
features can be derived from the last hidden layer. In spite of its good
performance, a potential problem of the present model is that it involves a
parametric classifier, i.e., the last affine layer, which may consume some
discriminative knowledge, thus leading to 'information leak' in the feature
learning. This paper presents a full-info training approach that discards the
parametric classifier and forces all the discriminative knowledge to be learned
by the feature net. Our experiments on the Fisher database demonstrate that this
new training scheme can produce more coherent features, leading to consistent
and notable performance improvement on the speaker verification task.
Comment: Accepted by ICASSP 201
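The abstract does not specify how verification is scored once the parametric classifier is discarded; a common classifier-free setup for frame-level speaker features is to average them into an utterance embedding and compare enrollment and test embeddings with cosine similarity. A minimal sketch under that assumption (function names and the threshold are hypothetical):

```python
import numpy as np

def utterance_embedding(frame_features):
    """Average frame-level features into one utterance-level embedding."""
    e = frame_features.mean(axis=0)       # (D,) mean over frames
    return e / np.linalg.norm(e)          # length-normalize for cosine scoring

def verify(enroll_frames, test_frames, threshold=0.5):
    """Cosine-similarity speaker verification between two utterances."""
    score = float(utterance_embedding(enroll_frames)
                  @ utterance_embedding(test_frames))
    return score, score >= threshold
```

With length-normalized embeddings the dot product equals the cosine similarity, so the decision reduces to a single threshold on the score.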
Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition
In this work, we continue in our research on i-vector extractor for speaker
verification (SV) and we optimize its architecture for fast and effective
discriminative training. We were motivated by computational and memory
requirements caused by the large number of parameters of the original
generative i-vector model. Our aim is to preserve the power of the original
generative model, and at the same time focus the model towards extraction of
speaker-related information. We show that it is possible to represent a
standard generative i-vector extractor by a model with significantly fewer
parameters and obtain similar performance on SV tasks. We can further refine
this compact model by discriminative training and obtain i-vectors that lead to
better performance on various SV benchmarks representing different acoustic
domains.
Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note:
substantial text overlap with arXiv:1810.1318
- …