369 research outputs found
Full-info Training for Deep Speaker Feature Learning
In recent studies, it has shown that speaker patterns can be learned from
very short speech segments (e.g., 0.3 seconds) by a carefully designed
convolutional & time-delay deep neural network (CT-DNN) model. By enforcing the
model to discriminate the speakers in the training data, frame-level speaker
features can be derived from the last hidden layer. In spite of its good
performance, a potential problem of the present model is that it involves a
parametric classifier, i.e., the last affine layer, which may consume some
discriminative knowledge, thus leading to `information leak' for the feature
learning. This paper presents a full-info training approach that discards the
parametric classifier and enforces all the discriminative knowledge learned by
the feature net. Our experiments on the Fisher database demonstrate that this
new training scheme can produce more coherent features, leading to consistent
and notable performance improvement on the speaker verification task.Comment: Accepted by ICASSP 201
Max-margin Metric Learning for Speaker Recognition
Probabilistic linear discriminant analysis (PLDA) is a popular normalization
approach for the i-vector model, and has delivered state-of-the-art performance
in speaker recognition. A potential problem of the PLDA model, however, is that
it essentially assumes Gaussian distributions over speaker vectors, which is
not always true in practice. Additionally, the objective function is not
directly related to the goal of the task, e.g., discriminating true speakers
and imposters. In this paper, we propose a max-margin metric learning approach
to solve the problems. It learns a linear transform with a criterion that the
margin between target and imposter trials are maximized. Experiments conducted
on the SRE08 core test show that compared to PLDA, the new approach can obtain
comparable or even better performance, though the scoring is simply a cosine
computation
Phone-aware Neural Language Identification
Pure acoustic neural models, particularly the LSTM-RNN model, have shown
great potential in language identification (LID). However, the phonetic
information has been largely overlooked by most of existing neural LID models,
although this information has been used in the conventional phonetic LID
systems with a great success. We present a phone-aware neural LID architecture,
which is a deep LSTM-RNN LID system but accepts output from an RNN-based ASR
system. By utilizing the phonetic knowledge, the LID performance can be
significantly improved. Interestingly, even if the test language is not
involved in the ASR training, the phonetic knowledge still presents a large
contribution. Our experiments conducted on four languages within the Babel
corpus demonstrated that the phone-aware approach is highly effective.Comment: arXiv admin note: text overlap with arXiv:1705.0315
A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification
For practical automatic speaker verification (ASV) systems, replay attack
poses a true risk. By replaying a pre-recorded speech signal of the genuine
speaker, ASV systems tend to be easily fooled. An effective replay detection
method is therefore highly desirable. In this study, we investigate a major
difficulty in replay detection: the over-fitting problem caused by variability
factors in speech signal. An F-ratio probing tool is proposed and three
variability factors are investigated using this tool: speaker identity, speech
content and playback & recording device. The analysis shows that device is the
most influential factor that contributes the highest over-fitting risk. A
frequency warping approach is studied to alleviate the over-fitting problem, as
verified on the ASV-spoof 2017 database
- …