2,812 research outputs found
Speaker Verification By Partial AUC Optimization With Mahalanobis Distance Metric Learning
Receiver operating characteristic (ROC) and detection error tradeoff (DET)
curves are two widely used evaluation metrics for speaker verification. They
are equivalent, since the latter can be obtained from the former by transforming
the true-positive y-axis into a false-negative y-axis and then re-scaling both axes by
a probit operator. Real-world speaker verification systems, however, usually
work on part of the ROC curve instead of the entire ROC curve given an
application. Therefore, we propose in this paper to use the area under part of
the ROC curve (pAUC) as a more efficient evaluation metric for speaker
verification. A back-end based on Mahalanobis distance metric learning is
applied to optimize pAUC, where the metric-learning formulation guarantees
that the optimization objective of the back-end is convex, so that the
global optimum solution is achievable. To improve the performance of the
state-of-the-art speaker verification systems by the proposed back-end, we
further propose two feature preprocessing techniques based on
length-normalization and probabilistic linear discriminant analysis
respectively. We evaluate the proposed systems on the major languages of NIST
SRE16 and the core tasks of SITW. Experimental results show that the proposed
back-end outperforms the state-of-the-art speaker verification back-ends in
terms of seven evaluation metrics.
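As a rough numpy sketch (not the authors' implementation; the function names and the pairwise pAUC estimator below are assumptions), pAUC over a false-positive range can be estimated from trial scores, and factorizing the metric as M = L^T L keeps it positive semi-definite:

```python
import numpy as np

def partial_auc(target_scores, nontarget_scores, max_fpr=0.1):
    """pAUC over FPR in [0, max_fpr]: the fraction of (target, non-target)
    pairs ranked correctly, counting only the hardest (highest-scoring)
    non-targets that fall inside the false-positive range."""
    k = max(1, int(np.floor(max_fpr * len(nontarget_scores))))
    hardest = np.sort(np.asarray(nontarget_scores))[-k:]
    t = np.asarray(target_scores)
    return float((t[:, None] > hardest[None, :]).mean())

def mahalanobis_score(x, y, L):
    """Similarity from a learned metric M = L^T L; the factorization
    guarantees M stays positive semi-definite."""
    d = L @ (np.asarray(x) - np.asarray(y))
    return -float(d @ d)
```

With well-separated scores the pAUC approaches 1; with a Euclidean metric (L = I) the score reduces to negative squared distance.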
i Vector used in Speaker Identification by Dimension Compactness
The automatic speaker identification procedure extracts features that help to
identify the components of the acoustic signal while discarding everything
else, such as background noise, emotion, and hesitation. The acoustic signal
produced by a human is filtered by the shape of the vocal tract, including the
tongue, teeth, etc. The shape of the vocal tract determines what signal comes
out in real time, and it manifests itself as the envelope of the short-time
power spectrum. Automatic speaker recognition therefore needs an efficient way
of extracting features from the acoustic signal that effectively characterizes
the shape of the individual vocal tract. To identify an acoustic signal within
a large collection of acoustic signals, i.e. a corpus, dimension compactness of
the total variability space is required, obtained by using the GMM mean
supervector. This work presents an efficient way to implement dimension
compactness in the total variability space, using cosine distance scoring to
produce a fast output score for short utterances.
Comment: 6 pages, 7 figures
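For illustration, cosine distance scoring over i-vectors reduces to a dot product of length-normalized vectors. A minimal sketch (all names are assumptions, not this paper's code), with closed-set identification as an argmax over enrolled speakers:

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine distance scoring between two i-vectors: length-normalize
    both and take the dot product (higher means more similar)."""
    w1 = np.asarray(w_enroll) / np.linalg.norm(w_enroll)
    w2 = np.asarray(w_test) / np.linalg.norm(w_test)
    return float(w1 @ w2)

def identify(w_test, enrolled):
    """Closed-set identification: pick the enrolled speaker whose
    i-vector scores highest against the test utterance."""
    return max(enrolled, key=lambda spk: cosine_score(enrolled[spk], w_test))
```

The scoring itself needs no further training, which is why it is fast for short utterances.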
Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization and speech recognition. This paper provides a novel
approach, we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis which assumes two temporally-close short
speech segments belong to the same speaker, and thus a common representation
that encodes the commonalities of both segments should capture the
vocal characteristics of that speaker. We train a convolutional deep siamese
network to produce "speaker embeddings" by learning to separate `same' vs
`different' speaker pairs, which are generated from unlabeled audio
streams. Two sets of experiments are done in different scenarios to evaluate
the strength of NPC embeddings and compare with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings are found to perform
best in the short-duration experiment, and they provide information
complementary to i-vectors in the full-utterance experiments. Second, a
large-scale speaker verification task with a wide range of within-speaker
channel variability is adopted as an upper-bound experiment, where comparisons
are drawn with in-domain supervised methods.
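The stationarity hypothesis can be sketched as a pair generator plus a contrastive loss for the siamese network. This is a hedged illustration, not the NPC code: the labels are only proxies derived from temporal position, and segment lengths and streams are placeholders:

```python
import numpy as np

def make_pairs(streams, seg_len, rng=np.random.default_rng(0)):
    """Short-term active-speaker stationarity: two temporally adjacent
    segments of one stream are labeled 'same'; segments drawn from two
    different streams are labeled 'different'. No true speaker labels
    are used."""
    same, diff = [], []
    for s in streams:
        for start in range(0, len(s) - 2 * seg_len + 1, seg_len):
            same.append((s[start:start + seg_len],
                         s[start + seg_len:start + 2 * seg_len]))
    for _ in range(len(same)):
        i, j = rng.choice(len(streams), size=2, replace=False)
        diff.append((streams[i][:seg_len], streams[j][:seg_len]))
    return same, diff

def contrastive_loss(dist, is_same, margin=1.0):
    """Pull 'same' pairs together; push 'different' pairs past the margin."""
    return dist ** 2 if is_same else max(0.0, margin - dist) ** 2
```

In practice the 'different' labels are noisy (two streams may share a speaker), which the paper's large-scale setting tolerates.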
Domain Aware Training for Far-field Small-footprint Keyword Spotting
In this paper, we focus on the task of small-footprint keyword spotting under
the far-field scenario. Far-field environments are commonly encountered in
real-life speech applications, causing severe degradation of performance due to
room reverberation and various kinds of noise. Our baseline system is built on
a convolutional neural network trained with pooled data of both far-field and
close-talking speech. To cope with the distortions, we develop three
domain-aware training systems, including the domain embedding system, the deep CORAL
system, and the multi-task learning system. These methods incorporate domain
knowledge into network training and improve the performance of the keyword
classifier on far-field conditions. Experimental results show that our proposed
methods manage to maintain the performance on the close-talking speech and
achieve significant improvement on the far-field test set.
Comment: Submitted to INTERSPEECH 202
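Of the three systems, deep CORAL has a particularly compact form: it penalizes the distance between the second-order statistics of the two domains. A minimal numpy sketch (assuming batch feature matrices of shape [n, d]; not the paper's implementation):

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    """Deep CORAL: squared Frobenius distance between the feature
    covariances of two domains (e.g. close-talking vs. far-field),
    normalized by 4*d^2 as in the original CORAL formulation."""
    def cov(x):
        xm = x - x.mean(axis=0, keepdims=True)
        return xm.T @ xm / (x.shape[0] - 1)
    d = source_feats.shape[1]
    diff = cov(source_feats) - cov(target_feats)
    return float(np.sum(diff ** 2)) / (4.0 * d * d)
```

Added to the classification loss, this term pushes the network to produce domain-invariant hidden features.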
Large Margin Softmax Loss for Speaker Verification
In neural network based speaker verification, speaker embedding is expected
to be discriminative between speakers while the intra-speaker distance should
remain small. A variety of loss functions have been proposed to achieve this
goal. In this paper, we investigate the large margin softmax loss with
different configurations in speaker verification. Ring loss and minimum
hyperspherical energy criterion are introduced to further improve the
performance. Results on VoxCeleb show that our best system outperforms the
baseline approach by 15% in EER, and by 13% and 33% in minDCF08 and minDCF10,
respectively.
Comment: submitted to Interspeech 2019. The code and models have been released.
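"Large margin softmax" covers several variants; one common configuration is the additive-margin (AM) softmax, sketched below on length-normalized embeddings and class weights. The margin and scale values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def am_softmax_loss(embedding, weights, label, margin=0.2, scale=30.0):
    """Additive-margin softmax: the target-class cosine is penalized by
    `margin` before the scaled softmax, forcing a larger angular gap
    between the true speaker and the others."""
    e = embedding / np.linalg.norm(embedding)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ e                          # cosine to every speaker class
    logits = scale * cos
    logits[label] = scale * (cos[label] - margin)
    logits -= logits.max()               # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(p[label]))
```

For a correctly classified sample the margin makes the loss strictly larger than plain softmax, which is exactly the pressure that tightens intra-speaker clusters.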
Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model
In this paper, we propose a Convolutional Neural Network (CNN) based speaker
recognition model for extracting robust speaker embeddings. The embedding can
be extracted efficiently with linear activation in the embedding layer. To
understand how the speaker recognition model operates with text-independent
input, we modify the structure to extract frame-level speaker embeddings from
each hidden layer. We feed utterances from the TIMIT dataset to the trained
network and use several proxy tasks to study the network's ability to represent
speech input and differentiate voice identity. We found that the networks are
better at discriminating broad phonetic classes than individual phonemes. In
particular, frame-level embeddings that belong to the same phonetic classes are
similar (based on cosine distance) for the same speaker. The frame level
representation also allows us to analyze the networks at the frame level, and
has the potential for other analyses to improve speaker recognition.
Comment: Accepted at SLT 2018; Supplementary materials:
https://people.csail.mit.edu/swshon/supplement/slt18.htm
Spoken Pass-Phrase Verification in the i-vector Space
The task of spoken pass-phrase verification is to decide whether a test
utterance contains the same phrase as given enrollment utterances. Besides other
applications, pass-phrase verification can complement an independent speaker
verification subsystem in text-dependent speaker verification. It can also be
used for liveness detection by verifying that the user is able to correctly
respond to a randomly prompted phrase. In this paper, we build on our previous
work on i-vector based text-dependent speaker verification, where we have shown
that i-vectors extracted using phrase specific Hidden Markov Models (HMMs) or
using Deep Neural Network (DNN) based bottle-neck (BN) features help to reject
utterances with wrong pass-phrases. We apply the same i-vector extraction
techniques to the stand-alone task of speaker-independent spoken pass-phrase
classification and verification. The experiments on RSR2015 and RedDots
databases show that very simple scoring techniques (e.g. cosine distance
scoring) applied to such i-vectors can provide results superior to those
previously published on the same data.
Deep Segment Attentive Embedding for Duration Robust Speaker Verification
LSTM-based speaker verification usually uses a fixed-length local segment
randomly truncated from an utterance to learn the utterance-level speaker
embedding, while using the average embedding of all segments of a test
utterance to verify the speaker, which results in a critical mismatch between
testing and training. This mismatch degrades the performance of speaker
verification, especially when the durations of training and testing utterances
are very different. To alleviate this issue, we propose the deep segment
attentive embedding method to learn the unified speaker embeddings for
utterances of variable duration. Each utterance is segmented by a sliding
window and LSTM is used to extract the embedding of each segment. Instead of
only using one local segment, we use the whole utterance to learn the
utterance-level embedding by applying an attentive pooling to the embeddings of
all segments. Moreover, the similarity loss of segment-level embeddings is
introduced to guide the segment attention to focus on the most
speaker-discriminative segments, and it is jointly optimized with the similarity
loss of utterance-level embeddings. Systematic experiments on Tongdun and
VoxCeleb show that the proposed method significantly improves robustness to
duration variation and achieves relative Equal Error Rate reductions of 50% and
11.54%, respectively.
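The attentive pooling step can be sketched as a softmax-weighted average of the LSTM segment embeddings; `attn_vec` stands in for the learned attention parameters and is an assumption of this sketch:

```python
import numpy as np

def attentive_pool(segment_embs, attn_vec):
    """Collapse per-segment embeddings (shape [n_segments, dim]) into one
    utterance-level embedding via softmax attention weights."""
    scores = segment_embs @ attn_vec      # one scalar score per segment
    scores = scores - scores.max()        # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ segment_embs
```

With uniform scores this degenerates to plain averaging; a trained attention vector instead up-weights the more speaker-discriminative segments.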
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
In practical settings, a speaker recognition system needs to identify a
speaker given a short utterance, while the enrollment utterance may be
relatively long. However, existing speaker recognition models perform poorly
with such short utterances. To solve this problem, we introduce a meta-learning
framework for imbalance length pairs. Specifically, we use a Prototypical
Network and train it with a support set of long utterances and a query set of
short utterances of varying lengths. Further, since optimizing only for the
classes in the given episode may be insufficient for learning discriminative
embeddings for unseen classes, we additionally enforce the model to classify
both the support and the query set against the entire set of classes in the
training set. By combining these two learning schemes, our model outperforms
existing state-of-the-art speaker verification models learned with a standard
supervised learning framework on short utterances (1-2 seconds) on the VoxCeleb
datasets. We also validate our proposed model for unseen speaker
identification, on which it also achieves significant performance gains over
the existing approaches. The codes are available at
https://github.com/seongmin-kye/meta-SR.
Comment: Accepted to Interspeech 2020
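A single Prototypical Networks episode with long-utterance support sets and a short-utterance query can be sketched as follows (a minimal illustration under assumed shapes, not the released code):

```python
import numpy as np

def prototypical_episode(support_sets, query, query_label):
    """One episode: each class prototype is the mean of that speaker's
    (long-utterance) support embeddings; the (short-utterance) query is
    scored by negative squared distance to each prototype.
    Returns (cross-entropy loss, predicted class)."""
    protos = np.stack([np.mean(s, axis=0) for s in support_sets])  # [C, D]
    logits = -np.sum((protos - query) ** 2, axis=1)
    logits = logits - logits.max()        # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(p[query_label])), int(np.argmax(p))
```

The paper's second scheme additionally classifies support and query embeddings against all training classes, which this episode-local sketch omits.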
Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings
The short duration of an input utterance is one of the most critical threats
that degrade the performance of speaker verification systems. This study aimed
to develop an integrated text-independent speaker verification system that
inputs utterances with short duration of 2 seconds or less. We propose an
approach using a teacher-student learning framework for this goal, applied to
short utterance compensation for the first time to our knowledge. The core
concept of the proposed system is to conduct the compensation throughout the
network that extracts the speaker embedding, mainly at the phonetic level, rather
than compensating via a separate system after extracting the speaker embedding.
In the proposed architecture, phonetic-level features where each feature
represents a segment of 130 ms are extracted using convolutional layers. A
layer of gated recurrent units extracts an utterance-level feature using
phonetic-level features. The proposed approach also adopts a new objective
function for teacher-student learning that considers both Kullback-Leibler
divergence of output layers and cosine distance of speaker embeddings layers.
Experiments were conducted using deep neural networks that take raw waveforms
as input and output speaker embeddings, trained on the VoxCeleb1 dataset. The
proposed model could compensate for approximately 65% of the performance
degradation due to the shortened duration.
Comment: 5 pages, 2 figures, submitted to Interspeech 2019 as a conference paper
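The combined objective can be sketched as an interpolation of the KL divergence between teacher and student posteriors and the cosine distance between their embeddings (the weight `alpha` is an assumption; the paper's exact weighting may differ):

```python
import numpy as np

def ts_loss(teacher_post, student_post, teacher_emb, student_emb, alpha=0.5):
    """Teacher-student objective sketch: KL(teacher || student) on the
    speaker posteriors plus the cosine distance (1 - cosine similarity)
    between the teacher and student embeddings."""
    eps = 1e-12
    kl = float(np.sum(teacher_post * np.log((teacher_post + eps) /
                                            (student_post + eps))))
    cos = float(teacher_emb @ student_emb /
                (np.linalg.norm(teacher_emb) * np.linalg.norm(student_emb)))
    return alpha * kl + (1.0 - alpha) * (1.0 - cos)
```

When the student (fed short utterances) matches the teacher (fed full utterances) in both posteriors and embedding direction, the loss vanishes, which is the compensation target.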