Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model
In this paper, we propose a Convolutional Neural Network (CNN) based speaker
recognition model for extracting robust speaker embeddings. The embedding can
be extracted efficiently with linear activation in the embedding layer. To
understand how the speaker recognition model operates with text-independent
input, we modify the structure to extract frame-level speaker embeddings from
each hidden layer. We feed utterances from the TIMIT dataset to the trained
network and use several proxy tasks to study the network's ability to represent
speech input and differentiate voice identity. We found that the networks are
better at discriminating broad phonetic classes than individual phonemes. In
particular, frame-level embeddings that belong to the same phonetic classes are
similar (based on cosine distance) for the same speaker. The frame-level
representation also allows us to analyze the networks frame by frame, and has
the potential to support further analyses that improve speaker recognition.
Comment: Accepted at SLT 2018; supplementary materials:
https://people.csail.mit.edu/swshon/supplement/slt18.htm
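
As a rough sketch of this analysis setup (not the authors' code), the toy
1D-CNN below exposes the activations of every hidden layer as frame-level
embeddings, uses a linear (identity) activation in the embedding layer, and
compares two frames by cosine similarity; the layer sizes and the 630-speaker
output (TIMIT's speaker count) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameLevelCNN(nn.Module):
    """Toy 1D-CNN speaker classifier that also returns the activations of
    every hidden layer, so frame-level embeddings can be inspected."""
    def __init__(self, feat_dim=40, hidden=256, n_speakers=630):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
        ])
        self.embed = nn.Conv1d(hidden, hidden, kernel_size=1)  # linear activation
        self.classifier = nn.Linear(hidden, n_speakers)

    def forward(self, x):                    # x: (batch, feat_dim, frames)
        per_layer = []
        for conv in self.convs:
            x = F.relu(conv(x))
            per_layer.append(x)              # frame-level embeddings per layer
        x = self.embed(x)                    # no nonlinearity in this layer
        per_layer.append(x)
        return self.classifier(x.mean(dim=2)), per_layer

model = FrameLevelCNN()
_, layers = model(torch.randn(1, 40, 300))   # one 300-frame utterance
# cosine similarity between two frames at the deepest layer
a, b = layers[-1][0, :, 10], layers[-1][0, :, 20]
print(F.cosine_similarity(a, b, dim=0).item())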
Deep Speaker: an End-to-End Neural Speaker Embedding System
We present Deep Speaker, a neural speaker embedding system that maps
utterances to a hypersphere where speaker similarity is measured by cosine
similarity. The embeddings generated by Deep Speaker can be used for many
tasks, including speaker identification, verification, and clustering. We
experiment with ResCNN and GRU architectures to extract the acoustic features,
then mean pool to produce utterance-level speaker embeddings, and train using
triplet loss based on cosine similarity. Experiments on three distinct datasets
suggest that Deep Speaker outperforms a DNN-based i-vector baseline. For
example, Deep Speaker reduces the verification equal error rate by 50%
(relatively) and improves the identification accuracy by 60% (relatively) on a
text-independent dataset. We also present results that suggest adapting from a
model trained with Mandarin can improve accuracy for English speaker
recognition.
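
A minimal sketch of the training criterion described here: triplet loss on
cosine similarity between L2-normalized (hypersphere) utterance embeddings.
The margin value and dimensions are assumptions, not Deep Speaker's settings.

import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet loss on cosine similarity: push sim(anchor, positive) above
    sim(anchor, negative) by at least `margin`. L2 normalization places the
    embeddings on the unit hypersphere, so dot products are cosines."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    sim_ap = (a * p).sum(dim=-1)   # same-speaker similarity
    sim_an = (a * n).sum(dim=-1)   # different-speaker similarity
    return F.relu(sim_an - sim_ap + margin).mean()

# toy usage with random 512-dim utterance embeddings
a, p, n = (torch.randn(8, 512) for _ in range(3))
print(cosine_triplet_loss(a, p, n).item())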
Speaker Embedding Extraction with Phonetic Information
Speaker embeddings achieve promising results on many speaker verification
tasks. Phonetic information, as an important component of speech, is rarely
considered in the extraction of speaker embeddings. In this paper, we introduce
phonetic information to the speaker embedding extraction based on the x-vector
architecture. Two methods using phonetic vectors and multi-task learning are
proposed. On the Fisher dataset, our best system outperforms the original
x-vector approach by 20% in EER and by 15% in both minDCF08 and minDCF10.
Experiments conducted on NIST SRE10 further demonstrate the
effectiveness of the proposed methods.
Comment: Submitted to Interspeech 2018 (accepted) and open-sourced. Please
refer to Interspeech for the final version.
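
A sketch of the multi-task variant described in the abstract: a shared
frame-level trunk with a per-frame phone classifier as the auxiliary task and
a speaker classifier after statistics pooling as the primary task. The layer
sizes, phone inventory, and auxiliary-loss weight are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PhoneticXVector(nn.Module):
    """Shared trunk feeding a frame-level phone head (auxiliary task) and,
    after statistics pooling, a speaker head (primary task)."""
    def __init__(self, feat_dim=23, hidden=512, n_phones=40, n_speakers=5000):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.phone_head = nn.Conv1d(hidden, n_phones, 1)  # per-frame phones
        self.segment = nn.Linear(2 * hidden, hidden)      # embedding layer
        self.speaker_head = nn.Linear(hidden, n_speakers)

    def forward(self, x):                     # x: (batch, feat_dim, frames)
        h = self.trunk(x)
        phone_logits = self.phone_head(h)     # (batch, n_phones, frames)
        stats = torch.cat([h.mean(2), h.std(2)], dim=1)   # statistics pooling
        emb = self.segment(stats)             # the speaker embedding
        return self.speaker_head(emb), phone_logits

model = PhoneticXVector()
x = torch.randn(4, 23, 300)
spk_logits, phone_logits = model(x)
loss = F.cross_entropy(spk_logits, torch.randint(0, 5000, (4,))) \
       + 0.5 * F.cross_entropy(phone_logits, torch.randint(0, 40, (4, 300)))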
Attention Mechanism in Speaker Recognition: What Does It Learn in Deep Speaker Embedding?
This paper presents an experimental study on deep speaker embedding with an
attention mechanism that has been found to be a powerful representation
learning technique in speaker recognition. In this framework, an attention
model works as a frame selector that computes an attention weight for each
frame-level feature vector, according to which an utterance-level
representation is produced at the pooling layer in a speaker embedding network.
In general, an attention model is trained together with the speaker embedding
network on a single objective function, and thus those two components are
tightly bound to one another. In this paper, we consider the possibility that
the attention model might be decoupled from its parent network and assist other
speaker embedding networks and even conventional i-vector extractors. This
possibility is demonstrated through a series of experiments on a NIST Speaker
Recognition Evaluation (SRE) task, with a 9.0% EER reduction and a 3.8%
minCprimary reduction when the attention weights are applied to i-vector
extraction. Another experiment shows that DNN-based soft voice activity
detection (VAD) can be effectively combined with the attention mechanism to
yield further reduction of minCprimary by 6.6% and 1.6% in deep speaker
embedding and i-vector systems, respectively.
Comment: SLT 2018 (Workshop on Spoken Language Technology)
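
A minimal sketch of the attention model described above: it scores each
frame-level vector, softmaxes the scores into weights, and produces the
utterance-level representation as a weighted mean at the pooling layer. The
hidden size is an assumption.

import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention model as a frame selector: score each frame-level vector,
    softmax the scores into weights, and take the weighted mean as the
    utterance-level representation."""
    def __init__(self, dim=512, att_hidden=128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(dim, att_hidden), nn.Tanh(),
            nn.Linear(att_hidden, 1),
        )

    def forward(self, frames):               # frames: (batch, time, dim)
        weights = torch.softmax(self.att(frames), dim=1)  # (batch, time, 1)
        return (weights * frames).sum(dim=1), weights

pool = AttentivePooling()
utt, w = pool(torch.randn(4, 200, 512))       # 4 utterances, 200 frames
print(utt.shape, w.shape)                     # [4, 512] and [4, 200, 1]

Because the module consumes only frame-level vectors and emits one weight per
frame, the weights could in principle be exported and used to weight the frame
statistics of another embedding network or an i-vector extractor, which is
the decoupling the abstract investigates.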
Deep Segment Attentive Embedding for Duration Robust Speaker Verification
LSTM-based speaker verification usually uses a fixed-length local segment
randomly truncated from an utterance to learn the utterance-level speaker
embedding, while using the average embedding of all segments of a test
utterance to verify the speaker, which results in a critical mismatch between
testing and training. This mismatch degrades the performance of speaker
verification, especially when the durations of training and testing utterances
are very different. To alleviate this issue, we propose the deep segment
attentive embedding method to learn the unified speaker embeddings for
utterances of variable duration. Each utterance is segmented by a sliding
window, and an LSTM is used to extract the embedding of each segment. Instead
of using only one local segment, we use the whole utterance to learn the
utterance-level embedding by applying attentive pooling to the embeddings of
all segments. Moreover, a similarity loss on segment-level embeddings is
introduced to guide the segment attention toward the segments that carry more
speaker-discriminative information; it is jointly optimized with the
similarity loss on
utterance-level embeddings. Systematic experiments on Tongdun and VoxCeleb show
that the proposed method significantly improves robustness to duration
variation, achieving relative Equal Error Rate reductions of 50% and 11.54%,
respectively.
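
A sketch of the segment pipeline under illustrative assumptions (window and
hop sizes, single-layer LSTM); the segment-level and utterance-level
similarity losses described in the abstract are omitted here.

import torch
import torch.nn as nn

class SegmentAttentiveEmbedding(nn.Module):
    """Slide a fixed window over the utterance, embed every segment with an
    LSTM, then attentively pool the segment embeddings into one
    duration-robust utterance-level embedding."""
    def __init__(self, feat_dim=40, hidden=256, win=100, hop=50):
        super().__init__()
        self.win, self.hop = win, hop
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (frames, feat_dim)
        segs = x.unfold(0, self.win, self.hop)   # (n_seg, feat_dim, win)
        segs = segs.permute(0, 2, 1)             # (n_seg, win, feat_dim)
        _, (h, _) = self.lstm(segs)              # h: (1, n_seg, hidden)
        seg_emb = h.squeeze(0)                   # one embedding per segment
        w = torch.softmax(self.att(seg_emb), dim=0)
        return (w * seg_emb).sum(dim=0)          # attentive pooling

model = SegmentAttentiveEmbedding()
print(model(torch.randn(400, 40)).shape)         # torch.Size([256])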
RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification
Recently, direct modeling of raw waveforms using deep neural networks has
been widely studied for a number of tasks in audio domains. In speaker
verification, however, utilization of raw waveforms is in its preliminary
phase, requiring further investigation. In this study, we explore end-to-end
deep neural networks that input raw waveforms to improve various aspects:
front-end speaker embedding extraction including model architecture,
pre-training scheme, additional objective functions, and back-end
classification. Adjusting the model architecture with a pre-training scheme
yields speaker embeddings that give a significant improvement in performance.
Additional objective functions simplify the extraction of speaker embeddings
by merging the conventional two-phase process of extracting utterance-level
features, such as i-vectors or x-vectors, and then enhancing them, e.g., with
linear discriminant analysis. Effective back-end
classification models that suit the proposed speaker embedding are also
explored. We propose an end-to-end system that comprises two deep neural
networks, one front-end for utterance-level speaker embedding extraction and
the other for back-end classification. Experiments conducted on the VoxCeleb1
dataset demonstrate that the proposed model achieves state-of-the-art
performance among systems without data augmentation. The proposed system is
also comparable to the state-of-the-art x-vector system that adopts data
augmentation.
Comment: Accepted for oral presentation at Interspeech 2019; code available at
http://github.com/Jungjee/RawNet
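
A minimal illustration of the raw-waveform idea, not the RawNet architecture
itself (those details live in the linked repository): a strided Conv1d acts
as a learnable filterbank over samples, followed by convolutional blocks and
pooling to an utterance-level embedding. All sizes are assumptions.

import torch
import torch.nn as nn

class RawWaveformFrontend(nn.Module):
    """Strided Conv1d as a learnable filterbank over raw samples, conv
    blocks, then mean pooling to an utterance-level speaker embedding."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.filterbank = nn.Conv1d(1, 128, kernel_size=251, stride=160)
        self.blocks = nn.Sequential(
            nn.BatchNorm1d(128), nn.LeakyReLU(),
            nn.Conv1d(128, 128, 3, padding=1), nn.LeakyReLU(),
            nn.Conv1d(128, 128, 3, padding=1), nn.LeakyReLU(),
        )
        self.embed = nn.Linear(128, emb_dim)

    def forward(self, wav):                     # wav: (batch, samples)
        x = self.filterbank(wav.unsqueeze(1))   # (batch, 128, frames)
        x = self.blocks(x)
        return self.embed(x.mean(dim=2))        # mean pool, then embed

model = RawWaveformFrontend()
emb = model(torch.randn(2, 16000))              # two 1-second 16 kHz waveforms
print(emb.shape)                                # torch.Size([2, 256])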
Generative x-vectors for text-independent speaker verification
Speaker verification (SV) systems using deep neural network embeddings, the
so-called x-vector systems, are becoming popular because their performance is
superior to that of i-vector systems. The fusion of these systems
provides improved performance, benefiting both from the discriminatively
trained x-vectors and from the generative i-vectors, which capture distinct
speaker characteristics. In this paper, we propose a novel method, called the
generative x-vector, that combines the complementary information of i-vectors
and x-vectors. The
generative x-vector utilizes a transformation model learned from the i-vector
and x-vector representations of the background data. Canonical correlation
analysis is applied to derive this transformation model, which is later used to
transform the standard x-vectors of the enrollment and test segments to the
corresponding generative x-vectors. The SV experiments performed on the NIST
SRE 2010 dataset demonstrate that the system using generative x-vectors
provides considerably better performance than the baseline i-vector and
x-vector systems. Furthermore, the generative x-vectors outperform the fusion
of i-vector and x-vector systems for long-duration utterances, while yielding
comparable results for short-duration utterances.
Comment: Accepted for publication at SLT 2018
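
A toy sketch of the CCA step with scikit-learn; the vector dimensions, the
number of components, and the exact direction in which the learned projection
is applied are assumptions, not the paper's recipe.

import numpy as np
from sklearn.cross_decomposition import CCA

# Toy stand-ins for i-vectors and x-vectors of the same background
# utterances (in practice these come from a real background set).
rng = np.random.default_rng(0)
ivecs = rng.standard_normal((1000, 400))   # generative representations
xvecs = rng.standard_normal((1000, 512))   # discriminative representations

# Learn the CCA transformation between the two spaces on background data.
cca = CCA(n_components=100)
cca.fit(xvecs, ivecs)

# Map standard x-vectors of enrollment/test segments through the learned
# transform; the abstract calls the result "generative x-vectors".
test_xvecs = rng.standard_normal((10, 512))
generative_xvecs = cca.transform(test_xvecs)
print(generative_xvecs.shape)              # (10, 100)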
Multi-Task Learning with High-Order Statistics for X-vector based Text-Independent Speaker Verification
The x-vector based deep neural network (DNN) embedding systems have
demonstrated effectiveness for text-independent speaker verification. This
paper presents a multi-task learning architecture for training the speaker
embedding DNN with the primary task of classifying the target speakers, and the
auxiliary task of reconstructing the first- and higher-order statistics of the
original input utterance. The proposed training strategy aggregates both the
supervised and unsupervised learning into one framework to make the speaker
embeddings more discriminative and robust. Experiments are carried out using
the NIST SRE16 evaluation dataset and the VOiCES dataset. The results
demonstrate that our proposed method outperforms the original x-vector approach
while adding very little extra complexity.
Comment: 5 pages, 2 figures, submitted to INTERSPEECH 2019
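
A sketch of the joint objective under stated assumptions: one trunk with a
supervised speaker-classification head and an unsupervised head that
reconstructs the first- and second-order statistics of the input utterance
(higher orders could be added); the loss weight and sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def statistics_targets(feats):
    """First- and second-order statistics of the input frames, used as the
    unsupervised reconstruction targets."""
    return torch.cat([feats.mean(dim=2), feats.std(dim=2)], dim=1)

class StatsMultiTask(nn.Module):
    """One trunk, a speaker head (supervised) and a statistics-
    reconstruction head (unsupervised), trained jointly."""
    def __init__(self, feat_dim=23, hidden=512, n_speakers=5000):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.embed = nn.Linear(2 * hidden, hidden)
        self.spk_head = nn.Linear(hidden, n_speakers)
        self.stats_head = nn.Linear(hidden, 2 * feat_dim)  # reconstruct stats

    def forward(self, x):                  # x: (batch, feat_dim, frames)
        h = self.trunk(x)
        emb = self.embed(torch.cat([h.mean(2), h.std(2)], dim=1))
        return self.spk_head(emb), self.stats_head(emb)

model = StatsMultiTask()
x = torch.randn(4, 23, 300)
spk_logits, stats_pred = model(x)
loss = F.cross_entropy(spk_logits, torch.randint(0, 5000, (4,))) \
       + 0.1 * F.mse_loss(stats_pred, statistics_targets(x))  # weight assumed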
Self-supervised speaker embeddings
Unlike i-vectors, speaker embeddings such as x-vectors are incapable of
leveraging unlabelled utterances, due to the classification loss over training
speakers. In this paper, we explore an alternative training strategy to enable
the use of unlabelled utterances in training. We propose to train speaker
embedding extractors via reconstructing the frames of a target speech segment,
given the inferred embedding of another speech segment of the same utterance.
We do this by attaching to the standard speaker embedding extractor a decoder
network, which we feed not merely with the speaker embedding, but also with the
estimated phone sequence of the target frame sequence. The reconstruction loss
can be used either as a single objective, or be combined with the standard
speaker classification loss. In the latter case, it acts as a regularizer,
encouraging generalizability to speakers unseen during training. In all cases,
the proposed architectures are trained from scratch and in an end-to-end
fashion. We demonstrate the benefits of the proposed approach on VoxCeleb and
Speakers in the Wild, and we report notable improvements over the baseline.
Comment: Preprint. Submitted to Interspeech 2019. Updated results compared to
the first version and minor corrections.
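
A minimal sketch of the reconstruction objective, assuming the decoder sees
the speaker embedding inferred from one segment plus per-frame phone
posteriors of the target segment; the per-frame MLP decoder and all
dimensions are assumptions, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionDecoder(nn.Module):
    """Reconstruct the frames of a target segment from (a) the speaker
    embedding of *another* segment of the same utterance and (b) per-frame
    phone posteriors of the target segment."""
    def __init__(self, emb_dim=512, n_phones=40, feat_dim=40, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + n_phones, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, spk_emb, phone_post):  # (batch, emb), (batch, T, phones)
        T = phone_post.size(1)
        emb = spk_emb.unsqueeze(1).expand(-1, T, -1)  # broadcast over frames
        return self.net(torch.cat([emb, phone_post], dim=-1))

dec = ReconstructionDecoder()
spk_emb = torch.randn(2, 512)            # embedding from segment A
phones = torch.softmax(torch.randn(2, 150, 40), dim=-1)  # phones of segment B
target = torch.randn(2, 150, 40)         # frames of segment B
loss = F.mse_loss(dec(spk_emb, phones), target)
# can be used alone or combined with the speaker classification loss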
Angular Softmax Loss for End-to-end Speaker Verification
End-to-end speaker verification systems have received increasing interest.
The traditional i-vector approach trains a generative model (basically a
factor-analysis model) to extract i-vectors as speaker embeddings. In contrast,
the end-to-end approach directly trains a discriminative model (often a neural
network) to learn discriminative speaker embeddings; a crucial component is the
training criterion. In this paper, we use angular softmax (A-softmax), which
was originally proposed for face verification, as the loss function for feature
learning in end-to-end speaker verification. By introducing margins between
classes into softmax loss, A-softmax can learn more discriminative features
than softmax loss and triplet loss while remaining easy and stable to use. We
make two contributions in this work. 1) We introduce A-softmax
loss into end-to-end speaker verification and achieve significant EER
reductions. 2) We find that the combination of using A-softmax in training the
front-end and using PLDA in the back-end scoring further boosts the performance
of end-to-end systems under short utterance condition (short in both enrollment
and test). Experiments conducted on part of the dataset demonstrate the
improvements from using A-softmax.
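
A compact sketch of A-softmax, assuming L2-normalized weights and features
(so logits are cosines) and the piecewise margin psi(theta) =
(-1)^k cos(m*theta) - 2k applied to the target class; the fixed scale
(replacing the feature-norm weighting of the original formulation) and the
omitted training annealing are simplifications.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASoftmaxLoss(nn.Module):
    """Angular softmax: replace the target-class angle theta by
    psi(theta) = (-1)^k * cos(m*theta) - 2k, k = floor(m*theta / pi),
    inserting an angular margin m between classes."""
    def __init__(self, emb_dim, n_classes, m=4, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.m, self.scale = m, scale

    def forward(self, emb, labels):
        cos = F.normalize(emb, dim=1) @ F.normalize(self.weight, dim=1).t()
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        k = torch.floor(self.m * theta / math.pi)
        psi = (-1.0) ** k * torch.cos(self.m * theta) - 2.0 * k
        logits = cos.clone()
        rows = torch.arange(emb.size(0))
        logits[rows, labels] = psi[rows, labels]  # margin on target class only
        return F.cross_entropy(self.scale * logits, labels)

loss_fn = ASoftmaxLoss(emb_dim=512, n_classes=1000)
print(loss_fn(torch.randn(8, 512), torch.randint(0, 1000, (8,))).item())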