Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition
Recently, speaker embeddings extracted from a speaker-discriminative deep
neural network (DNN) have yielded better performance than conventional methods
such as i-vectors. In most cases, the DNN speaker classifier is trained using
cross entropy loss with softmax. However, this kind of loss function does not
explicitly encourage inter-class separability and intra-class compactness. As a
result, the embeddings are not optimal for speaker recognition tasks. In this
paper, to address this issue, three different margin-based losses, which not
only separate classes but also demand a fixed margin between classes, are
introduced to deep speaker embedding learning. It is demonstrated that
the margin is the key to obtaining more discriminative speaker embeddings.
Experiments are conducted on two public text-independent tasks: VoxCeleb1 and
Speakers in the Wild (SITW). The proposed approach achieves
state-of-the-art performance, with a 25% to 30% equal error rate (EER) reduction
on both tasks when compared to strong baselines using cross entropy loss with
softmax, obtaining 2.238% EER on VoxCeleb1 test set and 2.761% EER on SITW
core-core test set, respectively.
Comment: not accepted by INTERSPEECH 201
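The abstract above does not name its three margin-based losses, so the following is only a minimal sketch of one common variant, an additive cosine margin applied to the target-class logit before the softmax. The function names, the `margin` and `scale` defaults, and the use of plain Python lists are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def additive_margin_softmax_loss(embedding, class_weights, target,
                                 margin=0.2, scale=30.0):
    """Cross entropy over scaled cosine logits, with a fixed margin
    subtracted from the target-class cosine (hypothetical defaults)."""
    logits = []
    for idx, weight in enumerate(class_weights):
        cos = cosine(embedding, weight)
        if idx == target:
            cos -= margin  # the target class must win by at least `margin`
        logits.append(scale * cos)
    # Numerically stable log-sum-exp over all class logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

With `margin=0.0` this reduces to ordinary softmax cross entropy on normalized logits. A positive margin increases the loss for the same embedding, which is what pushes classes apart during training.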
Speaker Recognition Based on Deep Learning: An Overview
Speaker recognition is the task of identifying persons from their voices.
Recently, deep learning has dramatically revolutionized speaker recognition.
However, there is a lack of comprehensive reviews of this exciting progress.
In this paper, we review several major subtasks of speaker recognition,
including speaker verification, identification, diarization, and robust speaker
recognition, with a focus on deep-learning-based methods. Because the major
advantage of deep learning over conventional methods is its representation
ability, which is able to produce highly abstract embedding features from
utterances, we first pay close attention to deep-learning-based speaker feature
extraction, including the inputs, network structures, temporal pooling
strategies, and objective functions, which are the fundamental
components of many speaker recognition subtasks. Then, we give an overview of
speaker diarization, with an emphasis on recent supervised, end-to-end, and
online diarization. Finally, we survey robust speaker recognition from the
perspectives of domain adaptation and speech enhancement, which are two major
approaches to dealing with domain mismatch and noise problems. Popular and
recently released corpora are listed at the end of the paper.
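The overview abstract above mentions temporal pooling as the step that turns frame-level features into a single utterance-level embedding, without saying which strategy is used. As a rough illustration only, the sketch below assumes statistics pooling (per-dimension mean and standard deviation over frames, concatenated), which is one widely used choice. The function name and list-based input format are hypothetical.

```python
import math


def statistics_pooling(frames):
    """Pool a variable-length list of frame vectors into one fixed-length
    vector: per-dimension means followed by per-dimension standard deviations."""
    if not frames:
        raise ValueError("at least one frame is required")
    count = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    # Population standard deviation (divide by the number of frames).
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / count)
        for d in range(dim)
    ]
    return means + stds
```

The output length is twice the frame dimension regardless of how many frames the utterance has, which is what lets a fixed-size classifier or scoring back end follow the pooling step.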