Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
In this paper, we explore the encoding/pooling layer and loss function in the
end-to-end speaker and language recognition system. First, a unified and
interpretable end-to-end system for both speaker and language recognition is
developed. It accepts variable-length input and produces an utterance level
result. In the end-to-end system, the encoding layer plays a role in
aggregating the variable-length input sequence into an utterance level
representation. Besides the basic temporal average pooling, we introduce a
self-attentive pooling layer and a learnable dictionary encoding layer to get
the utterance level representation. In terms of the loss function for open-set
speaker verification, center loss and angular softmax loss are introduced in
the end-to-end system to obtain more discriminative speaker embeddings.
Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the
performance of the end-to-end learning system can be significantly improved by
the proposed encoding layers and loss functions.

Comment: Accepted for Speaker Odyssey 201
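The self-attentive pooling described above can be sketched in plain Python (a minimal illustration, not the authors' implementation; the frame features and the scoring vector `score_w`, which would normally be learned, are made-up toy values):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scalar scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attentive_pooling(frames, score_w):
    """Aggregate a variable-length sequence of frame vectors into one
    utterance-level vector: each frame gets a scalar score from the
    scoring vector, softmax turns scores into attention weights, and the
    output is the attention-weighted average of the frames."""
    scores = [sum(w * x for w, x in zip(score_w, f)) for f in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]

# Toy example: three 2-dimensional frames, hypothetical scoring vector.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score_w = [0.5, 0.5]
utt = self_attentive_pooling(frames, score_w)
print(utt)
```

Unlike temporal average pooling, which weights every frame equally, the attention weights let the network emphasize the frames most informative for the speaker or language decision.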
Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification
In this paper, we propose a new pooling method called spatial pyramid
encoding (SPE) to generate speaker embeddings for text-independent speaker
verification. We first partition the output feature maps from a deep residual
network (ResNet) into increasingly fine sub-regions and extract speaker
embeddings from each sub-region through a learnable dictionary encoding layer.
These embeddings are concatenated to obtain the final speaker representation.
The SPE layer not only generates a fixed-dimensional speaker embedding for a
variable-length speech segment, but also aggregates the information of feature
distribution from multi-level temporal bins. Furthermore, we apply deep length
normalization by augmenting the loss function with ring loss. By applying ring
loss, the network gradually learns to normalize the speaker embeddings using
model weights themselves while preserving convexity, leading to more robust
speaker embeddings. Experiments on the VoxCeleb1 dataset show that the proposed
system using the SPE layer and ring loss-based deep length normalization
outperforms both i-vector and d-vector baselines.

Comment: 5 pages, 2 figures, Interspeech 201
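The pyramid partitioning above can be sketched in plain Python (a toy illustration under simplifying assumptions: the paper pools each sub-region with a learnable dictionary encoding layer, whereas mean pooling stands in here to keep the sketch self-contained, and the pyramid levels are assumed to be 1, 2, and 4 temporal bins):

```python
def mean_pool(frames):
    # Average a list of equal-length frame vectors into one vector.
    dim = len(frames[0])
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def spatial_pyramid_encode(frames, levels=(1, 2, 4)):
    """Partition the frame sequence into increasingly fine temporal bins,
    pool one embedding per bin, and concatenate them, so a variable-length
    input always yields a fixed-dimensional representation."""
    embedding = []
    n = len(frames)
    for level in levels:
        for b in range(level):
            lo = b * n // level
            hi = (b + 1) * n // level
            embedding.extend(mean_pool(frames[lo:hi]))
    return embedding

frames = [[float(t), 1.0] for t in range(8)]  # 8 frames, 2-dim features
emb = spatial_pyramid_encode(frames)
print(len(emb))  # (1 + 2 + 4) bins x 2 dims = 14
```

The output dimension depends only on the pyramid levels and feature dimension, never on the utterance length, which is what makes the embedding fixed-size.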
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach for speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness when dealing with utterances of arbitrary
duration, this paper improves MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves previous MSA methods with a smaller number of
parameters. It also achieves better performance than state-of-the-art
approaches for both short and long utterances.

Comment: Accepted to Interspeech 202
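The top-down pathway with lateral connections can be sketched in plain Python (a toy illustration, not the paper's CNN implementation; it assumes each deeper layer halves the temporal resolution, nearest-neighbor upsampling, and identity lateral projections in place of learned 1x1 convolutions):

```python
def upsample(seq, factor):
    # Nearest-neighbor upsampling along the temporal axis.
    return [v for v in seq for _ in range(factor)]

def add_vectors(a, b):
    # Element-wise sum of two equal-shape sequences of frame vectors.
    return [[x + y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]

def feature_pyramid(features):
    """Enhance multi-scale features with a top-down pathway and lateral
    connections: start from the coarsest (deepest) feature map, upsample
    it, and add the lateral feature from the next finer layer, so every
    level carries speaker-discriminative information from deeper layers.
    `features` is ordered fine -> coarse."""
    top = features[-1]
    enhanced = [top]                        # coarsest layer passes through
    for lateral in reversed(features[:-1]):
        top = add_vectors(upsample(top, 2), lateral)
        enhanced.append(top)
    return list(reversed(enhanced))         # back to fine -> coarse order

fine = [[1.0]] * 4    # shallow layer: 4 frames, 1-dim features
coarse = [[2.0]] * 2  # deep layer: 2 frames, 1-dim features
out = feature_pyramid([fine, coarse])
print(out[0])  # fine level enriched by the upsampled coarse level
```

Speaker embeddings are then pooled from these enhanced maps, so each time scale contributes features that already mix shallow detail with deep semantics.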