Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
In this paper, we explore the encoding/pooling layer and loss function in the
end-to-end speaker and language recognition system. First, a unified and
interpretable end-to-end system for both speaker and language recognition is
developed. It accepts variable-length input and produces an utterance level
result. In the end-to-end system, the encoding layer plays a role in
aggregating the variable-length input sequence into an utterance level
representation. Besides the basic temporal average pooling, we introduce a
self-attentive pooling layer and a learnable dictionary encoding layer to get
the utterance level representation. In terms of loss functions for open-set
speaker verification, center loss and angular softmax loss are introduced into
the end-to-end system to obtain more discriminative speaker embeddings.
Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the
performance of the end-to-end system can be significantly improved by the
proposed encoding layers and loss functions.
Comment: Accepted for Speaker Odyssey 201
An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification
Wav2vec2 has achieved success in applying Transformer architecture and
self-supervised learning to speech recognition. Recently, these have come to be
used not only for speech recognition but also for the entire speech processing.
This paper introduces an effective end-to-end speaker identification model
built on a Transformer-based contextual model. We explored the relationship
between the hyper-parameters and the performance in order to discern the
structure of an effective model. Furthermore, we propose a pooling method,
Temporal Gate Pooling, with powerful learning ability for speaker
identification. We applied Conformer as the encoder and BEST-RQ for
pre-training, and evaluated on the speaker identification task of VoxCeleb1.
The proposed method has achieved an accuracy of 87.1% with 28.5M parameters,
demonstrating accuracy comparable to wav2vec2 with 317.7M parameters. Code is
available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
Comment: 5 pages, 3 figures
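The abstract does not detail the exact formulation of Temporal Gate Pooling, so the following is only a generic gated-pooling sketch of the idea, not the authors' method: each frame gets a learned scalar gate in (0, 1), and the pooled output is the gate-weighted mean, letting the model suppress uninformative frames. The gate parameterization here is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_temporal_pooling(frames, W_g, b_g, eps=1e-8):
    """Gate-weighted average of frame features: (T, D) -> (D,).

    Each frame receives a scalar gate; the output is the gated mean,
    so frames with low gates contribute little to the pooled vector.
    """
    gates = sigmoid(frames @ W_g + b_g)       # (T,) one gate per frame
    return (gates[:, None] * frames).sum(axis=0) / (gates.sum() + eps)

rng = np.random.default_rng(1)
frames = rng.standard_normal((40, 12))        # 40 frames, 12-dim features
W_g = rng.standard_normal(12) * 0.1
b_g = 0.0
pooled = gated_temporal_pooling(frames, W_g, b_g)
```

Like attention pooling, this accepts any sequence length and produces a fixed-size utterance vector; the released code linked above contains the actual implementation.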
Utterance-level Aggregation For Speaker Recognition In The Wild
The objective of this paper is speaker recognition "in the wild", where
utterances may be of variable length and may also contain irrelevant signals.
Crucial elements in the design of deep networks for this task are the type of
trunk (frame level) network, and the method of temporal aggregation. We propose
a powerful speaker recognition deep network, using a "thin-ResNet" trunk
architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate
features across time, that can be trained end-to-end. We show that our network
achieves state of the art performance by a significant margin on the VoxCeleb1
test set for speaker recognition, whilst requiring fewer parameters than
previous methods. We also investigate the effect of utterance length on
performance, and conclude that for "in the wild" data, a longer length is
beneficial.
Comment: To appear in: International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2019. (Oral Presentation)
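The dictionary-based aggregation above can be sketched with a minimal NumPy NetVLAD: frames are softly assigned to a small dictionary of cluster centers, and the per-cluster residual sums are concatenated and normalized into a fixed-size descriptor. Note this sketch uses distance-based soft assignment; the paper's trainable layer (and its GhostVLAD variant, which discards "ghost" clusters) learns the assignment jointly with the trunk network, and the cluster count and scale below are illustrative assumptions.

```python
import numpy as np

def netvlad(frames, centers, alpha=10.0):
    """NetVLAD-style aggregation: (T, D) frames + (K, D) centers -> (K*D,).

    Soft-assign each frame to clusters, sum residuals per cluster,
    then intra-normalize and L2-normalize the flattened descriptor.
    """
    # soft assignment: softmax over scaled negative squared distances
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (T, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)                     # stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                               # (T, K)
    residuals = frames[:, None, :] - centers[None, :, :]            # (T, K, D)
    vlad = (a[:, :, None] * residuals).sum(axis=0)                  # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12     # per cluster
    out = vlad.ravel()
    return out / (np.linalg.norm(out) + 1e-12)                      # unit norm

rng = np.random.default_rng(2)
frames = rng.standard_normal((30, 8))     # 30 frames of 8-dim trunk features
centers = rng.standard_normal((4, 8))     # K = 4 dictionary clusters
vlad_vec = netvlad(frames, centers)       # fixed 32-dim utterance descriptor
```

The output dimension K*D is independent of the number of frames, which is why a dictionary layer of this kind handles the variable-length "in the wild" utterances the abstract describes.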
- …