Current speaker verification techniques rely on a neural network to extract
speaker representations. The successful x-vector architecture is a Time Delay
Neural Network (TDNN) that applies statistics pooling to project
variable-length utterances into fixed-length speaker characterizing embeddings.
In this paper, we propose multiple enhancements to this architecture based on
recent trends in the related fields of face verification and computer vision.
Firstly, the initial frame layers can be restructured into 1-dimensional
Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we
introduce Squeeze-and-Excitation blocks in these modules to explicitly model
channel interdependencies. The SE block expands the temporal context of the
frame layer by rescaling the channels according to global properties of the
recording. Secondly, neural networks are known to learn hierarchical features,
with each layer operating on a different level of complexity. To leverage this
complementary information, we aggregate and propagate features of different
hierarchical levels. Finally, we improve the statistics pooling module with
channel-dependent frame attention. This enables the network to focus on
different subsets of frames during each of the channel's statistics estimation.
The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art
TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker
Recognition Challenge.Comment: proceedings of INTERSPEECH 202