Dual-stream Time-Delay Neural Network with Dynamic Global Filter for Speaker Verification
The time-delay neural network (TDNN) is one of the state-of-the-art models
for text-independent speaker verification. However, it is difficult for
conventional TDNN to capture global context that has been proven critical for
robust speaker representations and long-duration speaker verification in many
recent works. Besides, common solutions such as self-attention have
quadratic complexity in the number of input tokens, which makes them computationally
unaffordable when applied to the large feature maps in TDNN. To
address these issues, we propose the Global Filter for TDNN, which applies
log-linear complexity FFT/IFFT and a set of differentiable frequency-domain
filters to efficiently model the long-term dependencies in speech. In addition, a
dynamic filtering strategy and a sparse regularization method are specially
designed to enhance the performance of the global filter and prevent it from
overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which
splits the basic channels for complexity reduction and employs the global
filter to increase recognition performance. Experiments on the VoxCeleb and SITW
databases show that the DS-TDNN achieves an improvement of approximately 10% while
reducing complexity and parameters by over 28% and 15%, respectively, compared with
ECAPA-TDNN. Moreover, it has the best trade-off between efficiency and
effectiveness compared with other popular baseline systems when facing
long-duration speech. Finally, visualizations and a detailed ablation study
further reveal the advantages of the DS-TDNN.
Comment: 13 pages, 4 figures
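
The global filter idea described above (FFT, element-wise multiplication by frequency-domain filters, then IFFT) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fft` and `global_filter` are hypothetical, the filter weights are fixed by hand rather than learned or generated dynamically, and only a single channel is filtered along the time axis. A hand-written radix-2 FFT is used so that the log-linear cost is visible without external libraries.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT (unnormalized).

    Assumes len(x) is a power of two. Cost is O(n log n).
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def global_filter(frames, freq_filter):
    """Filter one channel over the whole time axis.

    frames      -- sequence of n values (one feature channel over time)
    freq_filter -- n weights, one per frequency bin (hypothetical stand-in
                   for the differentiable filters named in the abstract)
    """
    n = len(frames)
    spectrum = fft(frames)
    # Each frequency bin depends on every time step, so a single
    # element-wise product mixes information across the full utterance.
    filtered = [s * w for s, w in zip(spectrum, freq_filter)]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v / n for v in fft(filtered, inverse=True)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    keep_dc_only = [1.0] + [0.0] * 7  # pass only the zero-frequency bin
    y = global_filter(x, keep_dc_only)
    # Keeping only the zero-frequency bin replaces every frame by the
    # mean of the whole sequence.
    print([round(v.real, 6) for v in y])
```

The output is complex in general; it is real (up to rounding) only when the filter weights are conjugate-symmetric across frequency bins, as they are in the example.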