Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach for speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness when dealing with utterances of arbitrary
duration, this paper improves MSA with a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves on previous MSA methods while using fewer
parameters. It also achieves better performance than state-of-the-art
approaches for both short and long utterances.
Comment: Accepted to Interspeech 2020
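As a rough illustration of the mechanism this abstract describes, here is a minimal PyTorch sketch of a feature pyramid module applied to multi-scale features; the stage channel widths, nearest-neighbor upsampling, and mean pooling are assumptions of the sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidMSA(nn.Module):
    """Sketch of a feature pyramid module for multi-scale aggregation:
    lateral 1x1 convolutions project features from several extractor
    stages to a common width, and a top-down pathway adds upsampled
    deeper features into shallower ones before pooling."""

    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        # One lateral connection per extractor stage (widths assumed).
        self.laterals = nn.ModuleList(
            [nn.Conv1d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats):
        # feats: list of (batch, channels, time) maps, shallow -> deep,
        # with coarser time resolution at deeper stages (an assumption).
        maps = [lateral(f) for lateral, f in zip(self.laterals, feats)]
        # Top-down pathway: propagate deep, speaker-discriminative
        # information into the higher-resolution shallow maps.
        for i in range(len(maps) - 2, -1, -1):
            maps[i] = maps[i] + F.interpolate(
                maps[i + 1], size=maps[i].shape[-1], mode="nearest"
            )
        # Pool each enhanced map over time and concatenate the results,
        # yielding speaker information at different time scales.
        return torch.cat([m.mean(dim=-1) for m in maps], dim=-1)
```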
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
Current speaker verification techniques rely on a neural network to extract
speaker representations. The successful x-vector architecture is a Time Delay
Neural Network (TDNN) that applies statistics pooling to project
variable-length utterances into fixed-length speaker characterizing embeddings.
In this paper, we propose multiple enhancements to this architecture based on
recent trends in the related fields of face verification and computer vision.
Firstly, the initial frame layers can be restructured into 1-dimensional
Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we
introduce Squeeze-and-Excitation blocks in these modules to explicitly model
channel interdependencies. The SE block expands the temporal context of the
frame layer by rescaling the channels according to global properties of the
recording. Secondly, neural networks are known to learn hierarchical features,
with each layer operating on a different level of complexity. To leverage this
complementary information, we aggregate and propagate features of different
hierarchical levels. Finally, we improve the statistics pooling module with
channel-dependent frame attention. This enables the network to focus on
different subsets of frames during each channel's statistics estimation.
The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art
TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker
Recognition Challenge.
Comment: In proceedings of INTERSPEECH 2020
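To make the pooling change concrete, here is a minimal PyTorch sketch of attentive statistics pooling with channel-dependent attention; the bottleneck width is an assumption, and the paper's global-context inputs to the attention network are omitted for brevity.

```python
import torch
import torch.nn as nn

class ChannelDependentAttentiveStatsPool(nn.Module):
    """Sketch of attentive statistics pooling with channel-dependent
    attention: a small bottleneck network scores every (channel, frame)
    cell, and the softmax-normalized scores weight the per-channel mean
    and standard deviation of the frame-level features."""

    def __init__(self, channels=512, bottleneck=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            # A score per channel and frame, so each channel can attend
            # to a different subset of frames.
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, channels, time) frame-level features.
        alpha = torch.softmax(self.attention(x), dim=-1)  # weights over time
        mean = torch.sum(alpha * x, dim=-1)
        var = torch.sum(alpha * x * x, dim=-1) - mean.pow(2)
        std = torch.sqrt(var.clamp(min=1e-8))
        # Fixed-length characterization: weighted mean and std per channel.
        return torch.cat([mean, std], dim=-1)
```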
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification
In this paper, we present Multi-scale Feature Aggregation Conformer
(MFA-Conformer), a simple, easy-to-implement yet effective backbone for
automatic speaker verification based on the Convolution-augmented Transformer
(Conformer). The architecture of the MFA-Conformer is inspired by recent
state-of-the-art models in speech recognition and speaker verification.
Firstly, we introduce a convolution sub-sampling layer to decrease the
computational cost of the model. Secondly, we adopt Conformer blocks which
combine Transformers and convolution neural networks (CNNs) to capture global
and local features effectively. Finally, the output feature maps from all
Conformer blocks are concatenated to aggregate multi-scale representations
before final pooling. We evaluate the MFA-Conformer on the widely used
benchmarks. The best system obtains 0.64%, 1.29% and 1.63% EER on VoxCeleb1-O,
SITW.Dev, and SITW.Eval sets, respectively. MFA-Conformer significantly
outperforms the popular ECAPA-TDNN systems in both recognition performance and
inference speed. Last but not least, the ablation studies clearly demonstrate
that combining global and local feature learning leads to robust and accurate
speaker embedding extraction. We will release the code to facilitate future
comparisons.
Comment: Submitted to INTERSPEECH 2022
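As a rough illustration of the aggregation pattern, here is a minimal PyTorch sketch; the stand-in Transformer encoder layer (used in place of a real Conformer block), the omitted convolution sub-sampling layer, and the mean pooling are all assumptions of the sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MFABackbone(nn.Module):
    """Sketch of multi-scale feature aggregation: keep the output of
    every encoder block, concatenate the outputs along the feature
    axis, and pool over time to form the utterance representation."""

    def __init__(self, dim=256, num_blocks=6, block_factory=None):
        super().__init__()
        # A generic Transformer encoder layer stands in for a Conformer
        # block here; swap in a real Conformer implementation as needed.
        make = block_factory or (lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True))
        self.blocks = nn.ModuleList([make() for _ in range(num_blocks)])
        self.norm = nn.LayerNorm(dim * num_blocks)

    def forward(self, x):
        # x: (batch, time, dim) features after a convolution
        # sub-sampling front end (omitted from this sketch).
        outputs = []
        for block in self.blocks:
            x = block(x)
            outputs.append(x)
        # Aggregate the multi-scale representations from all blocks.
        mfa = self.norm(torch.cat(outputs, dim=-1))
        return mfa.mean(dim=1)  # simple mean pooling over time
```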