An Improved Deep Embedding Learning Method for Short Duration Speaker Verification
This paper presents an improved deep embedding learning method based on convolutional neural networks (CNNs) for short-duration speaker verification (SV). Existing deep learning-based SV methods generally extract frontend embeddings from a feed-forward deep neural network, in which long-term speaker characteristics are captured via a pooling operation over the input speech. The extracted embeddings are then scored by a backend model, such as Probabilistic Linear Discriminant Analysis (PLDA).
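The abstract does not specify the pooling operation, but a common choice in embedding-based SV systems is temporal statistics pooling, which collapses a variable-length sequence of frame-level features into one fixed-size utterance-level vector. A minimal sketch, assuming mean-plus-standard-deviation pooling (the function name `stats_pool` is illustrative, not from the paper):

```python
import math

def stats_pool(frames):
    """Temporal statistics pooling: map a variable-length list of
    frame-level feature vectors to one fixed-size vector by
    concatenating the per-dimension mean and standard deviation."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
           for d in range(dim)]
    return mean + std

# Utterances of any length map to a vector of size 2 * dim.
emb = stats_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# mean = [3.0, 4.0], std = [~1.633, ~1.633]
```

In a real system the pooled vector would pass through further layers before being scored by the backend model; this sketch only shows how pooling removes the dependence on utterance duration.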
Two improvements are proposed for frontend embedding learning based on the CNN structure: (1) motivated by WaveNet for speech synthesis, dilated filters are designed to achieve a tradeoff between computational efficiency and receptive-field size; and (2) a novel cross-convolutional-layer pooling method is exploited to capture high-order statistics for modelling long-term speaker characteristics. Specifically, the activations of one convolutional layer are aggregated under the guidance of the feature maps from the successive layer. To evaluate the effectiveness of the proposed methods, extensive experiments are conducted on the modified female portion of the NIST SRE 2010 evaluations, with conditions ranging from 10s-10s to 5s-4s. Excellent performance is achieved on each evaluation condition, significantly outperforming existing SV systems using i-vector and d-vector embeddings.
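The efficiency/receptive-field tradeoff behind dilated filters can be made concrete with the standard receptive-field formula for a stack of 1-D convolutions. This is a generic illustration of the WaveNet-style dilation pattern, not the paper's specific architecture:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in input frames) of a stack of 1-D convolutions,
    where layer i has kernel size k_i and dilation d_i:
        RF = 1 + sum_i (k_i - 1) * d_i
    Doubling the dilation at each layer grows the receptive field
    exponentially with depth, while the parameter count grows only
    linearly (each layer still has k_i weights per channel)."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# Four layers of kernel size 3 with dilations 1, 2, 4, 8 (WaveNet-style):
rf_dilated = receptive_field([3] * 4, [1, 2, 4, 8])   # 31 frames
# The same four layers without dilation:
rf_plain = receptive_field([3] * 4, [1] * 4)          # 9 frames
```

With the same number of parameters, the dilated stack sees more than three times as much temporal context, which is the tradeoff the abstract refers to.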
Speaker verification using attentive multi-scale convolutional recurrent network
In this paper, we propose a speaker verification method based on an Attentive Multi-scale Convolutional Recurrent Network (AMCRN). The proposed AMCRN acquires both local spatial information and global sequential information from the input speech recordings. In the proposed method, the logarithmic Mel spectrum is extracted from each speech recording and fed to the AMCRN to learn a speaker embedding. The learned speaker embedding is then fed to a back-end classifier (such as a cosine similarity metric) for scoring in the testing stage. The proposed method is compared with state-of-the-art speaker verification methods. Experimental data are three public datasets selected from two large-scale speech corpora (VoxCeleb1 and VoxCeleb2). Experimental results show that our method outperforms the baseline methods in terms of equal error rate and minimum detection cost function, and has advantages over most of the baseline methods in terms of computational complexity and memory requirements. In addition, our method generalizes well across truncated speech segments of different durations, and the speaker embedding learned by the proposed AMCRN generalizes better across the two back-end classifiers.

Comment: 21 pages, 6 figures, 8 tables. Accepted for publication in Applied Soft Computing.
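The cosine similarity back-end mentioned in the abstract is simple enough to sketch: a trial is scored by the angle between the enrollment and test embeddings, and the score is compared against a threshold to accept or reject. A minimal illustration (not the paper's code):

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine-similarity backend: score a verification trial by the
    cosine of the angle between the enrollment embedding and the test
    embedding. Scores close to 1.0 suggest the same speaker; the
    accept/reject decision compares the score to a tuned threshold."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

# Embeddings pointing in the same direction score 1.0;
# orthogonal embeddings score 0.0.
same = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # 1.0
diff = cosine_score([1.0, 0.0], [0.0, 1.0])             # 0.0
```

Because the score depends only on direction, this backend needs no training beyond the embedding network itself, which is one reason it is a common lightweight alternative to PLDA scoring.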