131 research outputs found
Attention-Based Models for Text-Dependent Speaker Verification
Attention-based models have recently shown great performance on a range of
tasks, such as speech recognition, machine translation, and image captioning
due to their ability to summarize relevant information that expands through the
entire length of an input sequence. In this paper, we analyze the usage of
attention mechanisms to the problem of sequence summarization in our end-to-end
text-dependent speaker recognition system. We explore different topologies and
their variants of the attention layer, and compare different pooling methods on
the attention weights. Ultimately, we show that attention-based models can
improves the Equal Error Rate (EER) of our speaker verification system by
relatively 14% compared to our non-attention LSTM baseline model.Comment: Submitted to ICASSP 201
Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
signficant margin.Comment: Under submission as a conference pape
- …