131 research outputs found

Attention-Based Models for Text-Dependent Speaker Verification

Authors: Chowdhury F A Rezaur Rahman; Moreno Ignacio Lopez; Wan Li; Wang Quan
Publication date: 31/01/2018

Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that extends across the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies of the attention layer and their variants, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.

Comment: Submitted to ICASSP 201

arXiv.org e-Print Archive

Crossref

Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

Authors: Chung Joon Son; Chung Soo-Whan; Kang Hong Goo
Publication venue: 'International Speech Communication Association'
Publication date: 06/05/2020

The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.

Comment: Under submission as a conference paper

arXiv.org e-Print Archive

Crossref