Seeing voices and hearing voices: learning discriminative embeddings
  using cross-modal self-supervision

Chung, Joon Son; Chung, Soo-Whan; Kang, Hong Goo



Authors: Joon Son Chung, Soo-Whan Chung, Hong Goo Kang
Publication date: 6 May 2020
Publisher: International Speech Communication Association

Abstract

The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.

Comment: Under submission as a conference paper
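
The training strategy outlined in the abstract combines a cross-modal objective with a separation objective inside each modality. The sketch below is only one possible reading of that idea, not the authors' implementation: the function names, the batch-wise softmax matching term, the hinge-style separation term, and the `scale`, `margin`, and `weight` parameters are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def normalise(u):
    """Scale a vector to unit length so that dot() gives cosine similarity."""
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]


def cross_entropy_row(scores, target):
    """Numerically stable -log(softmax(scores)[target])."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]


def cross_modal_loss(audio, video, scale=10.0):
    """Assumed cross-modal term: each video embedding must select its own
    paired audio embedding among all audio embeddings in the batch."""
    n = len(audio)
    total = 0.0
    for i in range(n):
        scores = [scale * dot(video[i], audio[j]) for j in range(n)]
        total += cross_entropy_row(scores, i)
    return total / n


def intra_modal_loss(feats, margin=0.0):
    """Assumed intra-modal term: penalise similarity above `margin` between
    embeddings of different items within a single modality."""
    n = len(feats)
    pairs = n * (n - 1)
    if pairs == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += max(0.0, dot(feats[i], feats[j]) - margin)
    return total / pairs


def combined_loss(audio, video, weight=1.0):
    """Cross-modal matching plus a weighted separation term per modality."""
    return cross_modal_loss(audio, video) + weight * (
        intra_modal_loss(audio) + intra_modal_loss(video)
    )


# Toy batch of two paired, unit-length embeddings (made-up values).
audio = [normalise([1.0, 0.0]), normalise([0.0, 1.0])]
video = [normalise([1.0, 0.0]), normalise([0.0, 1.0])]
matched = combined_loss(audio, video)
mismatched = combined_loss(audio, video[::-1])
```

On this toy batch the correctly paired embeddings give a lower loss than the swapped pairing, and the separation term is zero because the two items in each modality are orthogonal. In a real system the embeddings would come from trained audio and video encoders rather than hand-written vectors.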


Last time updated on 11/08/2021