Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity-sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 2018
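The abstract describes the method only in prose, so a minimal sketch of an identity-sensitive cross-modal embedding with a hard-negative curriculum may help make it concrete. This is illustrative only, assuming PyTorch; the encoder backbones, margin, batch shapes, and the curriculum schedule are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedder(nn.Module):
    """Projects face and voice inputs into a shared, L2-normalised embedding space."""
    def __init__(self, dim=256):
        super().__init__()
        # Placeholder backbones; the paper uses CNNs over face crops and spectrograms.
        self.face_net = nn.LazyLinear(dim)
        self.voice_net = nn.LazyLinear(dim)

    def forward(self, face, voice):
        f = F.normalize(self.face_net(face), dim=-1)
        v = F.normalize(self.voice_net(voice), dim=-1)
        return f, v

def curriculum_contrastive_loss(f, v, hard_fraction, margin=0.6):
    """Matched face/voice pairs (same talking-face video) are positives; other items
    in the batch are negatives.  `hard_fraction` controls how many of the hardest
    (most similar) negatives contribute, and grows over training."""
    sim = f @ v.t()                                        # (B, B) cosine similarities
    pos = sim.diag()                                       # matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float("-inf"))
    k = max(1, int(hard_fraction * (sim.size(0) - 1)))
    hard_neg, _ = neg.topk(k, dim=1)                       # hardest k negatives per row
    return F.relu(margin + hard_neg - pos.unsqueeze(1)).mean()

# Curriculum schedule: begin with few hard negatives, end with mostly hard ones.
embedder = CrossModalEmbedder()
for epoch in range(10):
    hard_fraction = min(1.0, 0.1 * (epoch + 1))
    faces = torch.randn(32, 512)        # dummy pre-extracted face features
    voices = torch.randn(32, 512)       # dummy pre-extracted voice features
    f, v = embedder(faces, voices)
    curriculum_contrastive_loss(f, v, hard_fraction).backward()
```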
ASR is all you need: cross-modal distillation for lip reading
The goal of this work is to train strong models for visual speech recognition
without requiring human annotated ground truth data. We achieve this by
distilling from an Automatic Speech Recognition (ASR) model that has been
trained on a large-scale audio-only corpus. We use a cross-modal distillation
method that combines Connectionist Temporal Classification (CTC) with a
frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that
ground truth transcriptions are not necessary to train a lip reading system;
(ii) we show how arbitrary amounts of unlabelled video data can be leveraged to
improve performance; (iii) we demonstrate that distillation significantly
speeds up training; and, (iv) we obtain state-of-the-art results on the
challenging LRS2 and LRS3 datasets when training only on publicly available data.
Comment: ICASSP 2020
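Since the abstract spells out the combination of CTC with a frame-wise cross-entropy term, a short sketch of that combined loss may be useful. It is an illustrative PyTorch version under assumed tensor shapes and an assumed weighting factor alpha, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_posteriors, pseudo_targets,
                                  input_lengths, target_lengths, alpha=0.5):
    """student_logits: (T, B, V) raw scores from the lip-reading student.
    teacher_posteriors: (T, B, V) frame-wise distributions from the ASR teacher.
    pseudo_targets: (B, S) token ids decoded from the teacher (no human labels)."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    # CTC against the teacher's decoded transcription, used as pseudo ground truth.
    ctc = F.ctc_loss(log_probs, pseudo_targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Frame-wise cross-entropy (KL) against the teacher's soft per-frame posteriors.
    frame_ce = F.kl_div(log_probs, teacher_posteriors, reduction="batchmean")
    return alpha * ctc + (1.0 - alpha) * frame_ce

# Dummy shapes: 50 video frames, batch of 4, 30-token vocabulary, 12-token targets.
T, B, V, S = 50, 4, 30, 12
student_logits = torch.randn(T, B, V, requires_grad=True)
teacher_posteriors = torch.softmax(torch.randn(T, B, V), dim=-1)
pseudo_targets = torch.randint(1, V, (B, S))
loss = cross_modal_distillation_loss(
    student_logits, teacher_posteriors, pseudo_targets,
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), S, dtype=torch.long))
loss.backward()
```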
Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings
Speech data has rich acoustic and paralinguistic information with important
cues for understanding a speaker's tone, emotion, and intent, yet traditional
large language models such as BERT do not incorporate this information. There
has been an increased interest in multi-modal language models leveraging audio
and/or visual information and text. However, current multi-modal language
models require both text and audio/visual data streams during inference/test
time. In this work, we propose a methodology for training language models
leveraging spoken language audio data but without requiring the audio stream
during prediction time. This leads to an improved language model for analyzing
spoken transcripts while avoiding audio processing overhead at test time. We
achieve this via an audio-language knowledge distillation framework, where we
transfer acoustic and paralinguistic information from a pre-trained speech
embedding (OpenAI Whisper) teacher model to help train a student language model
on an audio-text dataset. In our experiments, the student model achieves
consistent improvement over traditional language models on tasks analyzing
spoken transcripts.
Comment: 11 pages
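A brief sketch of the training-time-only audio distillation idea: the text student sees only the transcript, while a frozen Whisper utterance embedding of the matching audio acts as a teacher signal. The student class is a stand-in rather than the paper's model, and the pooling, projection head, and 0.5 weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranscriptStudent(nn.Module):
    """Stand-in for a BERT-style encoder over spoken transcripts."""
    def __init__(self, vocab_size=30522, dim=768, teacher_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(dim, teacher_dim)   # text space -> speech-teacher space

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        pooled = hidden.mean(dim=1)               # utterance-level text representation
        return pooled, self.proj(pooled)

def distillation_step(student, token_ids, whisper_embedding, task_loss, weight=0.5):
    """whisper_embedding: frozen utterance embedding from the speech teacher,
    available only during training; at test time only `token_ids` are needed."""
    _, projected = student(token_ids)
    distill = 1.0 - F.cosine_similarity(projected, whisper_embedding.detach(), dim=-1).mean()
    return task_loss + weight * distill

# Dummy usage: batch of 2 transcripts, 16 tokens each, 768-d teacher embeddings.
student = TranscriptStudent()
tokens = torch.randint(0, 30522, (2, 16))
teacher_emb = torch.randn(2, 768)
loss = distillation_step(student, tokens, teacher_emb, task_loss=torch.tensor(0.0))
loss.backward()
```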
Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
Millimeter wave (mmWave) based speech recognition opens up new possibilities
for audio-related applications, such as conference speech transcription and
eavesdropping. However, for practical deployment in real scenarios, latency
and recognizable vocabulary size are two critical factors that cannot be
overlooked. In this paper, we propose Radio2Text, the first mmWave-based system
for streaming automatic speech recognition (ASR) with a vocabulary size
exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer
that is capable of effectively learning representations of speech-related
features, paving the way for streaming ASR with a large vocabulary. To
alleviate the limitation that streaming networks cannot access the entire
future input, we propose Guidance Initialization, which facilitates the
transfer of feature knowledge related to the global context from the
non-streaming Transformer to the tailored streaming Transformer through weight
inheritance.
Further, we propose a cross-modal structure based on knowledge distillation
(KD), named cross-modal KD, to mitigate the negative effect of low-quality
mmWave signals on recognition performance. In the cross-modal KD, the audio
streaming Transformer provides feature and response guidance, rich in accurate
speech information, to supervise the training of the tailored radio streaming
Transformer. The experimental results show that our
Radio2Text can achieve a character error rate of 5.7% and a word error rate of
9.4% for the recognition of a vocabulary consisting of over 13,000 words.
Comment: Accepted by Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (ACM IMWUT/UbiComp 2023)
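To make the two transfer mechanisms concrete, here is a hedged sketch: weight inheritance copies matching parameters from a non-streaming Transformer into the streaming one, and the cross-modal KD term combines feature guidance (hidden states) with response guidance (softened output distributions) from the audio teacher. Parameter handling, the temperature, and the loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

def guidance_initialization(streaming_student, nonstreaming_teacher):
    """Weight inheritance: copy every parameter whose name and shape match from the
    non-streaming Transformer into the streaming one before training starts."""
    teacher_state = nonstreaming_teacher.state_dict()
    student_state = streaming_student.state_dict()
    inherited = {k: v for k, v in teacher_state.items()
                 if k in student_state and v.shape == student_state[k].shape}
    student_state.update(inherited)
    streaming_student.load_state_dict(student_state)

def cross_modal_kd_loss(radio_feats, radio_logits, audio_feats, audio_logits,
                        temperature=2.0, w_feat=1.0, w_resp=1.0):
    """Feature guidance matches intermediate representations; response guidance
    matches softened output distributions of the audio streaming Transformer."""
    feat_loss = F.mse_loss(radio_feats, audio_feats.detach())
    resp_loss = F.kl_div(
        F.log_softmax(radio_logits / temperature, dim=-1),
        F.softmax(audio_logits.detach() / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    return w_feat * feat_loss + w_resp * resp_loss
```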
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with
the aid of extra visual information such as lip videos, and has been shown to
be more effective than audio-only speech enhancement. This paper proposes the
incorporation of ultrasound tongue images to improve the performance of
lip-based AV-SE systems further. To address the challenge of acquiring
ultrasound tongue images during inference, we first propose to employ knowledge
distillation during training to investigate the feasibility of leveraging
tongue-related information without directly inputting ultrasound tongue images.
Specifically, we guide an audio-lip speech enhancement student model to learn
from a pre-trained audio-lip-tongue speech enhancement teacher model, thus
transferring tongue-related knowledge. To better model the alignment between
the lip and tongue modalities, we further propose the introduction of a
lip-tongue key-value memory network into the AV-SE model. This network enables
the retrieval of tongue features based on readily available lip features,
thereby assisting the subsequent speech enhancement task. Experimental results
demonstrate that both methods significantly improve the quality and
intelligibility of the enhanced speech compared to traditional lip-based AV-SE
baselines. Moreover, both proposed methods exhibit strong generalization
performance on unseen speakers and in the presence of unseen noises.
Furthermore, phone error rate (PER) analysis of automatic speech recognition
(ASR) reveals that while all phonemes benefit from introducing ultrasound
tongue images, palatal and velar consonants benefit most.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:2305.1493
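As a rough illustration of the lip-tongue key-value memory, the sketch below stores learnable lip-space keys and tongue-space values; readily available lip features query the memory and the attention weights retrieve tongue-like features. Slot count, dimensions, and scaling are assumptions, and the teacher-student distillation described above is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipTongueMemory(nn.Module):
    """Key-value memory: lip-space keys paired with tongue-space values."""
    def __init__(self, slots=512, lip_dim=256, tongue_dim=256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(slots, lip_dim))
        self.values = nn.Parameter(torch.randn(slots, tongue_dim))

    def forward(self, lip_feat):
        # lip_feat: (B, T, lip_dim) frame-level lip features from the video encoder.
        scores = lip_feat @ self.keys.t() / self.keys.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)           # (B, T, slots)
        return attn @ self.values                  # (B, T, tongue_dim) retrieved features

# The retrieved tongue-like features would be fused with the audio and lip streams in
# the AV-SE model; during training they could additionally be supervised by a
# pre-trained audio-lip-tongue teacher, e.g. with an MSE distillation loss.
memory = LipTongueMemory()
tongue_like = memory(torch.randn(2, 75, 256))
print(tongue_like.shape)   # torch.Size([2, 75, 256])
```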
Enhanced Multimodal Representation Learning with Cross-modal KD
This paper explores the task of leveraging auxiliary modalities that are
only available during training to enhance multimodal representation learning
through cross-modal Knowledge Distillation (KD). The widely adopted mutual
information maximization-based objective leads to a short-cut solution of the
weak teacher, i.e., achieving the maximum mutual information by simply making
the teacher model as weak as the student model. To prevent such a weak
solution, we introduce an additional objective term, i.e., the mutual
information between the teacher and the auxiliary modality model. In addition, to
narrow down the information gap between the student and teacher, we further
propose to minimize the conditional entropy of the teacher given the student.
Novel training schemes based on contrastive learning and adversarial learning
are designed to optimize the mutual information and the conditional entropy,
respectively. Experimental results on three popular multimodal benchmark
datasets have shown that the proposed method outperforms a range of
state-of-the-art approaches for video recognition, video retrieval and emotion
classification.
Comment: Accepted by CVPR 2023
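A compact sketch may help relate the named objectives to code: InfoNCE-style contrastive terms serve as lower bounds on the two mutual-information objectives (teacher-student and teacher-auxiliary), while a predictor from student to teacher stands in for driving down the conditional entropy H(teacher | student), which the paper optimizes adversarially. Temperatures, weights, and the predictor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(x, y, temperature=0.1):
    """Contrastive lower bound on I(X; Y) for aligned pairs within a batch."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    labels = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, labels)

def kd_objective(teacher, student, auxiliary, predictor, w_aux=1.0, w_ent=0.5):
    mi_teacher_student = info_nce(teacher, student)   # usual cross-modal KD term
    mi_teacher_aux = info_nce(teacher, auxiliary)     # blocks the weak-teacher shortcut
    # Proxy for minimising H(teacher | student): teacher predictable from student.
    cond_ent = F.mse_loss(predictor(student), teacher.detach())
    return mi_teacher_student + w_aux * mi_teacher_aux + w_ent * cond_ent

# Dummy usage with 128-d representations and a linear predictor.
teacher = torch.randn(8, 128, requires_grad=True)
student = torch.randn(8, 128, requires_grad=True)
auxiliary = torch.randn(8, 128)
loss = kd_objective(teacher, student, auxiliary, predictor=nn.Linear(128, 128))
loss.backward()
```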
Multimodal Transformer Distillation for Audio-Visual Synchronization
Audio-visual synchronization aims to determine whether the mouth movements
and speech in a video are synchronized. VocaLiST reaches state-of-the-art
performance by incorporating multimodal Transformers to model audio-visual
interaction information. However, it requires high computing resources, making
it impractical for real-world applications. This paper proposes MTDVocaLiST, a
model trained with our proposed multimodal Transformer distillation (MTD) loss.
The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention
distribution and value-relation in the Transformer of VocaLiST.
Our proposed method is effective in two respects. From the distillation-method
perspective, the MTD loss outperforms other strong distillation baselines. From
the distilled model's performance perspective: 1) MTDVocaLiST outperforms the
similar-size SOTA models SyncNet and PM by 15.69% and 3.39%, respectively; 2)
MTDVocaLiST reduces the model size of VocaLiST by 83.52% while maintaining
similar performance.
Comment: Submitted to ICASSP 202
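The MTD loss is described only at a high level above, so here is a hedged sketch of one plausible form: a KL term over the teacher's cross-attention distributions plus a KL term over "value-relation" matrices (pairwise similarities of value vectors), in the general spirit of MiniLM-style relation transfer. Shapes, the epsilon, and the weighting are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def mtd_loss(student_attn, teacher_attn, student_values, teacher_values, w_vr=1.0):
    """student_attn/teacher_attn: (B, H, Tq, Tk) cross-attention probabilities.
    student_values/teacher_values: (B, H, Tk, Dh) per-head value vectors."""
    eps = 1e-8
    # Mimic the teacher's cross-attention distribution.
    attn_term = F.kl_div(torch.log(student_attn + eps), teacher_attn,
                         reduction="batchmean")
    # Mimic the value-relation: pairwise similarity structure of the value vectors.
    def value_relation(v):
        return F.softmax(v @ v.transpose(-1, -2) / v.shape[-1] ** 0.5, dim=-1)
    vr_term = F.kl_div(torch.log(value_relation(student_values) + eps),
                       value_relation(teacher_values), reduction="batchmean")
    return attn_term + w_vr * vr_term

# Dummy usage: batch 2, 4 heads, 20 audio frames attending over 25 video frames.
s_attn = torch.softmax(torch.randn(2, 4, 20, 25, requires_grad=True), dim=-1)
t_attn = torch.softmax(torch.randn(2, 4, 20, 25), dim=-1)
s_val = torch.randn(2, 4, 25, 64, requires_grad=True)
t_val = torch.randn(2, 4, 25, 64)
mtd_loss(s_attn, t_attn, s_val, t_val).backward()
```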