Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity-sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 201
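The abstract does not include an implementation, but the core training signal described here, a contrastive objective over face/voice pairs taken from the same video, with a curriculum that gradually focuses on harder negatives, can be sketched as follows. This is a minimal illustration: the encoder architectures, margin value, and hardness schedule are assumptions, not the authors' configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedder(nn.Module):
    """Toy face/voice encoders projecting into a shared embedding space."""
    def __init__(self, face_dim=512, voice_dim=40, emb_dim=128):
        super().__init__()
        self.face_net = nn.Sequential(nn.Linear(face_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
        self.voice_net = nn.Sequential(nn.Linear(voice_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, face, voice):
        # L2-normalise so similarities are comparable across modalities
        return F.normalize(self.face_net(face), dim=-1), F.normalize(self.voice_net(voice), dim=-1)

def curriculum_contrastive_loss(f, v, hardness, margin=0.6):
    """Contrastive loss with in-batch negatives; `hardness` in (0, 1] controls
    what fraction of the hardest negatives is kept, mimicking a curriculum
    that gradually concentrates on harder pairs."""
    sim = f @ v.t()                            # cosine similarities (rows: faces, cols: voices)
    pos = sim.diag()                           # matched face/voice pairs from the same clip
    neg = sim.masked_fill(torch.eye(len(f), dtype=torch.bool), float('-inf'))
    k = max(1, int(hardness * (len(f) - 1)))   # curriculum: keep only the k hardest negatives
    hard_neg, _ = neg.topk(k, dim=1)
    return F.relu(margin - pos.unsqueeze(1) + hard_neg).mean()

# Usage: faces and voices from the same clip form positive pairs, so no identity labels are needed.
model = CrossModalEmbedder()
f, v = model(torch.randn(32, 512), torch.randn(32, 40))
loss = curriculum_contrastive_loss(f, v, hardness=0.25)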
What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis
End-to-end DNN architectures have pushed the state-of-the-art in speech
technologies, as well as in other spheres of AI, leading researchers to train
more complex and deeper models. These improvements came at the cost of
transparency. DNNs are innately opaque and difficult to interpret. We no longer
understand what features are learned, where they are preserved, and how they
inter-operate. Such an analysis is important for better model understanding,
debugging and to ensure fairness in ethical decision making. In this work, we
analyze the representations trained within deep speech models, towards the tasks
of speaker recognition, dialect identification and reconstruction of masked
signals. We carry out a layer- and neuron-level analysis on the utterance-level
representations captured within pretrained speech models for speaker, language
and channel properties. We study: is this information captured in the learned
representations? where is it preserved? how is it distributed? and can we
identify a minimal subset of the network that possesses this information? Using
diagnostic classifiers, we answer these questions. Our results reveal: (i)
channel and gender information is omnipresent and is redundantly distributed;
(ii) complex properties such as dialectal information are encoded only in the
task-oriented pretrained network and are localised in the upper layers; (iii) a
minimal subset of neurons can be extracted to encode the predefined property;
(iv) salient neurons are sometimes shared between properties and can highlight
the presence of biases in the network. Our cross-architectural comparison indicates
that (v) the pretrained models capture speaker-invariant information and (vi)
the pretrained CNN models are competitive with Transformers for encoding
information about the studied properties. To the best of our knowledge, this is
the first study to investigate neuron analysis on speech models.
Comment: Submitted to CSL. Keywords: Speech, Neuron Analysis,
Interpretability, Diagnostic Classifier, AI explainability, End-to-End
Architectur
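A diagnostic classifier of the kind used in this analysis is, in essence, a lightweight probe trained on frozen per-layer utterance representations: its accuracy indicates how strongly a property is encoded at a given layer, and its weights can be used to rank salient neurons. The sketch below assumes precomputed layer features and stand-in labels; the probe type (logistic regression) and the neuron-ranking heuristic are illustrative, not the paper's exact setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def probe_layer(layer_reps, labels, seed=0):
    """Train a linear diagnostic classifier on frozen utterance-level
    representations from one layer; its held-out accuracy indicates how much
    of the property (speaker, language, channel, ...) that layer encodes."""
    X_tr, X_te, y_tr, y_te = train_test_split(layer_reps, labels, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te)), clf.coef_

# Hypothetical setup: reps[l] holds (num_utterances, dim) features extracted from layer l
reps = {l: np.random.randn(500, 768) for l in range(12)}
gender = np.random.randint(0, 2, 500)              # stand-in property labels
for l, feats in reps.items():
    acc, weights = probe_layer(feats, gender)
    # Neuron-level analysis: rank neurons by the magnitude of their probe weights
    salient = np.argsort(-np.abs(weights).sum(axis=0))[:50]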
Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Human speech can be characterized by different components, including semantic
content, speaker identity and prosodic information. Significant progress has
been made in disentangling representations for semantic content and speaker
identity in Automatic Speech Recognition (ASR) and speaker verification tasks
respectively. However, it remains an open and challenging research question to
extract prosodic information because of the intrinsic association of different
attributes, such as timbre and rhythm, and because of the need for supervised
training schemes to achieve robust large-scale and speaker-independent ASR. The
aim of this paper is to address the disentanglement of emotional prosody from
speech based on unsupervised reconstruction. Specifically, we identify, design,
implement and integrate three crucial components in our proposed speech
reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech
signals into discrete units for semantic content, (2) a pretrained speaker
verification model to generate speaker identity embeddings, and (3) a trainable
prosody encoder to learn prosody representations. We first pretrain the
Prosody2Vec representations on unlabelled emotional speech corpora, then
fine-tune the model on specific datasets to perform Speech Emotion Recognition
(SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and
unweighted accuracies) and subjective (mean opinion score) evaluations on the
EVC task suggest that Prosody2Vec effectively captures general prosodic
features that can be smoothly transferred to other emotional speech. In
addition, our SER experiments on the IEMOCAP dataset reveal that the prosody
features learned by Prosody2Vec are complementary and beneficial for the
performance of widely used speech pretraining models and surpass the
state-of-the-art methods when combining Prosody2Vec with HuBERT
representations.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language
Processin
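As a rough illustration of how the three components listed in the abstract could be wired together for unsupervised reconstruction, the sketch below uses stub modules: an embedding table standing in for the discrete-unit encoder, an externally supplied (frozen) speaker embedding, and a small recurrent prosody encoder. All module choices and dimensions here are assumptions; the actual Prosody2Vec architecture may differ substantially.

import torch
import torch.nn as nn

class Prosody2VecSketch(nn.Module):
    """Illustrative wiring of the three components described in the abstract:
    discrete content units + frozen speaker embedding + trainable prosody
    encoder, decoded back into speech features for unsupervised reconstruction."""
    def __init__(self, n_units=100, unit_dim=256, spk_dim=192, pros_dim=128, mel_dim=80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)                 # (1) discrete semantic units
        self.prosody_enc = nn.GRU(mel_dim, pros_dim, batch_first=True)  # (3) trainable prosody encoder
        self.decoder = nn.GRU(unit_dim + spk_dim + pros_dim, mel_dim, batch_first=True)

    def forward(self, units, spk_emb, mels):
        # spk_emb is assumed to come from a *frozen* pretrained speaker-verification model (2)
        content = self.unit_emb(units)                                  # (B, T, unit_dim)
        _, prosody = self.prosody_enc(mels)                             # (1, B, pros_dim) utterance-level prosody
        prosody = prosody[-1].unsqueeze(1).expand(-1, units.size(1), -1)
        spk = spk_emb.unsqueeze(1).expand(-1, units.size(1), -1)
        recon, _ = self.decoder(torch.cat([content, spk, prosody], dim=-1))
        return recon                                                    # reconstruction target: the input mels

model = Prosody2VecSketch()
mels = torch.randn(4, 120, 80)
recon = model(torch.randint(0, 100, (4, 120)), torch.randn(4, 192), mels)
loss = nn.functional.l1_loss(recon, mels)   # unsupervised reconstruction objective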
End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining
The SOTA in transcription of disfluent and conversational speech has in
recent years favored two-stage models, with separate transcription and cleaning
stages. We believe that previous attempts at end-to-end disfluency removal have
fallen short because of the representational advantage that large-scale
language model pretraining has given to lexical models. Until recently, the
high dimensionality and limited availability of large audio datasets inhibited
the development of large-scale self-supervised pretraining objectives for
learning effective audio representations, giving a relative advantage to the
two-stage approach, which utilises pretrained representations for lexical
tokens. In light of recent successes in large scale audio pretraining, we
revisit the performance comparison between two-stage and end-to-end models and
find that audio based language models pretrained using weak self-supervised
objectives match or exceed the performance of similarly trained two-stage
models, and further, that the choice of pretraining objective substantially
affects a model's ability to be adapted to the disfluency removal task.
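The comparison being revisited can be summarised as two pipelines, sketched below with hypothetical stand-in components (`asr`, `cleaner`, and `e2e_model` are placeholders, not the authors' models).

def two_stage(audio, asr, cleaner):
    """Stage 1: transcribe verbatim, disfluencies included.
    Stage 2: a lexical model, typically built on pretrained text LMs, removes them."""
    verbatim = asr(audio)
    return cleaner(verbatim)

def end_to_end(audio, e2e_model):
    """A single audio model, pretrained with a self-supervised objective and then
    fine-tuned to emit the cleaned transcript directly."""
    return e2e_model(audio)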
On the Robustness of Arabic Speech Dialect Identification
Arabic dialect identification (ADI) tools are an important part of the
large-scale data collection pipelines necessary for training speech recognition
models. As these pipelines require application of ADI tools to potentially
out-of-domain data, we aim to investigate how vulnerable the tools may be to
this domain shift. With self-supervised learning (SSL) models as a starting
point, we evaluate transfer learning and direct classification from SSL
features. We undertake our evaluation under rich conditions, with the goal of
developing ADI systems from pretrained models and ultimately evaluating performance
on newly collected data. In order to understand what factors contribute to
model decisions, we carry out a careful human study of a subset of our data.
Our analysis confirms that domain shift is a major challenge for ADI models. We
also find that while self-training does alleviate this challenge, it may be
insufficient for realistic conditions.
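One of the two routes evaluated, direct classification from SSL features, amounts to pooling frozen self-supervised frame representations and training a small classifier on top. The sketch below is a generic version of that setup; the feature dimension, pooling choice, and classifier head are assumptions rather than the paper's configuration.

import torch
import torch.nn as nn

class SSLFeatureClassifier(nn.Module):
    """Dialect classifier on top of frozen self-supervised speech features:
    mean-pool the frame-level SSL representations, then apply a small MLP."""
    def __init__(self, feat_dim=768, n_dialects=5):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_dialects))

    def forward(self, ssl_frames):                  # (B, T, feat_dim) from a frozen SSL encoder
        return self.head(ssl_frames.mean(dim=1))    # utterance-level pooling

clf = SSLFeatureClassifier()
logits = clf(torch.randn(8, 300, 768))              # stand-in features for 8 utterances
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (8,)))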
Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding
A number of methods have been proposed for End-to-End Spoken Language
Understanding (E2E-SLU) using pretrained models; however, their evaluation often
lacks a multilingual setup and tasks that require prediction of lexical fillers,
such as slot filling. In this work, we propose a unified method that integrates
multilingual pretrained speech and text models and performs E2E-SLU on six
datasets in four languages in a generative manner, including the prediction of
lexical fillers. We investigate how the proposed method can be improved by
pretraining on widely available speech recognition data using several training
objectives. Pretraining on 7000 hours of multilingual data allows us to
outperform the state of the art on two SLU datasets and to do so in part on
two more SLU datasets. Finally, we examine the cross-lingual capabilities of
the proposed model and improve on the best known result on the
PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate
of 23.65%.
Comment: IEEE Workshop on Automatic Speech Recognition and Understanding
(ASRU) 202
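"Generative" E2E-SLU here means that the semantic frame, including lexical fillers such as slot values, is emitted as a token sequence conditioned on the speech input. A toy encoder-decoder capturing that shape is sketched below; the real system builds on multilingual pretrained speech and text models rather than the small recurrent stubs used here, and all names and dimensions are assumptions.

import torch
import torch.nn as nn

class GenerativeSLUSketch(nn.Module):
    """Illustrative encoder-decoder for generative E2E-SLU: a speech encoder
    feeds a text decoder that emits the semantic frame as a token sequence,
    e.g. 'intent=book_flight city=Paris'."""
    def __init__(self, feat_dim=80, hid=256, vocab=1000):
        super().__init__()
        self.speech_enc = nn.GRU(feat_dim, hid, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.text_dec = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, feats, prev_tokens):
        _, h = self.speech_enc(feats)                    # summarise the utterance
        dec_out, _ = self.text_dec(self.embed(prev_tokens), h)
        return self.out(dec_out)                         # logits over the output vocabulary

model = GenerativeSLUSketch()
logits = model(torch.randn(2, 200, 80), torch.randint(0, 1000, (2, 12)))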
A Study of Low-Resource Speech Commands Recognition based on Adversarial Reprogramming
In this study, we propose a novel adversarial reprogramming (AR) approach for
low-resource spoken command recognition (SCR), and build an AR-SCR system. The
AR procedure aims to modify the acoustic signals (from the target domain) to
repurpose a pretrained SCR model (from the source domain). To solve the label
mismatches between source and target domains, and further improve the stability
of AR, we propose a novel similarity-based label mapping technique to align
classes. In addition, the transfer learning (TL) technique is combined with the
original AR process to improve the model adaptation capability. We evaluate the
proposed AR-SCR system on three low-resource SCR datasets, including Arabic,
Lithuanian, and dysarthric Mandarin speech. Experimental results show that with
a pretrained acoustic model (AM) trained on a large-scale English dataset, the proposed AR-SCR
system outperforms the current state-of-the-art results on Arabic and
Lithuanian speech commands datasets, with only a limited amount of training
data.
Comment: Submitted to ICASSP 202
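Adversarial reprogramming in this setting learns an input-space perturbation that is added to target-domain audio so that a frozen source-domain classifier can be reused, together with a mapping from target classes onto source classes. The sketch below illustrates both ideas with stand-in tensors; the perturbation parameterisation and the prototype-based similarity mapping are assumptions about the general technique, not the paper's exact method.

import torch
import torch.nn as nn

class AdversarialReprogram(nn.Module):
    """Learn a trainable perturbation added to target-domain audio so that a
    *frozen* source-domain command classifier can be reused for new classes."""
    def __init__(self, frozen_model, wav_len=16000):
        super().__init__()
        self.frozen_model = frozen_model.eval()
        for p in self.frozen_model.parameters():
            p.requires_grad_(False)
        self.delta = nn.Parameter(torch.zeros(wav_len))   # the only trainable part

    def forward(self, wav):
        return self.frozen_model(wav + self.delta)         # source-class logits

def similarity_label_map(source_protos, target_protos):
    """Assign each target class to its most similar source class
    (cosine similarity between class prototype embeddings)."""
    sim = nn.functional.normalize(target_protos, dim=-1) @ nn.functional.normalize(source_protos, dim=-1).t()
    return sim.argmax(dim=-1)                               # mapped source-class index per target class

# Usage with stand-in prototypes: map 10 target commands onto 35 source commands,
# then train only `delta` with cross-entropy against the mapped labels.
mapping = similarity_label_map(torch.randn(35, 128), torch.randn(10, 128))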