Analyzing deep CNN-based utterance embeddings for acoustic model adaptation
We explore why deep convolutional neural networks (CNNs) with small
two-dimensional kernels, primarily used for modeling spatial relations in
images, are also effective in speech recognition. We analyze the
representations learned by deep CNNs and compare them with deep neural network
(DNN) representations and i-vectors, in the context of acoustic model
adaptation. To explore whether interpretable information can be decoded from
the learned representations, we evaluate their ability to discriminate between
speakers, acoustic conditions, noise type, and gender using the Aurora-4
dataset. We extract both whole-model embeddings (to capture the information
learned across the whole network) and layer-specific embeddings, which enable
understanding of the flow of information across the network. We also use
learned representations as additional input to a time-delay neural network
(TDNN) on the Aurora-4 and MGB-3 English datasets. We find that deep CNN
embeddings outperform DNN embeddings for acoustic model adaptation, and that
auxiliary features based on deep CNN embeddings result in word error rates
similar to those of i-vectors. Comment: accepted to SLT 201
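As a rough illustration of this adaptation recipe, the sketch below (a minimal sketch, assuming a toy two-layer CNN with small 2D kernels, mean pooling over frequency and time, and simple concatenation, rather than the authors' setup) shows how a layer-specific utterance embedding could be extracted and appended to frame-level features as auxiliary TDNN input, much as i-vectors are commonly appended.

```python
# Minimal sketch (not the authors' code): layer-specific utterance embeddings
# from a small 2D-kernel CNN, appended to frame features as auxiliary input.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_senones: int = 2000):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU())
        self.head = nn.Linear(64, n_senones)

    def forward(self, x):                          # x: (batch, 1, freq, time)
        h1 = self.conv1(x)
        h2 = self.conv2(h1)
        return self.head(h2.mean(dim=(2, 3))), (h1, h2)

def layer_embeddings(model, spec):
    """Mean-pool each conv layer's activations into one vector per utterance."""
    with torch.no_grad():
        _, layers = model(spec)
    return [h.mean(dim=(2, 3)) for h in layers]    # list of (batch, channels)

model = SmallCNN()
spec = torch.randn(1, 1, 40, 200)                  # fake 40-dim log-mel, 200 frames
emb = layer_embeddings(model, spec)[-1]            # last conv layer, (1, 64)

# Append the utterance embedding to every frame, as auxiliary TDNN input.
frames = torch.randn(1, 200, 40)                   # frame-level features (B, T, F)
aux = emb.unsqueeze(1).expand(-1, frames.size(1), -1)
tdnn_input = torch.cat([frames, aux], dim=-1)      # (1, 200, 40 + 64)
print(tdnn_input.shape)
```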
What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis
End-to-end DNN architectures have pushed the state-of-the-art in speech
technologies, as well as in other spheres of AI, leading researchers to train
more complex and deeper models. These improvements came at the cost of
transparency. DNNs are innately opaque and difficult to interpret. We no longer
understand what features are learned, where they are preserved, and how they
inter-operate. Such analysis is important for better model understanding,
debugging, and for ensuring fairness in ethical decision making. In this work, we
analyze the representations trained within deep speech models, towards the tasks
of speaker recognition, dialect identification, and reconstruction of masked
signals. We carry out a layer- and neuron-level analysis of the utterance-level
representations captured within pretrained speech models for speaker, language,
and channel properties. We ask: is this information captured in the learned
representations? Where is it preserved? How is it distributed? And can we
identify a minimal subset of the network that possesses this information? Using
diagnostic classifiers, we answer these questions. Our results reveal: (i)
channel and gender information is omnipresent and redundantly distributed;
(ii) complex properties such as dialectal information are encoded only in the
task-oriented pretrained network and are localised in the upper layers; (iii) a
minimal subset of neurons can be extracted to encode a predefined property;
(iv) salient neurons are sometimes shared between properties and can highlight
the presence of biases in the network. Our cross-architectural comparison indicates
that (v) the pretrained models capture speaker-invariant information and (vi)
pretrained CNN models are competitive with Transformers for encoding
information about the studied properties. To the best of our knowledge, this is
the first study to investigate neuron-level analysis of speech models. Comment: Submitted to CSL. Keywords: Speech, Neuron Analysis,
Interpretability, Diagnostic Classifier, AI Explainability, End-to-End
Architecture
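The diagnostic-classifier methodology can be sketched roughly as follows; the synthetic data, the layer choice, and the top-k neuron ranking heuristic are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal probing sketch: a linear diagnostic classifier over frozen embeddings,
# with a shuffled-label control and a top-k neuron re-probe. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))   # stand-in for one layer's utterance embeddings
labels = rng.integers(0, 2, size=1000)      # stand-in property labels (e.g. channel or gender)

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2, random_state=0)

# If the probe beats a control trained on shuffled labels, the property is
# linearly decodable from (i.e. encoded in) the representation.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
control = LogisticRegression(max_iter=1000).fit(X_tr, rng.permutation(y_tr))
print("probe accuracy:  ", probe.score(X_te, y_te))
print("control accuracy:", control.score(X_te, y_te))

# Neuron-level analysis: rank dimensions by |weight| and re-probe with only the
# top-k neurons to look for a minimal subset that carries the property.
top_k = np.argsort(np.abs(probe.coef_[0]))[::-1][:32]
subset = LogisticRegression(max_iter=1000).fit(X_tr[:, top_k], y_tr)
print("top-32-neuron accuracy:", subset.score(X_te[:, top_k], y_te))
```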
Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which has so far received no attention from the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. In contrast to prior work on KWS,
which tries to learn word representations merely from sequences of graphemes
(i.e., letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods. Comment: Accepted at ECCV-201
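A much-simplified sketch of this idea follows; the single-GRU keyword encoder, module sizes, and max-over-time scoring are assumptions, whereas the actual system uses a spatiotemporal ResNet front-end and a full sequence-to-sequence grapheme-to-phoneme model.

```python
# Simplified sketch: encode the query word from its graphemes, then let a
# recurrent stack correlate that embedding with per-frame visual features.
import torch
import torch.nn as nn

class KeywordEncoder(nn.Module):
    """Encodes a grapheme sequence into a single keyword vector."""
    def __init__(self, n_graphemes=30, dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_graphemes, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, chars):                 # chars: (batch, n_chars)
        _, h = self.rnn(self.emb(chars))
        return h[-1]                          # (batch, dim)

class VisualKWS(nn.Module):
    """Correlates per-frame visual features with the keyword representation."""
    def __init__(self, vis_dim=512, dim=128):
        super().__init__()
        self.keyword = KeywordEncoder(dim=dim)
        self.rnn = nn.GRU(vis_dim + dim, dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, visual, chars):         # visual: (batch, frames, vis_dim)
        k = self.keyword(chars).unsqueeze(1).expand(-1, visual.size(1), -1)
        out, _ = self.rnn(torch.cat([visual, k], dim=-1))
        return self.score(out).max(dim=1).values   # presence logit per clip

model = VisualKWS()
visual = torch.randn(2, 75, 512)              # e.g. 3 s of visual features at 25 fps
chars = torch.randint(0, 30, (2, 8))          # grapheme IDs of the query word
print(model(visual, chars).shape)             # torch.Size([2, 1])
```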
Multimodal Language Analysis with Recurrent Multistage Fusion
Computational modeling of human multimodal language is an emerging research
area in natural language processing spanning the language, visual and acoustic
modalities. Comprehending multimodal language requires modeling not only the
interactions within each modality (intra-modal interactions) but more
importantly the interactions between modalities (cross-modal interactions). In
this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which
decomposes the fusion problem into multiple stages, each of them focused on a
subset of multimodal signals for specialized, effective fusion. Cross-modal
interactions are modeled using this multistage fusion approach which builds
upon intermediate representations of previous stages. Temporal and intra-modal
interactions are modeled by integrating our proposed fusion approach with a
system of recurrent neural networks. The RMFN displays state-of-the-art
performance in modeling human multimodal language across three public datasets
relating to multimodal sentiment analysis, emotion recognition, and speaker
traits recognition. We provide visualizations to show that each stage of fusion
focuses on a different subset of multimodal signals, learning increasingly
discriminative multimodal representations. Comment: EMNLP 201
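A condensed sketch of the multistage fusion idea is shown below; the sigmoid highlight gate, the dimensions, and the LSTM integration are assumptions in the spirit of RMFN, not the published implementation.

```python
# Condensed sketch: at each time step, fuse the concatenated modality vectors
# over several stages, each stage highlighting a subset of signals and refining
# the fused state, which then drives a recurrent model over time.
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    def __init__(self, modal_dims=(300, 74, 35), dim=64, stages=3):
        super().__init__()
        total = sum(modal_dims)
        self.dim, self.stages = dim, stages
        # One highlight/fuse pair per stage, each conditioned on the previous fusion.
        self.highlight = nn.ModuleList([nn.Linear(total + dim, total) for _ in range(stages)])
        self.fuse = nn.ModuleList([nn.Linear(total + dim, dim) for _ in range(stages)])
        self.rnn = nn.LSTM(dim, dim, batch_first=True)   # temporal modeling

    def forward(self, language, visual, acoustic):       # each: (batch, time, d_m)
        x = torch.cat([language, visual, acoustic], dim=-1)
        batch, time, _ = x.shape
        fused_seq = []
        for t in range(time):
            z = x.new_zeros(batch, self.dim)              # fused state before stage 1
            for s in range(self.stages):
                cond = torch.cat([x[:, t], z], dim=-1)
                gate = torch.sigmoid(self.highlight[s](cond))   # highlight a signal subset
                z = torch.tanh(self.fuse[s](torch.cat([gate * x[:, t], z], dim=-1)))
            fused_seq.append(z)
        out, _ = self.rnn(torch.stack(fused_seq, dim=1))
        return out                                        # (batch, time, dim)

model = MultistageFusion()
lang, vis, ac = torch.randn(4, 20, 300), torch.randn(4, 20, 74), torch.randn(4, 20, 35)
print(model(lang, vis, ac).shape)                         # torch.Size([4, 20, 64])
```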