Encoding of phonology in a recurrent neural model of grounded speech
We study the representation and encoding of phonemes in a recurrent neural
network model of grounded speech. We use a model which processes images and
their spoken descriptions, and projects the visual and auditory representations
into the same semantic space. We perform a number of analyses on how
information about individual phonemes is encoded in the MFCC features extracted
from the speech signal, and the activations of the layers of the model. Via
experiments with phoneme decoding and phoneme discrimination we show that
phoneme representations are most salient in the lower layers of the model,
where low-level signals are processed at a fine-grained level, although a large
amount of phonological information is retained at the top recurrent layer. We
further find that the attention mechanism following the top recurrent layer
significantly attenuates encoding of phonology and makes the utterance
embeddings much more invariant to synonymy. Moreover, a hierarchical clustering
of phoneme representations learned by the network shows an organizational
structure of phonemes similar to those proposed in linguistics.
Comment: Accepted at CoNLL 201
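The phoneme decoding experiments described in this abstract can be illustrated with a diagnostic classifier (a linear probe) trained on layer activations. The sketch below is not the authors' setup: it swaps in a ridge-regression probe written in NumPy, and the "layers" are synthetic stand-ins (the names `strong_signal` and `weak_signal`, the dimensions, and the noise model are all assumptions made for illustration). It shows only the general idea that a layer encoding phoneme identity more saliently yields higher probe accuracy.

```python
import numpy as np


def fit_linear_probe(X, y, n_classes, l2=1e-2):
    """Fit a ridge-regression probe from activations X to one-hot labels y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[y]                       # one-hot targets
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])       # regularized normal equations
    return np.linalg.solve(A, Xb.T @ Y)


def probe_accuracy(W, X, y):
    """Fraction of items whose highest-scoring class matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))


def synthetic_activations(rng, means, y, signal):
    """Hypothetical layer activations: scaled class means plus unit noise."""
    return signal * means[y] + rng.normal(size=(y.shape[0], means.shape[1]))


rng = np.random.default_rng(0)
n_classes, dim = 5, 20  # assumed: 5 "phoneme" classes, 20-dim activations
y_train = rng.integers(0, n_classes, size=1000)
y_test = rng.integers(0, n_classes, size=500)

accuracies = {}
for name, signal in [("strong_signal", 3.0), ("weak_signal", 0.1)]:
    means = rng.normal(size=(n_classes, dim))
    X_train = synthetic_activations(rng, means, y_train, signal)
    X_test = synthetic_activations(rng, means, y_test, signal)
    W = fit_linear_probe(X_train, y_train, n_classes)
    accuracies[name] = probe_accuracy(W, X_test, y_test)

print(accuracies)
```

With real data, `X_train` and `X_test` would be activations extracted from one layer of the model, and the loop would run over layers instead of synthetic signal strengths.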
Analyzing analytical methods: The case of phonology in neural models of spoken language
Despite the fast development of analysis techniques for NLP and speech
processing systems, few systematic studies have been conducted to compare the
strengths and weaknesses of each method. As a step in this direction we study
the case of representations of phonology in neural network models of spoken
language. We use two commonly applied analytical techniques, diagnostic
classifiers and representational similarity analysis, to quantify to what
extent neural activation patterns encode phonemes and phoneme sequences. We
manipulate two factors that can affect the outcome of analysis. First, we
investigate the role of learning by comparing neural activations extracted from
trained versus randomly-initialized models. Second, we examine the temporal
scope of the activations by probing both local activations corresponding to a
few milliseconds of the speech signal, and global activations pooled over the
whole utterance. We conclude that reporting analysis results with randomly
initialized models is crucial, and that global-scope methods tend to yield more
consistent results; we recommend their use as a complement to local-scope
diagnostic methods.
Comment: ACL 202
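Representational similarity analysis (RSA), one of the two techniques named in this abstract, can be sketched as follows: compute pairwise similarities among the same items in two representation spaces, then correlate the two sets of similarities. The code is a minimal illustration under stated assumptions, not the authors' implementation: it uses cosine similarity and Pearson correlation, and the data are synthetic. In particular, `trained_like` and `random_like` are hypothetical stand-ins; pure noise is a simplification of a randomly initialized model, which would still transform its input.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T


def rsa_score(A, B):
    """Pearson correlation between the upper triangles of two similarity
    matrices computed over the same items (rows of A and B are aligned)."""
    iu = np.triu_indices(A.shape[0], k=1)  # skip the diagonal and duplicates
    sim_a = cosine_similarity_matrix(A)[iu]
    sim_b = cosine_similarity_matrix(B)[iu]
    return float(np.corrcoef(sim_a, sim_b)[0, 1])


rng = np.random.default_rng(0)
n_items, ref_dim, act_dim = 50, 10, 200  # assumed sizes for illustration

# Reference representation of each utterance (e.g. derived from phonemes).
reference = rng.normal(size=(n_items, ref_dim))

# Hypothetical activations that preserve the reference geometry, plus noise.
projection = rng.normal(size=(ref_dim, act_dim)) / np.sqrt(act_dim)
trained_like = reference @ projection + 0.1 * rng.normal(size=(n_items, act_dim))

# Hypothetical baseline activations unrelated to the reference.
random_like = rng.normal(size=(n_items, act_dim))

score_trained = rsa_score(reference, trained_like)
score_random = rsa_score(reference, random_like)
print(score_trained, score_random)
```

Reporting `score_random` next to `score_trained` mirrors the abstract's point that a score is only interpretable relative to a baseline.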
Symbolic inductive bias for visually grounded learning of spoken language
A widespread approach to processing spoken language is to first automatically
transcribe it into text. An alternative is to use an end-to-end approach:
recent works have proposed to learn semantic embeddings of spoken language from
images with spoken captions, without an intermediate transcription step. We
propose to use multitask learning to exploit existing transcribed speech within
the end-to-end setting. We describe a three-task architecture which combines
the objectives of matching spoken captions with corresponding images, speech
with text, and text with images. We show that the addition of the speech/text
task leads to substantial performance improvements on image retrieval when
compared to training the speech/image task in isolation. We conjecture that
this is due to a strong inductive bias transcribed speech provides to the
model, and offer supporting evidence for this.
Comment: ACL 201
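The three-task objective described in this abstract (speech/image, speech/text, and text/image matching) can be sketched as a weighted sum of matching losses. The abstract does not specify the loss, so everything below is an assumption: a margin-based contrastive loss over a batch of paired embeddings, equal task weights, and the function names `contrastive_margin_loss` and `multitask_loss`, which are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_margin_loss(a, b, margin=0.2):
    """Row i of a matches row i of b; all other rows in the batch are negatives."""
    sim = l2_normalize(a) @ l2_normalize(b).T
    pos = np.diag(sim)
    cost_a = np.maximum(0.0, margin + sim - pos[:, None])  # a_i against wrong b_j
    cost_b = np.maximum(0.0, margin + sim - pos[None, :])  # b_j against wrong a_i
    off_diag = ~np.eye(sim.shape[0], dtype=bool)           # exclude true pairs
    total = cost_a[off_diag].sum() + cost_b[off_diag].sum()
    return float(total / sim.shape[0])


def multitask_loss(speech, image, text, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three matching losses (weights are an assumption)."""
    w_si, w_st, w_ti = weights
    return (w_si * contrastive_margin_loss(speech, image)
            + w_st * contrastive_margin_loss(speech, text)
            + w_ti * contrastive_margin_loss(text, image))


# Toy embeddings: perfectly aligned modalities versus a misaligned image batch.
aligned = np.eye(4)
misaligned_images = np.roll(aligned, 1, axis=0)

loss_aligned = multitask_loss(aligned, aligned, aligned)
loss_misaligned = multitask_loss(aligned, misaligned_images, aligned)
print(loss_aligned, loss_misaligned)
```

In an actual model, the three inputs would be outputs of separate encoders, with the speech encoder shared between the speech/image and speech/text tasks so that transcribed speech can shape the same parameters.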
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural models have become ubiquitous in automatic speech recognition systems.
While neural networks are typically used as acoustic models in more complex
systems, recent studies have explored end-to-end speech recognition systems
based on neural networks, which can be trained to directly predict text from
input acoustic features. Although such systems are conceptually elegant and
simpler than traditional systems, it is less obvious how to interpret the
trained models. In this work, we analyze the speech representations learned by
a deep end-to-end model that is based on convolutional and recurrent layers,
and trained with a connectionist temporal classification (CTC) loss. We use a
pre-trained model to generate frame-level features which are given to a
classifier that is trained on frame classification into phones. We evaluate
representations from different layers of the deep model and compare their
quality for predicting phone labels. Our experiments shed light on important
aspects of the end-to-end model such as layer depth, model complexity, and
other design choices.
Comment: NIPS 201
Wave to Syntax: Probing spoken language models for syntax
Understanding which information is encoded in deep models of spoken and
written language has been the focus of much research in recent years, as it is
crucial for debugging and improving these architectures. Most previous work has
focused on probing for speaker characteristics, acoustic and phonological
information in models of spoken language, and for syntactic information in
models of written language. Here we focus on the encoding of syntax in several
self-supervised and visually grounded models of spoken language. We employ two
complementary probing methods, combined with baselines and reference
representations to quantify the degree to which syntactic structure is encoded
in the activations of the target models. We show that syntax is captured most
prominently in the middle layers of the networks, and more explicitly within
models with more parameters.
Comment: Accepted to Interspeech 202
From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings
A core part of linguistic typology is the classification of languages
according to linguistic properties, such as those detailed in the World Atlas
of Language Structure (WALS). Doing this manually is prohibitively
time-consuming, which is in part evidenced by the fact that only 100 out of
over 7,000 languages spoken in the world are fully covered in WALS.
We learn distributed language representations, which can be used to predict
typological properties on a massively multilingual scale. Additionally,
quantitative and qualitative analyses of these language embeddings can tell us
how language similarities are encoded in NLP models for tasks at different
typological levels. The representations are learned in an unsupervised manner
alongside tasks at three typological levels: phonology (grapheme-to-phoneme
prediction, and phoneme reconstruction), morphology (morphological inflection),
and syntax (part-of-speech tagging).
We consider more than 800 languages and find significant differences in the
language representations encoded, depending on the target task. For instance,
although Norwegian Bokmål and Danish are typologically close to one
another, they are phonologically distant, which is reflected in their language
embeddings growing relatively distant in a phonological task. We are also able
to predict typological features in WALS with high accuracies, even for unseen
language families.
Comment: Accepted to NAACL 2018 (long paper). arXiv admin note: text overlap
with arXiv:1711.0546