Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep Neural Networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical level semantics in
order to model sentence level semantics. These findings demonstrate the
importance of visual information in semantics.
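As a rough illustration of the image-caption retrieval evaluation described above, the sketch below ranks captions for each image by cosine similarity in a shared embedding space and reports recall@k. It assumes the embeddings have already been produced by some encoder and that the image and caption at the same index form a matching pair. The function names `cosine` and `recall_at_k` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(image_embs, caption_embs, k):
    """Fraction of images whose matching caption (same index) ranks in the top k.

    Assumption: image i and caption i are a matching pair.
    """
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [cosine(img, cap) for cap in caption_embs]
        # Caption indices sorted from most to least similar to this image.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)
```

Caption-to-image retrieval can be scored the same way by swapping the two arguments.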
Comparing Transformers and RNNs on predicting human sentence processing data
Recurrent neural networks (RNNs) have long been an architecture of interest
for computational models of human sentence processing. The more recently
introduced Transformer architecture has been shown to outperform recurrent
neural networks on many natural language processing tasks, but little is known
about its ability to model human language processing. It has long been
thought that human sentence reading involves something akin to recurrence and
so RNNs may still have an advantage over the Transformer as a cognitive model.
In this paper we train both Transformer and RNN based language models and
compare their performance as models of human sentence processing. We use the
trained language models to compute surprisal values for the stimuli used in
several reading experiments and use mixed linear modelling to measure how well
the surprisal explains measures of human reading effort. Our analysis shows
that the Transformers outperform the RNNs as cognitive models in explaining
self-paced reading times and N400 strength but not gaze durations from an
eye-tracking experiment.
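For readers unfamiliar with the measure, the surprisal of a word is the negative log probability of that word given its preceding context. The sketch below is a minimal illustration that assumes a toy add-one-smoothed bigram model in place of the Transformer and RNN language models used in the paper. The names `train_bigram` and `surprisal` are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect bigram statistics from tokenised sentences (toy stand-in for a language model)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent          # "<s>" marks the sentence start
        vocab.update(sent)               # words that can be predicted
        contexts.update(tokens[:-1])     # how often each token serves as context
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def surprisal(sentence, contexts, bigrams, vocab_size):
    """Per-word surprisal in bits: -log2 P(word | previous word), add-one smoothed."""
    tokens = ["<s>"] + sentence
    values = []
    for prev, word in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        values.append(-math.log2(prob))
    return values
```

In a study of the kind described above, such per-word values would then serve as predictors of reading-effort measures in a regression model. That step is not shown here.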
Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge
Distributional semantic models capture word-level meaning that is useful in
many natural language processing tasks and have even been shown to capture
cognitive aspects of word meaning. The majority of these models are purely text
based, even though the human sensory experience is much richer. In this paper
we create visually grounded word embeddings by combining English text and
images and compare them to popular text-based methods, to see if visual
information allows our model to better capture cognitive aspects of word
meaning. Our analysis shows that visually grounded embedding similarities are
more predictive of the human reaction times in a large priming experiment than
the purely text-based embeddings. The visually grounded embeddings also
correlate well with human word similarity ratings. Importantly, in both
experiments we show that the grounded embeddings account for a unique portion
of explained variance, even when we include text-based embeddings trained on
huge corpora. This shows that visual grounding allows our model to capture
information that cannot be extracted using text as the only source of
information.
Semantic sentence similarity: size does not always matter
This study addresses the question of whether visually grounded speech
recognition (VGS) models learn to capture sentence semantics without access to
any prior linguistic knowledge. We produce synthetic and natural spoken
versions of a well-known semantic textual similarity database and show that our
VGS model produces embeddings that correlate well with human semantic
similarity judgements. Our results show that a model trained on a small
image-caption database outperforms two models trained on much larger databases,
indicating that database size is not all that matters. We also investigate the
importance of having multiple captions per image and find that this is indeed
helpful even if the total number of images is lower, suggesting that
paraphrasing is a valuable learning signal. While the general trend in the
field is to create ever larger datasets to train models on, our findings
indicate that other characteristics of the database can be just as important.
Comment: This paper has been accepted at Interspeech 2021 where it will be
presented and appear in the conference proceedings in September 202
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
We summarize the accomplishments of a multi-disciplinary workshop exploring
the computational and scientific issues surrounding the discovery of linguistic
units (subwords and words) in a language without orthography. We study the
replacement of orthographic transcriptions by images and/or translated text in
a well-resourced language to help unsupervised discovery from raw speech.
Comment: Accepted to ICASSP 201
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking rosetta" JSALT 2017 workshop
We summarize the accomplishments of a multidisciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.
The Role of Articulatory Feature Representation Quality in a Computational Model of Human Spoken-Word Recognition
Fine-Tracker is a speech-based model of human speech recognition. While previous work has shown that Fine-Tracker is successful at modelling aspects of human spoken-word recognition, its speech recognition performance is not comparable to human performance, possibly due to suboptimal intermediate articulatory feature (AF) representations. This study investigates the effect of improved AF representations, obtained using a state-of-the-art deep convolutional network, on Fine-Tracker's simulation and recognition performance. Although the improved AF quality resulted in improved speech recognition, it surprisingly did not lead to an improvement in Fine-Tracker's simulation power.
Learning to recognise words using visually grounded speech
We investigate word recognition in a Visually Grounded Speech model. The model has been trained on pairs of images and spoken captions to create visually grounded embeddings which can be used for speech-to-image retrieval and vice versa. We investigate whether such a model can be used to recognise words by embedding isolated words and using them to retrieve images of their visual referents. We investigate the time-course of word recognition using a gating paradigm and perform a statistical analysis to see whether well-known word competition effects in human speech processing influence word recognition. Our experiments show that the model is able to recognise words, and the gating paradigm reveals that words can also be recognised from partial input and that recognition is negatively influenced by word competition from the word-initial cohort.
Accepted author manuscript
Speech technology for unwritten languages
Speech technology plays an important role in our everyday life. Speech is used, among other things, for human-computer interaction, including, for instance, information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is difficult to create, because it cannot be built from the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this paper takes the first steps towards speech technology for unwritten languages. Specifically, the aim of this work was 1) to learn speech-to-meaning representations without using text as an intermediate representation, and 2) to test the sufficiency of the learned representations to regenerate speech or translated text, or to retrieve images that depict the meaning of an utterance in an unwritten language. The results suggest that building systems that go directly from speech-to-meaning and from meaning-to-speech, bypassing the need for text, is possible.