Evaluating the Representational Hub of Language and Vision Models
The multimodal models used in the emerging field at the intersection of
computational linguistics and computer vision implement the bottom-up
processing of the 'Hub and Spoke' architecture proposed in cognitive science to
represent how the brain processes and combines multi-sensory inputs. In
particular, the Hub is implemented as a neural network encoder. We investigate
the effect on this encoder of various vision-and-language tasks proposed in the
literature: visual question answering, visual reference resolution, and
visually grounded dialogue. To measure the quality of the representations
learned by the encoder, we use two kinds of analyses. First, we evaluate the
encoder pre-trained on the different vision-and-language tasks on an existing
diagnostic task designed to assess multimodal semantic understanding. Second,
we carry out a battery of analyses aimed at studying how the encoder merges and
exploits the two modalities.
Comment: Accepted at IWCS 2019.
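For a concrete picture of the setup being probed, below is a minimal sketch of what such a hub encoder might look like, assuming an LSTM language spoke, precomputed CNN image features, and simple concatenation-based fusion; all names and sizes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HubEncoder(nn.Module):
    """Illustrative 'Hub' that fuses a language spoke and a vision spoke
    into a single shared representation (not the authors' exact model)."""

    def __init__(self, vocab_size, embed_dim=256, img_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_spoke = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vis_spoke = nn.Linear(img_dim, hidden_dim)   # project CNN features
        self.hub = nn.Linear(2 * hidden_dim, hidden_dim)  # fuse the two spokes

    def forward(self, token_ids, img_feats):
        # token_ids: (batch, seq_len); img_feats: (batch, img_dim)
        _, (h_lang, _) = self.lang_spoke(self.embed(token_ids))
        h_vis = torch.relu(self.vis_spoke(img_feats))
        # Concatenate both modalities and map them into the shared hub space;
        # this fused vector is what the diagnostic analyses would probe.
        return torch.tanh(self.hub(torch.cat([h_lang[-1], h_vis], dim=-1)))
```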
MuST-Cinema: a Speech-to-Subtitles corpus
Growing needs in localising audiovisual content in multiple languages through
subtitles call for the development of automatic solutions for human subtitling.
Neural Machine Translation (NMT) can contribute to the automation of
subtitling, facilitating the work of human subtitlers and reducing turn-around
times and related costs. NMT requires high-quality, large, task-specific
training data. The existing subtitling corpora, however, are missing both
alignments to the source language audio and important information about
subtitle breaks. This poses a significant limitation for developing efficient
automatic approaches for subtitling, since the length and form of a subtitle
directly depend on the duration of the utterance. In this work, we present
MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
The corpus is comprised of (audio, transcription, translation) triplets.
Subtitle breaks are preserved by inserting special symbols. We show that the
corpus can be used to build models that efficiently segment sentences into
subtitles, and we propose a method for annotating existing subtitling corpora
with subtitle breaks that conform to length constraints.
Comment: Accepted at LREC 2020.
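As a rough illustration of the annotation scheme, the sketch below flattens subtitle blocks into a single string with explicit break symbols, assuming <eol> marks a line break inside a subtitle block and <eob> the end of a block; the tag names and the parsing are deliberately simplified stand-ins, not the corpus tooling itself.

```python
def annotate_breaks(blocks):
    """Flatten subtitle blocks into one string with explicit break symbols.

    blocks: list of subtitle blocks, each a list of display lines.
    <eol> marks a line break inside a block, <eob> the end of a block.
    """
    out = []
    for lines in blocks:
        out.append(" <eol> ".join(lines))
        out.append("<eob>")
    return " ".join(out)

blocks = [["I went to the cinema", "with my friends"],
          ["and we watched a film."]]
print(annotate_breaks(blocks))
# -> I went to the cinema <eol> with my friends <eob> and we watched a film. <eob>
```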
Emergent Communication Pretraining for Few-Shot Machine Translation
While state-of-the-art models that rely upon massively multilingual
pretrained encoders achieve sample efficiency in downstream applications, they
still require large amounts of unlabelled text. However, most of the
world's languages lack such resources. Hence, we investigate a more radical
form of unsupervised knowledge transfer in the absence of linguistic data. In
particular, for the first time we pretrain neural networks via emergent
communication from referential games. Our key assumption is that grounding
communication on images---as a crude approximation of real-world
environments---inductively biases the model towards learning natural languages.
On the one hand, we show that this substantially benefits machine translation
in few-shot settings. On the other hand, this also provides an extrinsic
evaluation protocol to probe the properties of emergent languages ex vitro.
Intuitively, the closer they are to natural languages, the higher the gains
from pretraining on them should be. For instance, in this work we measure the
influence of communication success and maximum sequence length on downstream
performance. Finally, we introduce a customised adapter layer and annealing
strategies for the regulariser of maximum-a-posteriori inference during
fine-tuning. These turn out to be crucial to facilitate knowledge transfer and
prevent catastrophic forgetting. Compared to a recurrent baseline, our method
yields substantial gains in BLEU score with only small numbers of NMT training
instances across four language pairs. These proof-of-concept results reveal the
potential of emergent communication pretraining for both natural language
processing tasks in resource-poor settings and extrinsic evaluation of
artificial languages.
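To make the referential-game setup concrete, here is a minimal PyTorch sketch of a sender that encodes an image into a discrete message and a receiver that scores candidate images against it. The sizes, the straight-through Gumbel-softmax relaxation, and all names are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Sender(nn.Module):
    """Maps an image feature to a discrete message (illustrative sizes)."""
    def __init__(self, img_dim=2048, vocab=32, max_len=5, hidden=256):
        super().__init__()
        self.proj = nn.Linear(img_dim, hidden)
        self.out = nn.Linear(hidden, vocab * max_len)
        self.vocab, self.max_len = vocab, max_len

    def forward(self, img):
        logits = self.out(torch.relu(self.proj(img)))
        logits = logits.view(-1, self.max_len, self.vocab)
        # Straight-through Gumbel-softmax keeps the message discrete yet
        # differentiable, a common choice in emergent-communication games.
        return torch.nn.functional.gumbel_softmax(logits, tau=1.0, hard=True)

class Receiver(nn.Module):
    """Scores candidate images against the message; the game is won when
    the target image outscores the distractors."""
    def __init__(self, img_dim=2048, vocab=32, max_len=5, hidden=256):
        super().__init__()
        self.msg = nn.Linear(vocab * max_len, hidden)
        self.img = nn.Linear(img_dim, hidden)

    def forward(self, message, candidates):
        m = self.msg(message.flatten(1))         # (batch, hidden)
        c = self.img(candidates)                 # (batch, n_cands, hidden)
        return torch.einsum("bh,bnh->bn", m, c)  # similarity scores
```

After training on the game, the sender/receiver weights would serve as the pretrained initialisation that is then fine-tuned for few-shot NMT.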
Dynamic Context-guided Capsule Network for Multimodal Machine Translation
Multimodal machine translation (MMT), which mainly focuses on enhancing
text-only translation with visual features, has attracted considerable
attention from both computer vision and natural language processing
communities. Most current MMT models resort to attention mechanisms, global
context modeling, or multimodal joint representation learning to utilize visual
features. However, attention mechanisms lack sufficient semantic interaction
between the modalities, while the other two approaches provide a fixed visual
context, which is unsuitable for modeling the variability observed when
generating a translation. To address the above issues, in this paper, we propose
a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at
each timestep of decoding, we first employ the conventional source-target
attention to produce a timestep-specific source-side context vector. Next, DCCN
takes this vector as input and uses it to guide the iterative extraction of
related visual features via a context-guided dynamic routing mechanism.
In particular, we represent the input image with both global and regional visual
features and introduce two parallel DCCNs to model multimodal context vectors
with visual features at different granularities. Finally, we obtain two
multimodal context vectors, which are fused and incorporated into the decoder
for the prediction of the target word. Experimental results on the Multi30K
dataset of English-to-German and English-to-French translation demonstrate the
superiority of DCCN. Our code is available at
https://github.com/DeepLearnXMU/MM-DCCN
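The following sketch paraphrases the context-guided dynamic routing idea: the decoder's timestep-specific context vector biases how regional visual capsules are iteratively aggregated. The dimensions, the agreement update, and all names are assumptions; the released code at the link above is the authoritative implementation.

```python
import torch
import torch.nn as nn

class ContextGuidedRouting(nn.Module):
    """Simplified sketch of context-guided dynamic routing for MMT
    (a paraphrase of the idea, not the released DCCN code)."""

    def __init__(self, vis_dim=512, ctx_dim=512, iters=3):
        super().__init__()
        self.ctx_proj = nn.Linear(ctx_dim, vis_dim)
        self.iters = iters

    def forward(self, vis_caps, ctx):
        # vis_caps: (batch, n_regions, vis_dim); ctx: (batch, ctx_dim)
        guide = self.ctx_proj(ctx).unsqueeze(1)        # (batch, 1, vis_dim)
        logits = torch.zeros(vis_caps.shape[:2], device=vis_caps.device)
        for _ in range(self.iters):
            weights = torch.softmax(logits, dim=1)     # routing coefficients
            fused = (weights.unsqueeze(-1) * vis_caps).sum(1, keepdim=True)
            # Agreement between each capsule and the context-guided summary
            # updates the routing logits, as in standard dynamic routing.
            logits = logits + (vis_caps * (fused + guide)).sum(-1)
        return fused.squeeze(1)  # timestep-specific multimodal context
```

In the full model, this fused vector would be recomputed at every decoding timestep and handed to the decoder for predicting the next target word.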
MULE: Multimodal Universal Language Embedding
Existing vision-language methods typically support two languages at a time at
most. In this paper, we present a modular approach which can easily be
incorporated into existing vision-language methods in order to support many
languages. We accomplish this by learning a single shared Multimodal Universal
Language Embedding (MULE) which has been visually-semantically aligned across
all languages. Then we learn to relate MULE to visual data as if it were a
single language. Unlike prior work, which typically learned separate branches
for each language, our method is not architecture-specific, enabling it to be
easily adapted to many vision-language methods and tasks. Since
MULE learns a single language branch in the multimodal model, we can also scale
to support many languages, and languages with fewer annotations can take
advantage of the good representation learned from other (more abundant)
language data. We demonstrate the effectiveness of MULE on the bidirectional
image-sentence retrieval task, supporting up to four languages in a single
model. In addition, we show that Machine Translation can be used for data
augmentation in multilingual learning, which, combined with MULE, improves mean
recall by up to 21.9% on a single language compared to prior work, with the
most significant gains seen on languages with relatively few annotations. Our
code is publicly available.
Comment: Accepted as an oral presentation at AAAI 2020.
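A minimal sketch of the idea follows, assuming per-language lookup tables projected into one shared space that a single language branch consumes; the layer sizes, pooling, and similarity head are illustrative guesses, not the released model.

```python
import torch
import torch.nn as nn

class MULESketch(nn.Module):
    """Illustrative take on a Multimodal Universal Language Embedding:
    every language keeps its own lookup table, but all are projected into
    one shared space consumed by a single language branch."""

    def __init__(self, vocab_sizes, embed_dim=300, shared_dim=512, img_dim=2048):
        super().__init__()
        self.embeds = nn.ModuleDict(
            {lang: nn.Embedding(v, embed_dim) for lang, v in vocab_sizes.items()}
        )
        self.to_universal = nn.Linear(embed_dim, shared_dim)  # MULE space
        self.lang_branch = nn.GRU(shared_dim, shared_dim, batch_first=True)
        self.img_branch = nn.Linear(img_dim, shared_dim)

    def forward(self, lang, token_ids, img_feats):
        universal = self.to_universal(self.embeds[lang](token_ids))
        _, h = self.lang_branch(universal)  # one branch for all languages
        sent, img = h[-1], self.img_branch(img_feats)
        # Cosine similarity for bidirectional image-sentence retrieval.
        return nn.functional.cosine_similarity(sent, img, dim=-1)
```

Because only the lookup tables are language-specific, adding a low-resource language costs one embedding table while reusing the visually grounded branch learned from the higher-resource languages.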
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
The milestone improvements brought about by deep representation learning and
pre-training techniques have led to large performance gains across downstream
NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large
high-quality visio-linguistic datasets for learning complementary information
(across image and text modalities). In this paper, we introduce the
Wikipedia-based Image Text (WIT) Dataset
(https://github.com/google-research-datasets/wit) to better facilitate
multimodal, multilingual learning. WIT is composed of a curated set of 37.6
million entity-rich image-text examples with 11.5 million unique images across
108 Wikipedia languages. Its size enables WIT to be used as a pretraining
dataset for multimodal models, as we show when applied to downstream tasks such
as image-text retrieval. WIT has four main, unique advantages. First, WIT is
the largest multimodal dataset by number of image-text examples, three times
larger than its closest counterpart (at the time of writing). Second, WIT is
massively multilingual, the first of its kind, covering more than 100 languages
(each with at least 12K examples) and providing cross-lingual texts for many
images. Third, WIT represents a more diverse set of concepts and real-world
entities than previous
datasets cover. Lastly, WIT provides a very challenging real-world test set, as
we empirically illustrate using an image-text retrieval task as an example.
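For readers who want to try the data, here is a minimal sketch of streaming examples from one of WIT's compressed TSV shards; the shard filename and column names are assumptions, so consult the repository linked above for the actual schema.

```python
import csv
import gzip

def iter_wit(path, lang="en"):
    """Yield (image_url, caption) pairs for one language from a WIT shard.

    Column names ('language', 'image_url', 'caption_reference_description')
    are assumptions; verify them against the dataset documentation.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row.get("language") == lang:
                yield row.get("image_url"), row.get("caption_reference_description")

# Hypothetical shard name, for illustration only.
for url, caption in iter_wit("wit_v1.train.all-00000-of-00010.tsv.gz"):
    print(url, caption)
    break
```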