Lessons learned in multilingual grounded language learning
Recent work has shown how to learn better visual-semantic embeddings by
leveraging image descriptions in more than one language. Here, we investigate
in detail which conditions affect the performance of this type of grounded
language learning model. We show that multilingual training improves over
bilingual training, and that low-resource languages benefit from training with
higher-resource languages. We demonstrate that a multilingual model can be
trained equally well on either translations or comparable sentence pairs, and
that annotating the same set of images in multiple languages enables further
improvements via an additional caption-caption ranking objective.
Comment: CoNLL 2018
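To make the objectives above concrete, the following is a minimal sketch, assuming L2-normalised embeddings from some image and caption encoders, of a max-margin ranking loss over image-caption pairs in several languages, plus the caption-caption ranking term that becomes available when the same images are annotated in every language. It illustrates the general recipe, not the authors' released code; all names and shapes are assumptions.

import torch

def contrastive_loss(a, b, margin=0.2):
    # Max-margin ranking loss between two batches of L2-normalised embeddings;
    # matching pairs lie on the diagonal of the similarity matrix.
    scores = a @ b.t()                                  # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)                    # scores of matching pairs
    cost_a = (margin + scores - pos).clamp(min=0)       # rank b given a
    cost_b = (margin + scores - pos.t()).clamp(min=0)   # rank a given b
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_a.masked_fill(mask, 0).mean() + cost_b.masked_fill(mask, 0).mean()

def multilingual_loss(img_emb, caption_embs):
    # caption_embs: dict lang -> (B, D) embeddings of captions for the same images.
    loss = sum(contrastive_loss(img_emb, c) for c in caption_embs.values())
    embs = list(caption_embs.values())
    # The caption-caption term is only possible when the same images are
    # captioned in multiple languages, as the abstract notes.
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            loss = loss + contrastive_loss(embs[i], embs[j])
    return loss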
Limitations of Cross-Lingual Learning from Image Search
Cross-lingual representation learning is an important step in making NLP
scale to all the world's languages. Recent work on bilingual lexicon induction
suggests that it is possible to learn cross-lingual representations of words
based on similarities between images associated with these words. However, that
work focused on the translation of selected nouns only. In our work, we
investigate whether the meaning of other parts-of-speech, in particular
adjectives and verbs, can be learned in the same way. We also experiment with
combining the representations learned from visual data with embeddings learned
from textual data. Our experiments across five language pairs indicate that
previous work does not scale to the problem of learning cross-lingual
representations beyond simple nouns.
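As a rough sketch of the image-pivot setup whose limits are probed here (assumed data structures, not the paper's code): each word is represented by the mean CNN feature of images retrieved for it, optionally concatenated with a textual embedding, and its translation is taken to be the nearest neighbour in the other language.

import numpy as np

def visual_word_embedding(image_features):
    # image_features: (n_images, D) CNN features for images retrieved for a word.
    v = image_features.mean(axis=0)
    return v / np.linalg.norm(v)

def induce_lexicon(src_words, tgt_words, src_imgs, tgt_imgs, src_txt=None, tgt_txt=None):
    S = np.stack([visual_word_embedding(src_imgs[w]) for w in src_words])
    T = np.stack([visual_word_embedding(tgt_imgs[w]) for w in tgt_words])
    if src_txt is not None:
        # Optionally fuse text-based embeddings with the visual ones.
        S = np.hstack([S, np.stack([src_txt[w] for w in src_words])])
        T = np.hstack([T, np.stack([tgt_txt[w] for w in tgt_words])])
        S /= np.linalg.norm(S, axis=1, keepdims=True)
        T /= np.linalg.norm(T, axis=1, keepdims=True)
    sims = S @ T.T                                      # cosine similarity matrix
    return {w: tgt_words[i] for w, i in zip(src_words, sims.argmax(axis=1))}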
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Modular vision-language models (Vision-LLMs) align pretrained image encoders
with (pretrained) large language models (LLMs), representing a computationally
much more efficient alternative to end-to-end training of large vision-language
models from scratch, which is prohibitively expensive for most. Vision-LLMs
instead post-hoc condition LLMs to 'understand' the output of an image encoder.
With the abundance of readily available high-quality English image-text data as
well as monolingual English LLMs, the research focus has been on English-only
Vision-LLMs. Multilingual vision-language models are still predominantly
obtained via expensive end-to-end pretraining, resulting in comparatively
smaller models, trained on limited multilingual image data supplemented with
text-only multilingual corpora. In this work, we present mBLIP, the first
multilingual Vision-LLM, which we obtain in a computationally efficient manner
-- on consumer hardware using only a few million training examples -- by
leveraging a pretrained multilingual LLM. To this end, we re-align an
image encoder previously tuned to an English LLM to a new, multilingual LLM --
for this, we leverage multilingual data from a mix of vision-and-language
tasks, which we obtain by machine-translating high-quality English data to 95
languages. On the IGLUE benchmark, mBLIP yields results competitive with
state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP
(zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to
these very large multilingual vision-language models trained from scratch, we
obtain mBLIP by training orders of magnitude fewer parameters on orders of
magnitude less data. We release our model and code at
https://github.com/gregor-ge/mBLIP.
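Schematically, the modular recipe described above boils down to training a small mapping between a frozen image encoder and a frozen (multilingual) LLM. The sketch below illustrates that general idea under assumed interfaces; it is not mBLIP's actual implementation (see the repository above for that).

import torch
import torch.nn as nn

class VisionLLMAlign(nn.Module):
    # A frozen image encoder and a frozen multilingual LLM, bridged by a small
    # trainable projection: only the bridge is (re-)aligned during training.
    def __init__(self, image_encoder, llm, vis_dim, llm_dim):
        super().__init__()
        self.image_encoder = image_encoder.eval().requires_grad_(False)
        self.llm = llm.eval().requires_grad_(False)
        self.project = nn.Linear(vis_dim, llm_dim)      # the only trained part

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():
            vis_tokens = self.image_encoder(pixel_values)   # (B, N, vis_dim)
        vis_embeds = self.project(vis_tokens)               # (B, N, llm_dim)
        # Prepend projected visual tokens so the frozen LLM conditions on them;
        # assumes an HF-style LLM that accepts precomputed input embeddings.
        inputs = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)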
Translation-Enhanced Multilingual Text-to-Image Generation
Research on text-to-image generation (TTI) still predominantly focuses on the
English language due to the lack of annotated image-caption data in other
languages; in the long run, this might widen inequitable access to TTI
technology. In this work, we thus investigate multilingual TTI (termed mTTI)
and the current potential of neural machine translation (NMT) to bootstrap mTTI
systems. We provide two key contributions. 1) Relying on a multilingual
multi-modal encoder, we provide a systematic empirical study of standard
methods used in cross-lingual NLP when applied to mTTI: Translate Train,
Translate Test, and Zero-Shot Transfer. 2) We propose Ensemble Adapter (EnsAd),
a novel parameter-efficient approach that learns to weigh and consolidate the
multilingual text knowledge within the mTTI framework, mitigating the language
gap and thus improving mTTI performance. Our evaluations on standard mTTI
datasets COCO-CN, Multi30K Task2, and LAION-5B demonstrate the potential of
translation-enhanced mTTI systems and also validate the benefits of the
proposed EnsAd which derives consistent gains across all datasets. Further
investigations on model variants, ablation studies, and qualitative analyses
provide additional insights on the inner workings of the proposed mTTI
approaches.
Comment: ACL 2023 (Main)
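As an illustration of the ensemble idea (a simplified sketch, not the paper's EnsAd implementation): given encodings of a caption and its NMT translations, a small attention module learns to weigh and consolidate them into one conditioning vector for the generator.

import torch
import torch.nn as nn

class EnsembleAdapter(nn.Module):
    # Learns to weigh K text encodings (original caption + translations) and
    # consolidate them into a single conditioning vector.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, encodings):                       # (B, K, D)
        weights = self.score(encodings).softmax(dim=1)  # (B, K, 1) attention over variants
        return (weights * encodings).sum(dim=1)         # (B, D) fused text condition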
MultiSubs: A Large-scale Multimodal and Multilingual Dataset
This paper introduces a large-scale multimodal and multilingual dataset that
aims to facilitate research on grounding words to images in their contextual
usage in language. The dataset consists of images selected to unambiguously
illustrate concepts expressed in sentences from movie subtitles. The dataset is
a valuable resource as (i) the images are aligned to text fragments rather than
whole sentences; (ii) multiple images are possible for a text fragment and a
sentence; (iii) the sentences are free-form and real-world-like; (iv) the
parallel texts are multilingual. We set up a fill-in-the-blank game for humans
to evaluate the quality of the automatic image selection process of our
dataset. We show the utility of the dataset on two automatic tasks: (i)
fill-in-the-blank; (ii) lexical translation. Results of the human evaluation
and automatic models demonstrate that images can be a useful complement to the
textual context. The dataset will benefit research on visual grounding of words
especially in the context of free-form sentences, and can be obtained from
https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.
Comment: Manuscript update: (i) Added links to the dataset and evaluation
toolkit; (ii) Section 6.1.4: Added random and n-gram baselines to the
fill-in-the-blank task, and added further discussion at the end of the
section; (iii) Section 6.2.3: Further elaboration on the ALI metric; (iv)
Section 6.2.4: Corrected results for the lexical translation task (Table 8),
and updated the discussions accordingly.
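To make the fill-in-the-blank evaluation concrete, here is a minimal sketch of the task loop under an assumed data layout; the field names and the score_fn scorer (e.g. an n-gram baseline or a multimodal model) are placeholders, not the dataset's actual API.

def fill_in_the_blank_accuracy(examples, candidates, score_fn, use_image=True):
    # examples: iterable of {"masked": str, "answer": str, "image": ...} records.
    correct = 0
    for ex in examples:
        image = ex["image"] if use_image else None      # image as optional context
        scores = {c: score_fn(ex["masked"], c, image) for c in candidates}
        if max(scores, key=scores.get) == ex["answer"]:
            correct += 1
    return correct / len(examples)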
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Bilingual lexicon induction, translating words from the source language to
the target language, is a long-standing natural language processing task.
Recent work suggests that it is promising to employ images as a pivot to learn
lexicon induction without relying on parallel corpora. However, these
vision-based approaches simply associate words with entire images, so they are
constrained to translating concrete words and require object-centered images.
Humans, by contrast, understand words better when they appear in a sentence
with context. Therefore, in this paper, we propose to utilize images and their
associated captions to address the limitations of previous approaches. We
propose a multi-lingual caption model trained with different mono-lingual
multimodal data to map words in different languages into joint spaces. Two
types of word representation are induced from the multi-lingual caption model:
linguistic features and localized visual features. The linguistic feature is
learned from sentence contexts under visual-semantic constraints, which helps
to learn translations for words that are less visually grounded. The localized
visual feature attends to the region of the image that correlates with the
word, relaxing the requirement that images be object-centered with a salient
visual subject. The two types of features are complementary for word
translation. Experimental results on multiple language pairs demonstrate the
effectiveness of our proposed method, which substantially outperforms previous
vision-based approaches without using any parallel sentences or supervision of
seed word pairs.
Comment: Accepted by AAAI 2019
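As a simplified illustration of how the two induced representations might be fused at translation time (assumed feature dictionaries, not the authors' exact method): similarities are computed separately in the linguistic and localized-visual spaces and interpolated before nearest-neighbour retrieval.

import numpy as np

def translate(word, src_ling, src_vis, tgt_ling, tgt_vis, tgt_words, alpha=0.5):
    # Cosine similarity of a query vector against a stack of target vectors.
    def cos(u, V):
        V = V / np.linalg.norm(V, axis=1, keepdims=True)
        return V @ (u / np.linalg.norm(u))
    ling_sim = cos(src_ling[word], np.stack([tgt_ling[w] for w in tgt_words]))
    vis_sim = cos(src_vis[word], np.stack([tgt_vis[w] for w in tgt_words]))
    fused = alpha * ling_sim + (1 - alpha) * vis_sim    # complementary features
    return tgt_words[int(fused.argmax())]               # nearest neighbour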
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
The milestone improvements brought about by deep representation learning and
pre-training techniques have led to large performance gains across downstream
NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large
high-quality visio-linguistic datasets for learning complementary information
(across image and text modalities). In this paper, we introduce the
Wikipedia-based Image Text (WIT) Dataset
(https://github.com/google-research-datasets/wit) to better facilitate
multimodal, multilingual learning. WIT is composed of a curated set of 37.6
million entity-rich image-text examples with 11.5 million unique images across
108 Wikipedia languages. Its size enables WIT to be used as a pretraining
dataset for multimodal models, as we show when applied to downstream tasks such
as image-text retrieval. WIT has four main and unique advantages. First, WIT is
the largest multimodal dataset by number of image-text examples, 3x larger than
the next largest (at the time of writing). Second, WIT is massively
multilingual (the first of its kind), covering 100+ languages (each with at
least 12K examples) and
provides cross-lingual texts for many images. Third, WIT represents a more
diverse set of concepts and real world entities relative to what previous
datasets cover. Lastly, WIT provides a very challenging real-world test set, as
we empirically illustrate using an image-text retrieval task as an example
- …
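For reference, image-text retrieval of the kind used in this evaluation is commonly scored with recall@K. A minimal sketch over assumed precomputed embeddings, where row i and column i form a matching pair:

import numpy as np

def recall_at_k(img_emb, txt_emb, k=10):
    # Normalise, score all pairs, and check whether the matching text for each
    # image appears among its top-k most similar texts.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                                  # (N, N) cosine similarities
    ranks = (-sims).argsort(axis=1)                     # best-matching texts first
    hits = (ranks[:, :k] == np.arange(len(img))[:, None]).any(axis=1)
    return hits.mean()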