ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Recent cross-lingual cross-modal works attempt to extend Vision-Language
Pre-training (VLP) models to non-English inputs and achieve impressive
performance. However, these models focus only on understanding tasks and rely on
encoder-only architectures. In this paper, we propose ERNIE-UniX2, a unified
cross-lingual cross-modal pre-training framework for both generation and
understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms
(e.g., contrastive learning and language modeling) on top of an encoder-decoder
architecture and attempts to learn a better joint representation across
languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned
for a variety of downstream generation and understanding tasks. Pre-trained on
both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA
results on various cross-lingual cross-modal generation and understanding tasks
such as multimodal machine translation and multilingual visual question
answering.
Comment: 13 pages, 2 figures
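The abstract describes combining a contrastive objective with a language-modeling objective on an encoder-decoder backbone. Below is a minimal PyTorch-style sketch of how such a joint loss could be composed; the function and tensor names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: image-text contrastive loss + seq2seq LM loss, as one joint
# pre-training objective. All names here are illustrative, not ERNIE-UniX2 code.
import torch
import torch.nn.functional as F

def joint_pretraining_loss(img_emb, txt_emb, decoder_logits, target_ids,
                           temperature=0.07, pad_id=0):
    """img_emb, txt_emb: (B, D) pooled embeddings from the shared encoder.
    decoder_logits: (B, T, V) decoder outputs for the generation objective.
    target_ids: (B, T) gold token ids (e.g., a translation or caption)."""
    # Image-text contrastive loss (InfoNCE over the in-batch similarity matrix).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    # Language-modeling loss on the decoder side.
    lm = F.cross_entropy(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                         target_ids.reshape(-1), ignore_index=pad_id)
    return contrastive + lm
```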
Unsupervised Cross-lingual Image Captioning
Most recent image captioning works are conducted in English as the majority
of image-caption datasets are in English. However, there is a large number of
non-native English speakers worldwide. Generating image captions in different
languages is worth exploring. In this paper, we present a novel unsupervised
method to generate image captions without using any caption corpus. Our method
relies on 1) cross-lingual auto-encoding, which learns the scene graph
mapping function along with the scene graph encoders and sentence decoders on
machine translation parallel corpora, and 2) unsupervised feature mapping,
which seeks to map the encoded scene graph features from image modality to
sentence modality. By leveraging cross-lingual auto-encoding, cross-modal
feature mapping, and adversarial learning, our method can learn an image
captioner to generate captions in different languages. We verify the
effectiveness of our proposed method on Chinese image caption generation.
The comparisons against several baseline methods demonstrate the effectiveness
of our approach.
Comment: 8 pages
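The adversarial cross-modal feature mapping described above can be pictured with a small sketch: a mapper network pushes image-side scene-graph features toward the sentence-side feature space while a discriminator tries to tell the two apart. The module names and dimensions below are assumptions for illustration only.

```python
# Hedged sketch of cross-modal feature mapping trained adversarially.
# Not the paper's implementation; feat_dim and architectures are assumed.
import torch
import torch.nn as nn

feat_dim = 512  # assumed scene-graph feature size

mapper = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                       nn.Linear(feat_dim, feat_dim))
discriminator = nn.Sequential(nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_step(img_sg_feats, txt_sg_feats):
    """img_sg_feats / txt_sg_feats: (B, feat_dim) scene-graph features encoded
    from images and from sentences, respectively."""
    mapped = mapper(img_sg_feats)
    # Discriminator: separate real sentence-modality features from mapped ones.
    d_loss = bce(discriminator(txt_sg_feats), torch.ones(txt_sg_feats.size(0), 1)) + \
             bce(discriminator(mapped.detach()), torch.zeros(mapped.size(0), 1))
    # Mapper: fool the discriminator so mapped features look like sentence features.
    g_loss = bce(discriminator(mapped), torch.ones(mapped.size(0), 1))
    return d_loss, g_loss
```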
Object-Centric Unsupervised Image Captioning
Image captioning is a longstanding problem in the field of computer vision
and natural language processing. To date, researchers have produced impressive
state-of-the-art performance in the age of deep learning. Most of these
state-of-the-art methods, however, require large volumes of annotated
image-caption pairs to train their models. Given an image dataset of interest, a
practitioner needs to annotate a caption for each image in the training set, and
this process must be repeated for each newly collected image dataset. In
this paper, we explore the task of unsupervised image captioning which utilizes
unpaired images and texts to train the model so that the texts can come from
different sources than the images. A main school of research on this topic that
has been shown to be effective is to construct pairs from the images and texts
in the training set according to their overlap of objects. Unlike in the
supervised setting, these constructed pairings are, however, not guaranteed to
have a fully overlapping set of objects. Our work in this paper overcomes this by
harvesting objects corresponding to a given sentence from the training set,
even if they don't belong to the same image. When used as input to a
transformer, such a mixture of objects enables larger, if not full, object
coverage and, when supervised by the corresponding sentence, produces results
that outperform current state-of-the-art unsupervised methods by a significant
margin. Building upon this finding, we further show that (1) additional
information on relationship between objects and attributes of objects also
helps in boosting performance; and (2) our method also extends well to
non-English image captioning, which usually suffers from a scarcer level of
annotations. Our findings are supported by strong empirical results. Our code
is available at https://github.com/zihangm/obj-centric-unsup-caption.
Comment: ECCV 2022
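The core harvesting step described in this abstract, collecting object features for a sentence from anywhere in the training set, can be sketched as a simple lookup over a label-to-features index. The index structure and function below are hypothetical, not the code in the linked repository.

```python
# Hedged sketch of object harvesting: for each sentence, gather detector
# features matching the objects it mentions, drawn from any image in the
# unpaired collection so object coverage can approach 100%.
import random

def harvest_objects(sentence_objects, object_index, per_object=1):
    """sentence_objects: list of object labels parsed from one sentence.
    object_index: dict mapping object label -> list of detector features
                  pooled over the whole (unpaired) image collection."""
    harvested = []
    for label in sentence_objects:
        candidates = object_index.get(label, [])
        if candidates:
            harvested.extend(random.sample(candidates,
                                           min(per_object, len(candidates))))
    # The harvested mixture is fed to the captioning transformer and
    # supervised by the originating sentence.
    return harvested
```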
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
The milestone improvements brought about by deep representation learning and
pre-training techniques have led to large performance gains across downstream
NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large
high-quality visio-linguistic datasets for learning complementary information
(across image and text modalities). In this paper, we introduce the
Wikipedia-based Image Text (WIT) Dataset
(https://github.com/google-research-datasets/wit) to better facilitate
multimodal, multilingual learning. WIT is composed of a curated set of 37.6
million entity-rich image-text examples with 11.5 million unique images across
108 Wikipedia languages. Its size enables WIT to be used as a pretraining
dataset for multimodal models, as we show when applied to downstream tasks such
as image-text retrieval. WIT has four main and unique advantages. First, WIT is
the largest multimodal dataset by number of image-text examples, by a factor of three (at
the time of writing). Second, WIT is massively multilingual (first of its kind)
with coverage over 100+ languages (each of which has at least 12K examples) and
provides cross-lingual texts for many images. Third, WIT represents a more
diverse set of concepts and real-world entities relative to what previous
datasets cover. Lastly, WIT provides a very challenging real-world test set, as
we empirically illustrate using an image-text retrieval task as an example.
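WIT is released as tab-separated files from the linked repository; a minimal sketch of iterating one gzipped shard and keeping a single language's image-caption pairs might look like the following. The column names used here (language, image_url, caption_reference_description) are my recollection of the released schema and should be checked against the repository before use.

```python
# Hedged sketch: stream one WIT TSV shard and keep one language's examples.
# File layout and column names are assumptions to be verified against the repo.
import csv
import gzip

def iter_wit_examples(tsv_gz_path, lang="de"):
    with gzip.open(tsv_gz_path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            if row.get("language") == lang and row.get("caption_reference_description"):
                yield row["image_url"], row["caption_reference_description"]
```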
X-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Vision language pre-training aims to learn alignments between vision and
language from a large amount of data. We proposed multi-grained vision-language
pre-training, a unified approach that can learn vision-language alignments at
multiple levels of granularity. This paper advances that method by unifying image
and video encoding in one model and scaling up the model with large-scale data.
We present X-VLM, a pre-trained VLM with a modular architecture for both
image-text tasks and video-text tasks. Experimental results show that X-VLM
performs best at both base and large scales for image-text and video-text tasks,
striking a good trade-off between performance and model scale. Moreover, we show
that the modular design of X-VLM yields high transferability, allowing it to be
utilized in any language or domain. For example, by simply
replacing the text encoder with XLM-R, X-VLM outperforms state-of-the-art
multilingual multi-modal pre-trained models without any multilingual
pre-training. The code and pre-trained models will be available at
github.com/zengyan-97/X2-VLM.
Comment: 21 pages, 8 figures
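The claim that the text encoder can simply be swapped for XLM-R suggests a modular layout where the text tower is a drop-in pre-trained module. The sketch below illustrates that idea under assumed names; it is not the architecture from github.com/zengyan-97/X2-VLM.

```python
# Hedged sketch of a modular VLM whose text tower is swapped for XLM-R.
# Class and argument names are assumptions made for illustration.
import torch.nn as nn
from transformers import AutoModel

class DualTowerVLM(nn.Module):
    def __init__(self, vision_encoder, text_model_name="xlm-roberta-base", dim=256):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a ViT returning pooled features
        # Replacing the text encoder is a one-line change thanks to the modular design.
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, dim)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Project the first-token (CLS-style) embedding into the shared space.
        return self.text_proj(out.last_hidden_state[:, 0])
```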