150 research outputs found
Unsupervised Cross-lingual Image Captioning
Most recent image captioning works are conducted in English as the majority
of image-caption datasets are in English. However, there are a large amount of
non-native English speakers worldwide. Generating image captions in different
languages is worth exploring. In this paper, we present a novel unsupervised
method to generate image captions without using any caption corpus. Our method
relies on 1) a cross-lingual auto-encoding, which learns the scene graph
mapping function along with the scene graph encoders and sentence decoders on
machine translation parallel corpora, and 2) an unsupervised feature mapping,
which seeks to map the encoded scene graph features from image modality to
sentence modality. By leveraging cross-lingual auto-encoding, cross-modal
feature mapping, and adversarial learning, our method can learn an image
captioner to generate captions in different languages. We verify the
effectiveness of our proposed method on the Chinese image caption generation.
The comparisons against several baseline methods demonstrate the effectiveness
of our approach.Comment: 8 page
Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning
Developing artificial learning systems that can understand and generate natural language has been one of the long-standing goals of artificial intelligence. Recent decades have witnessed an impressive progress on both of these problems, giving rise to a new family of approaches. Especially, the advances in deep learning over the past couple of years have led to neural approaches to natural language generation (NLG). These methods combine generative language learning techniques with neural-networks based frameworks. With a wide range of applications in natural language processing, neural NLG (NNLG) is a new and fast growing field of research. In this state-of-the-art report, we investigate the recent developments and applications of NNLG in its full extent from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability and learning strategies. We summarize the fundamental building blocks of NNLG approaches from these aspects and provide detailed reviews of commonly used preprocessing steps and basic neural architectures. This report also focuses on the seminal applications of these NNLG models such as machine translation, description generation, automatic speech recognition, abstractive summarization, text simplification, question answering and generation, and dialogue generation. Finally, we conclude with a thorough discussion of the described frameworks by pointing out some open research directions.This work has been partially supported by the European Commission ICT COST Action âMulti-task, Multilingual, Multi-modal Language Generationâ (CA18231). AE was supported by BAGEP 2021 Award of the Science Academy. EE was supported in part by TUBA GEBIP 2018 Award. BP is in in part funded by Independent Research Fund Denmark (DFF) grant 9063-00077B. IC has received funding from the European Unionâs Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 838188. EL is partly funded by Generalitat Valenciana and the Spanish Government throught projects PROMETEU/2018/089 and RTI2018-094649-B-I00, respectively. SMI is partly funded by UNIRI project uniri-drustv-18-20. GB is partly supported by the Ministry of Innovation and the National Research, Development and Innovation Office within the framework of the Hungarian Artificial Intelligence National Laboratory Programme. COT is partially funded by the Romanian Ministry of European Investments and Projects through the Competitiveness Operational Program (POC) project âHOLOTRAINâ (grant no. 29/221 ap2/07.04.2020, SMIS code: 129077) and by the German Academic Exchange Service (DAAD) through the project âAWAKEN: content-Aware and netWork-Aware faKE News mitigationâ (grant no. 91809005). ESA is partially funded by the German Academic Exchange Service (DAAD) through the project âDeep-Learning Anomaly Detection for Human and Automated Users Behaviorâ (grant no. 91809358)
Object-Centric Unsupervised Image Captioning
Image captioning is a longstanding problem in the field of computer vision
and natural language processing. To date, researchers have produced impressive
state-of-the-art performance in the age of deep learning. Most of these
state-of-the-art, however, requires large volume of annotated image-caption
pairs in order to train their models. When given an image dataset of interests,
practitioner needs to annotate the caption for each image in the training set
and this process needs to happen for each newly collected image dataset. In
this paper, we explore the task of unsupervised image captioning which utilizes
unpaired images and texts to train the model so that the texts can come from
different sources than the images. A main school of research on this topic that
has been shown to be effective is to construct pairs from the images and texts
in the training set according to their overlap of objects. Unlike in the
supervised setting, these constructed pairings are however not guaranteed to
have fully overlapping set of objects. Our work in this paper overcomes this by
harvesting objects corresponding to a given sentence from the training set,
even if they don't belong to the same image. When used as input to a
transformer, such mixture of objects enables larger if not full object
coverage, and when supervised by the corresponding sentence, produced results
that outperform current state of the art unsupervised methods by a significant
margin. Building upon this finding, we further show that (1) additional
information on relationship between objects and attributes of objects also
helps in boosting performance; and (2) our method also extends well to
non-English image captioning, which usually suffers from a scarcer level of
annotations. Our findings are supported by strong empirical results. Our code
is available at https://github.com/zihangm/obj-centric-unsup-caption.Comment: ECCV 202
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
Training supervised video captioning model requires coupled video-caption
pairs. However, for many targeted languages, sufficient paired data are not
available. To this end, we introduce the unpaired video captioning task aiming
to train models without coupled video-caption pairs in target language. To
solve the task, a natural choice is to employ a two-step pipeline system: first
utilizing video-to-pivot captioning model to generate captions in pivot
language and then utilizing pivot-to-target translation model to translate the
pivot captions to the target language. However, in such a pipeline system, 1)
visual information cannot reach the translation model, generating visual
irrelevant target captions; 2) the errors in the generated pivot captions will
be propagated to the translation model, resulting in disfluent target captions.
To address these problems, we propose the Unpaired Video Captioning with Visual
Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module
(VIM), which aligns source visual and target language domains to inject the
source visual information into the target language domain. Meanwhile, VIM
directly connects the encoder of the video-to-pivot model and the decoder of
the pivot-to-target model, allowing end-to-end inference by completely skipping
the generation of pivot captions. To enhance the cross-modality injection of
the VIM, UVC-VI further introduces a pluggable video encoder, i.e., Multimodal
Collaborative Encoder (MCE). The experiments show that UVC-VI outperforms
pipeline systems and exceeds several supervised systems. Furthermore, equipping
existing supervised systems with our MCE can achieve 4% and 7% relative margins
on the CIDEr scores to current state-of-the-art models on the benchmark MSVD
and MSR-VTT datasets, respectively.Comment: Published at IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures adopted in which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them.Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118
pages, 8 figures, 1 tabl
MULE: Multimodal Universal Language Embedding
Existing vision-language methods typically support two languages at a time at
most. In this paper, we present a modular approach which can easily be
incorporated into existing vision-language methods in order to support many
languages. We accomplish this by learning a single shared Multimodal Universal
Language Embedding (MULE) which has been visually-semantically aligned across
all languages. Then we learn to relate MULE to visual data as if it were a
single language. Our method is not architecture specific, unlike prior work
which typically learned separate branches for each language, enabling our
approach to easily be adapted to many vision-language methods and tasks. Since
MULE learns a single language branch in the multimodal model, we can also scale
to support many languages, and languages with fewer annotations can take
advantage of the good representation learned from other (more abundant)
language data. We demonstrate the effectiveness of MULE on the bidirectional
image-sentence retrieval task, supporting up to four languages in a single
model. In addition, we show that Machine Translation can be used for data
augmentation in multilingual learning, which, combined with MULE, improves mean
recall by up to 21.9% on a single-language compared to prior work, with the
most significant gains seen on languages with relatively few annotations. Our
code is publicly available.Comment: Accepted as an oral at AAAI 202
- âŠ