Unsupervised Cross-lingual Image Captioning
Most recent image captioning work has been conducted in English, as the majority
of image-caption datasets are in English. However, there is a large number of
non-native English speakers worldwide, so generating image captions in different
languages is worth exploring. In this paper, we present a novel unsupervised
method to generate image captions without using any caption corpus. Our method
relies on 1) cross-lingual auto-encoding, which learns the scene graph mapping
function along with the scene graph encoders and sentence decoders on machine
translation parallel corpora, and 2) unsupervised feature mapping, which seeks
to map the encoded scene graph features from the image modality to the sentence
modality. By leveraging cross-lingual auto-encoding, cross-modal feature
mapping, and adversarial learning, our method can learn an image captioner that
generates captions in different languages. We verify the effectiveness of our
proposed method on Chinese image caption generation. Comparisons against
several baseline methods demonstrate the effectiveness of our approach.
Comment: 8 pages
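To make the cross-modal feature mapping step concrete: the paper learns this mapping with adversarial training, but the core idea of aligning unpaired image-modality and sentence-modality feature distributions can be sketched with a simpler, closed-form stand-in that matches the first two moments of the two feature sets. Everything below (feature dimensions, data) is a hypothetical toy setup, not the paper's actual model.

```python
import numpy as np

def moment_matching_map(X, Y, eps=1e-6):
    """Fit a linear map (W, b) so that X @ W.T + b has approximately the
    same mean and covariance as Y. X: (n, d) image-modality features,
    Y: (m, d) sentence-modality features; no pairing between rows is used,
    which is what makes this a stand-in for unsupervised alignment."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    cov_x = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    cov_y = np.cov(Y, rowvar=False) + eps * np.eye(Y.shape[1])

    def sqrtm(C):  # symmetric PSD square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

    def inv_sqrtm(C):  # inverse square root, eps-clipped for stability
        w, V = np.linalg.eigh(C)
        return (V / np.sqrt(np.clip(w, eps, None))) @ V.T

    W = sqrtm(cov_y) @ inv_sqrtm(cov_x)  # W cov_x W^T == cov_y
    b = mu_y - W @ mu_x
    return W, b

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))  # "image" features
Y = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))  # "sentence" features
W, b = moment_matching_map(X, Y)
Z = X @ W.T + b  # image features mapped into the sentence-feature space
```

An adversarial mapper as in the paper would go further and match the full distributions (not just mean and covariance) by training `W` against a discriminator.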
Object-Centric Unsupervised Image Captioning
Image captioning is a longstanding problem at the intersection of computer
vision and natural language processing. To date, researchers have achieved
impressive state-of-the-art performance in the age of deep learning. Most of
these state-of-the-art methods, however, require a large volume of annotated
image-caption pairs to train their models. Given an image dataset of interest,
a practitioner needs to annotate a caption for each image in the training set,
and this process must be repeated for each newly collected image dataset. In
this paper, we explore the task of unsupervised image captioning, which uses
unpaired images and texts to train the model, so that the texts can come from
sources other than the images. One main school of research on this topic that
has been shown to be effective is to construct pairs from the images and texts
in the training set according to their overlap of objects. Unlike in the
supervised setting, however, these constructed pairings are not guaranteed to
have a fully overlapping set of objects. Our work overcomes this by harvesting
the objects corresponding to a given sentence from across the training set,
even if they do not belong to the same image. When used as input to a
transformer, such a mixture of objects enables larger, if not full, object
coverage, and when supervised by the corresponding sentence, produces results
that outperform current state-of-the-art unsupervised methods by a significant
margin. Building upon this finding, we further show that (1) additional
information on the relationships between objects and the attributes of objects
also helps boost performance; and (2) our method extends well to non-English
image captioning, which usually suffers from scarcer annotations. Our findings
are supported by strong empirical results. Our code is available at
https://github.com/zihangm/obj-centric-unsup-caption.
Comment: ECCV 202
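The object-harvesting idea above can be sketched in a few lines: build an index from object labels to detected object features over the whole training set, then, for a given sentence, collect features for every object it mentions, regardless of which image they came from. The detections, labels, and feature vectors below are hypothetical toy data, not the paper's actual detector outputs.

```python
from collections import defaultdict

# Toy detector output: image_id -> list of (object_label, feature_vector).
detections = {
    "img1": [("dog", [0.1, 0.2]), ("ball", [0.3, 0.1])],
    "img2": [("cat", [0.5, 0.4]), ("sofa", [0.2, 0.9])],
    "img3": [("dog", [0.15, 0.22]), ("frisbee", [0.7, 0.3])],
}

# Invert to an object-label -> features index over the whole training set.
object_index = defaultdict(list)
for img_id, objs in detections.items():
    for label, feat in objs:
        object_index[label].append(feat)

def harvest_objects(sentence, index=object_index):
    """Collect features for every known object word in the sentence,
    drawing from any training image, not necessarily the same one.
    The harvested features would be fed to the transformer as inputs."""
    feats = []
    for word in sentence.lower().split():
        feats.extend(index.get(word, []))
    return feats

feats = harvest_objects("a dog chases a frisbee")
# "dog" is harvested from both img1 and img3, "frisbee" from img3 alone,
# even though no single training image contains both objects.
```

This is exactly the property the paper exploits: the constructed object set can cover the sentence fully even when no single image does.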
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
Training a supervised video captioning model requires coupled video-caption
pairs. However, for many target languages, sufficient paired data are not
available. To this end, we introduce the unpaired video captioning task, which
aims to train models without coupled video-caption pairs in the target
language. A natural way to solve the task is a two-step pipeline system: first
use a video-to-pivot captioning model to generate captions in a pivot
language, and then use a pivot-to-target translation model to translate the
pivot captions into the target language. However, in such a pipeline system,
1) visual information cannot reach the translation model, which produces
visually irrelevant target captions; and 2) errors in the generated pivot
captions are propagated to the translation model, resulting in disfluent
target captions. To address these problems, we propose the Unpaired Video
Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the
Visual Injection Module (VIM), which aligns the source visual and target
language domains to inject source visual information into the target language
domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot
model to the decoder of the pivot-to-target model, allowing end-to-end
inference that completely skips the generation of pivot captions. To enhance
the cross-modality injection of the VIM, UVC-VI further introduces a pluggable
video encoder, the Multimodal Collaborative Encoder (MCE). Experiments show
that UVC-VI outperforms pipeline systems and even exceeds several supervised
systems. Furthermore, equipping existing supervised systems with our MCE
achieves 4% and 7% relative margins on CIDEr scores over current
state-of-the-art models on the benchmark MSVD and MSR-VTT datasets,
respectively.
Comment: Published at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
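The structural difference between the two-step pipeline and UVC-VI's direct encoder-to-decoder connection can be sketched with placeholder modules. All function names below are hypothetical stand-ins for the real neural components; strings stand in for tensors so the data flow is visible.

```python
# Hypothetical stand-ins for the real neural modules.
def video_to_pivot_encoder(video):
    """Encode the video into a latent representation."""
    return {"latent": f"enc({video})"}

def pivot_decoder(latent):
    """Decode the latent into a pivot-language caption (pipeline only)."""
    return f"pivot_caption({latent['latent']})"

def pivot_to_target_translator(pivot_caption):
    """Translate a pivot caption; note it never sees the video."""
    return f"target({pivot_caption})"

def pivot_to_target_decoder(latent):
    """Target-language decoder fed directly with the (aligned) latent."""
    return f"target_caption({latent['latent']})"

def pipeline(video):
    # Two-step pipeline: errors in the intermediate pivot caption
    # propagate, and visual information never reaches the translator.
    latent = video_to_pivot_encoder(video)
    pivot = pivot_decoder(latent)
    return pivot_to_target_translator(pivot)

def uvc_vi(video):
    # UVC-VI-style inference: the encoder output (which VIM would align
    # to the target language domain) feeds the target decoder directly,
    # skipping pivot caption generation entirely.
    latent = video_to_pivot_encoder(video)
    return pivot_to_target_decoder(latent)

out_pipeline = pipeline("v0")   # latent -> pivot caption -> translation
out_direct = uvc_vi("v0")       # latent -> target caption
```

The sketch makes the two claimed failure modes visible: in `pipeline`, the translator only ever receives the pivot string, so any error in it is inherited and no visual signal survives.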
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Multimodality representation learning, a technique for learning to embed
information from different modalities and their correlations, has achieved
remarkable success in a variety of applications, such as Visual Question
Answering (VQA), Natural Language for Visual Reasoning (NLVR), and
Vision-Language Retrieval (VLR). In these applications, cross-modal
interaction and complementary information from different modalities are
crucial for advanced models to understand, recognize, retrieve, or generate
optimally. Researchers have proposed diverse methods to address these tasks,
and different variants of transformer-based architectures have performed
extraordinarily well across multiple modalities. This survey presents a
comprehensive review of the literature on the evolution and enhancement of
deep learning multimodal architectures that handle textual, visual, and audio
features for diverse cross-modal and modern multimodal tasks. This study
summarizes (i) recent task-specific deep learning methodologies, (ii)
pretraining types and multimodal pretraining objectives, (iii) state-of-the-art
pretrained multimodal approaches up to unifying architectures, and (iv)
multimodal task categories and possible future improvements for better
multimodal learning. Moreover, we provide a dataset section for new
researchers that covers most of the benchmarks for pretraining and
fine-tuning. Finally, major challenges, gaps, and potential research topics
are explored. A constantly updated paper list related to our survey is
maintained at https://github.com/marslanm/multimodality-representation-learning
Enforcing constraints for multi-lingual and cross-lingual speech-to-text systems
The recent development of neural network-based automatic speech recognition (ASR) systems has greatly reduced the state-of-the-art phone error rates in several languages. However, when an ASR system trained on one language tries to recognize speech from another language, such a system usually fails, even when the two languages come from the same language family. The above scenario poses a problem for low-resource languages. Such languages usually do not have enough paired data for training a moderately-sized ASR model and thus require either cross-lingual adaptation or zero-shot recognition.
Due to the increasing interest in bringing ASR technology to low-resource languages, the cross-lingual adaptation of end-to-end speech recognition systems has recently received more attention. However, little analysis has been done to understand how a model learns a shared representation across languages and how language-dependent representations can be fine-tuned to improve the system’s performance. We compare a bi-lingual CTC model with language-specific tuning at the earlier LSTM layers to one without such tuning, in order to understand whether having language-independent pathways in the model helps with multi-lingual learning and why. We first train the network on Dutch and then transfer the system to English under the bi-lingual CTC loss. After that, the representations from the two networks are visualized. The results show that the consonants of the two languages are learned very well under a shared mapping, but that vowels benefit significantly when further language-dependent transformations are applied before the last classification layer. These results can serve as a guide for designing multilingual and cross-lingual end-to-end systems in the future.
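The parameter-sharing structure being compared above can be sketched as follows: a lower transform that is selected per language, followed by a shared phone classifier. The layer shapes, feature sizes, and single-layer transforms are hypothetical simplifications of the actual LSTM-based CTC model.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN, D_HID, N_PHONES = 40, 32, 50  # hypothetical feature/phone sizes

# Language-specific lower layers (one per language), shared upper layer,
# mirroring "language-specific tuning at earlier layers".
lang_specific = {lang: rng.normal(scale=0.1, size=(D_HID, D_IN))
                 for lang in ("nl", "en")}
shared_classifier = rng.normal(scale=0.1, size=(N_PHONES, D_HID))

def encode(frames, lang):
    """frames: (T, D_IN) acoustic features -> (T, N_PHONES) phone logits.
    The early transform depends on the language ID, while the phone
    classifier is shared across both languages."""
    h = np.tanh(frames @ lang_specific[lang].T)  # language-dependent path
    return h @ shared_classifier.T               # shared phone mapping

frames = rng.normal(size=(100, D_IN))
logits_nl = encode(frames, "nl")
logits_en = encode(frames, "en")
```

The fully shared baseline in the comparison would simply use one lower layer for both languages; the experiment asks whether the per-language split pays off, and for which phone classes.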
However, creating specialized processing units in the neural network for each training language could yield increasingly large networks as the number of training languages grows. It is also unclear how to adapt such a system to zero-shot recognition. The remaining work adapts two existing constraints to the realm of multi-lingual and cross-lingual ASR. The first constraint is cycle-consistent training. This method defines a shared codebook of phonetic tokens for all training languages. Input speech first passes through the speech encoder of the ASR system and is quantized into discrete representations from the codebook. The discrete sequence representation is then passed through an auxiliary speech decoder to reconstruct the input speech. The framework constrains the reconstructed speech to be close to the original input. The second constraint is regret minimization training. It separates an ASR encoder into two parts: a feature extractor and a predictor. Regret minimization defines an additional regret term for each training sample as the difference between the losses of an auxiliary language-specific predictor given the real language ID and a fake language ID. This constraint encourages the feature extractor to learn an invariant speech-to-phone mapping across all languages and could potentially improve the model's generalization to new languages.
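The two constraints described above can be sketched numerically: nearest-neighbor quantization against a shared codebook with a reconstruction penalty for cycle-consistent training, and a simple loss difference for the regret term. The codebook size, dimensions, and the linear encoder/identity decoder are hypothetical toy stand-ins for the real networks.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, DIM = 16, 8  # hypothetical sizes

# Shared codebook of phonetic tokens used by all training languages.
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def quantize(encodings):
    """Map each encoder frame to its nearest codebook entry,
    giving the discrete phonetic-token representation."""
    d = ((encodings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(1)
    return ids, codebook[ids]

def cycle_consistency_loss(speech, encoder, decoder):
    """Encode -> quantize -> decode, then penalize the distance between
    the reconstruction and the original input speech."""
    _, quantized = quantize(encoder(speech))
    recon = decoder(quantized)
    return ((recon - speech) ** 2).mean()

def regret(loss_real_id, loss_fake_id):
    """Regret term: the auxiliary language-specific predictor's loss with
    the real language ID minus its loss with a fake language ID."""
    return loss_real_id - loss_fake_id

# Toy linear encoder and identity decoder stand-ins.
W_enc = rng.normal(scale=0.3, size=(DIM, DIM))
speech = rng.normal(size=(20, DIM))
loss = cycle_consistency_loss(speech,
                              encoder=lambda x: x @ W_enc,
                              decoder=lambda q: q)
```

Minimizing the regret pushes the feature extractor toward representations on which the choice of language ID no longer matters, which is the language-invariance property the thesis is after.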