Word-Region Alignment-Guided Multimodal Neural Machine Translation
We propose word-region alignment-guided multimodal neural machine translation (MNMT), a novel model for MNMT that links the semantic correlation between textual and visual modalities using word-region alignment (WRA). Existing studies on MNMT have mainly focused on the effect of integrating visual and textual modalities, but they do not leverage the semantic relevance between the two modalities. We strengthen the semantic correlation between textual and visual modalities in MNMT by incorporating WRA as a bridge. This proposal has been implemented on two mainstream architectures of neural machine translation (NMT): the recurrent neural network (RNN) and the transformer. Experiments on two public benchmarks, English--German and English--French translation tasks using the Multi30k dataset and English--Japanese translation tasks using the Flickr30kEnt-JP dataset, show that our model significantly improves over competitive baselines across different evaluation metrics and outperforms most existing MNMT models. For example, BLEU improves by 1.0 points for the English--German task and by 1.1 points for the English--French task on the Multi30k test2016 set, and by 0.7 points for the English--Japanese task on the Flickr30kEnt-JP test set. Further analysis demonstrates that our model achieves better translation performance by integrating WRA, leading to better use of visual information.
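The WRA bridge described above can be pictured as a word-to-region attention matrix. The following is a minimal numpy sketch under assumed shapes (the function name and the softmax normalization are illustrative, not the paper's actual implementation):

```python
import numpy as np

def word_region_alignment(word_emb, region_emb):
    """Illustrative word-region alignment (WRA) matrix.

    word_emb:   (num_words, d)   textual word representations
    region_emb: (num_regions, d) visual region features
    Returns a (num_words, num_regions) matrix whose rows are
    softmax-normalized over regions, so each word attends to regions.
    """
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    sim = w @ r.T                                  # cosine similarities
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-stochastic alignment

rng = np.random.default_rng(0)
A = word_region_alignment(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
# A has shape (5, 3); each row sums to 1
```

Such an alignment matrix could then condition either an RNN or a transformer decoder on region features weighted per source word.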
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment
This paper presents DetCLIPv2, an efficient and scalable training framework
that incorporates large-scale image-text pairs to achieve open-vocabulary
object detection (OVD). Unlike previous OVD frameworks that typically rely on a
pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via
a pseudo labeling process, DetCLIPv2 directly learns the fine-grained
word-region alignment from massive image-text pairs in an end-to-end manner. To
accomplish this, we employ a maximum word-region similarity between region
proposals and textual words to guide the contrastive objective. To enable the
model to gain localization capability while learning broad concepts, DetCLIPv2
is trained with a hybrid supervision from detection, grounding and image-text
pair data under a unified data formulation. By jointly training with an
alternating scheme and adopting low-resolution input for image-text pairs,
DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2
utilizes 13X more image-text pairs than DetCLIP with a similar training time
and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2
demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2
with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which
outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP,
respectively, and even beats its fully-supervised counterpart by a large
margin. Comment: Accepted to CVPR 2023.
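The "maximum word-region similarity" guiding the contrastive objective can be sketched in a few lines of numpy. This is a hedged illustration: the function names, the mean-of-max scoring, and the InfoNCE formulation are assumptions for clarity, not DetCLIPv2's actual training code:

```python
import numpy as np

def pair_score(word_emb, region_emb):
    """Image-text score: for each word, take its best-matching region
    proposal, then average over words (max word-region similarity)."""
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    sim = w @ r.T                        # (num_words, num_regions)
    return sim.max(axis=1).mean()        # best region per word, averaged

def contrastive_loss(texts, images):
    """Toy InfoNCE over a batch: matched pairs sit on the diagonal
    of the score matrix, mismatched pairs act as negatives."""
    scores = np.array([[pair_score(t, i) for i in images] for t in texts])
    logits = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()
```

Because the score needs only word and region embeddings, no pseudo-labels are required; alignment emerges end-to-end from the contrast between matched and mismatched pairs.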
Multimodal Neural Machine Translation based on Image-Text Semantic Correspondence
Tokyo Metropolitan University, Doctor of Philosophy (Information Science), doctoral thesis.
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
An important goal of computer vision is to build systems that learn visual
representations over time that can be applied to many tasks. In this paper, we
investigate a vision-language embedding as a core representation and show that
it leads to better cross-task transfer than standard multi-task learning. In
particular, the task of visual recognition is aligned to the task of visual
question answering by forcing each to use the same word-region embeddings. We
show this leads to greater inductive transfer from recognition to VQA than
standard multitask learning. Visual recognition also improves, especially for
categories that have relatively few recognition training labels but appear
often in the VQA setting. Thus, our paper takes a small step towards creating
more general vision systems by showing the benefit of interpretable, flexible,
and trainable core representations.Comment: Accepted in ICCV 2017. The arxiv version has an extra analysis on
correlation with human attention.
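The key mechanism above is that recognition and VQA share one word-region embedding table rather than task-specific output layers. A toy sketch of that sharing, with invented shapes and a deliberately simplified fusion step (not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 10, 4
shared_word_emb = rng.normal(size=(vocab, d))   # one table for both tasks

def recognition_logits(region_feat, W=shared_word_emb):
    """Classify a region by similarity to every category's word embedding."""
    return W @ region_feat                       # (vocab,)

def vqa_answer_logits(question_ids, region_feat, W=shared_word_emb):
    """Toy VQA head: pool question word embeddings, fuse with the region
    feature, then score answers against the SAME word embedding table."""
    q = W[question_ids].mean(axis=0)             # pooled question vector
    fused = q * region_feat                      # elementwise fusion (illustrative)
    return W @ fused                             # (vocab,)
```

Because both heads score against `shared_word_emb`, gradients from VQA update the same category vectors used for recognition, which is the inductive-transfer route the abstract describes.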
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
The ability to actively ground task instructions from an egocentric view is
crucial for AI agents to accomplish tasks or assist humans virtually. One
important step towards this goal is to localize and track key active objects
that undergo major state change as a consequence of human actions/interactions
with the environment, without being told exactly what/where to ground (e.g.,
localizing and tracking the `sponge` in video from the instruction "Dip the
`sponge` into the bucket."). While existing works approach this problem from a
pure vision perspective, we investigate to what extent the textual modality
(i.e., task instructions) and their interaction with visual modality can be
beneficial. Specifically, we propose to improve phrase grounding models'
ability to localize the active objects by: (1) learning the role of `objects
undergoing change` and extracting them accurately from the instructions, (2)
leveraging pre- and post-conditions of the objects during actions, and (3)
recognizing the objects more robustly with descriptive knowledge. We leverage
large language models (LLMs) to extract the aforementioned action-object
knowledge, and design a per-object aggregation masking technique to effectively
perform joint inference on object phrases and symbolic knowledge. We evaluate
our framework on Ego4D and Epic-Kitchens datasets. Extensive experiments
demonstrate the effectiveness of our proposed framework, which leads to >54%
improvements in all standard metrics on the TREK-150-OPE-Det localization +
tracking task, >7% improvements in all standard metrics on the TREK-150-OPE
tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD
task. Comment: In Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing (EMNLP).