MuMUR : Multilingual Multimodal Universal Retrieval
Multi-modal retrieval has seen tremendous progress with the development of
vision-language models. However, further improving these models requires
additional labelled data, which demands substantial manual effort. In this
paper, we propose MuMUR, a framework that utilizes knowledge transfer from a
multilingual model to boost the performance of multi-modal (image and video)
retrieval. We
first use state-of-the-art machine translation models to construct pseudo
ground-truth multilingual visual-text pairs. We then use this data to learn a
joint vision-text representation where English and non-English text queries are
represented in a common embedding space based on pretrained multilingual
models. We evaluate our proposed approach on a diverse set of retrieval
datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades and
multilingual MSRVTT) and two image retrieval datasets (Flickr30k and Multi30k).
Experimental results demonstrate that our approach achieves state-of-the-art
results on all video retrieval datasets, outperforming previous models.
Additionally, MuMUR significantly outperforms prior models on the multilingual
video retrieval dataset. We also observe that MuMUR exhibits
strong performance on image retrieval. This demonstrates the universal ability
of MuMUR to perform retrieval across all visual inputs (image and video) and
text inputs (monolingual and multilingual).
Comment: This is an extension of the previous MKTVR paper (for which you can
find a reference here: https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42,
or in a previous version on arXiv). This version was published in the
Information Retrieval Journal.
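The pseudo ground-truth pair construction described in the abstract can be sketched as follows. Here `translate` is a hypothetical stand-in for a state-of-the-art machine translation model, and the pairing logic is an illustration of the idea rather than the authors' implementation:

```python
# Sketch: expanding English visual-text pairs into pseudo ground-truth
# multilingual pairs via machine translation (translate() is a stand-in).
def translate(text, target_lang):
    # Hypothetical placeholder for a real MT model.
    toy_mt = {
        ("a dog runs", "fr"): "un chien court",
        ("a cat sleeps", "fr"): "un chat dort",
    }
    return toy_mt[(text, target_lang)]

def build_pseudo_pairs(visual_text_pairs, target_langs):
    """Keep each English pair and add one translated pair per target language."""
    pseudo = []
    for visual_id, caption in visual_text_pairs:
        pseudo.append((visual_id, caption, "en"))
        for lang in target_langs:
            pseudo.append((visual_id, translate(caption, lang), lang))
    return pseudo

pairs = build_pseudo_pairs(
    [("video_0", "a dog runs"), ("video_1", "a cat sleeps")], ["fr"]
)
```

Each visual input thus gets aligned captions in every language, which is what allows English and non-English queries to be trained into a common embedding space.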
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Two-Tower Vision-Language (VL) models have shown promising improvements on
various downstream VL tasks. Although the most advanced work improves
performance by building bridges between encoders, it suffers from ineffective
layer-by-layer utilization of uni-modal representations and cannot flexibly
exploit different levels of uni-modal semantic knowledge. In this work, we
propose ManagerTower, a novel VL model architecture that gathers and combines
the insights of pre-trained uni-modal experts at different levels. The managers
introduced in each cross-modal layer can adaptively aggregate uni-modal
semantic knowledge to facilitate more comprehensive cross-modal alignment and
fusion. ManagerTower outperforms previous strong baselines both with and
without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower
achieves superior performance on various downstream VL tasks, including 79.15%
accuracy on VQAv2 Test-Std, and 86.56% IR@1 and 95.64% TR@1 on Flickr30K. Code
and checkpoints are available at https://github.com/LooperXX/ManagerTower.
Comment: Accepted by ACL 2023 Main Conference, Oral.
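The manager mechanism described above — adaptively aggregating different levels of uni-modal semantic knowledge — can be sketched as a softmax-weighted sum over the layer outputs of a pre-trained uni-modal expert. The shapes and the single learned weight vector below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def manager_aggregate(layer_feats, layer_logits):
    """Aggregate uni-modal expert layers with softmax weights.

    layer_feats:  (n_layers, seq_len, dim) outputs of a frozen uni-modal encoder
    layer_logits: (n_layers,) learned scores, one per layer
    """
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()
    # Contract the layer axis: result is (seq_len, dim).
    return np.tensordot(w, layer_feats, axes=1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 5, 8))          # e.g. 12 text-encoder layers
fused = manager_aggregate(feats, np.zeros(12))   # uniform weights -> layer mean
```

With uniform logits the manager falls back to averaging all layers; training the logits lets each cross-modal layer emphasize whichever expert levels are most useful.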
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Breakthroughs in transformer-based models have revolutionized not only the
NLP field, but also vision and multimodal systems. However, although
visualization and interpretability tools have become available for NLP models,
internal mechanisms of vision and multimodal transformers remain largely
opaque. With the success of these transformers, it is increasingly critical to
understand their inner workings, as unraveling these black-boxes will lead to
more capable and trustworthy models. To contribute to this quest, we propose
VL-InterpreT, which provides novel interactive visualizations for interpreting
the attentions and hidden representations in multimodal transformers.
VL-InterpreT is a task agnostic and integrated tool that (1) tracks a variety
of statistics in attention heads throughout all layers for both vision and
language components, (2) visualizes cross-modal and intra-modal attentions
through easily readable heatmaps, and (3) plots the hidden representations of
vision and language tokens as they pass through the transformer layers. In this
paper, we demonstrate the functionalities of VL-InterpreT through the analysis
of KD-VLP, an end-to-end pretrained vision-language transformer model, in the
tasks of Visual Commonsense Reasoning (VCR) and WebQA, two visual question
answering benchmarks. Furthermore, we present a few interesting findings about
multimodal transformer behaviors that were learned through our tool.
Comment: CVPR 2022 demo track.
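The cross-modal and intra-modal attention heatmaps described in (2) can be sketched as follows, assuming a joint token sequence in which text tokens precede vision tokens; the block-splitting convention is an illustration, not VL-InterpreT's actual internals:

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights for one head (rows sum to 1)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_modal_blocks(attn, n_text):
    """Split a joint (text+vision) attention map into intra-/cross-modal blocks."""
    return {
        "text->text":     attn[:n_text, :n_text],
        "text->vision":   attn[:n_text, n_text:],
        "vision->text":   attn[n_text:, :n_text],
        "vision->vision": attn[n_text:, n_text:],
    }

rng = np.random.default_rng(0)
Q = rng.standard_normal((7, 16))   # 3 text + 4 vision tokens, head dim 16
K = rng.standard_normal((7, 16))
attn = attention_weights(Q, K)
blocks = split_modal_blocks(attn, n_text=3)
```

Each block can then be rendered as a heatmap, which is what makes cross-modal attention (e.g. text queries attending to vision keys) directly readable.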
Look, the World is Watching How We Treat Migrants! The Making of the Anti-Trafficking Legislation during the Ma Administration
Employing the spiral model, this research analyses how anti-human trafficking legislation was promulgated during the Ma Ying-jeou (Ma Yingjiu) presidency. This research found that the government of Taiwan was just as accountable for the violation of migrants' human rights as the exploitive placement agencies and abusive employers. This research argues that, given its reliance on the United States for political and security support, Taiwan has made great efforts to improve its human rights records and meet US standards for protecting human rights. The reform was a result of multilevel inputs, including US pressure and collaboration between transnational and domestic advocacy groups. A major contribution of this research is to challenge the belief that human rights protection is intrinsic to democracy. In the same light, this research also cautions against Taiwan's subscription to US norms, since the reform was achieved at the cost of stereotyping trafficking victimhood, legitimising state surveillance, and further marginalising sex workers.