MuMUR : Multilingual Multimodal Universal Retrieval
Multi-modal retrieval has seen tremendous progress with the development of
vision-language models. However, further improving these models requires
additional labelled data, which demands substantial manual effort. In this
paper, we propose MuMUR, a framework that utilizes knowledge transfer from a
multilingual model to boost the performance of multi-modal (image and video)
retrieval. We
first use state-of-the-art machine translation models to construct pseudo
ground-truth multilingual visual-text pairs. We then use this data to learn a
joint vision-text representation where English and non-English text queries are
represented in a common embedding space based on pretrained multilingual
models. We evaluate our proposed approach on a diverse set of retrieval
datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades and
multilingual MSRVTT) and two image retrieval datasets (Flickr30k and Multi30k).
Experimental results demonstrate that our approach achieves state-of-the-art
results on all video retrieval datasets, outperforming previous models.
Additionally, MuMUR significantly outperforms prior models on the multilingual
video retrieval dataset. We also observe that MuMUR exhibits
strong performance on image retrieval. This demonstrates the universal ability
of MuMUR to perform retrieval across all visual inputs (image and video) and
text inputs (monolingual and multilingual).
Comment: This is an extension of the previous MKTVR paper (for which you can
find a reference here: https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42,
or in a previous version on arXiv). This version was published in the
Information Retrieval Journal.
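The pseudo ground-truth pair construction described in the abstract can be sketched as follows. Here `translate` is a hypothetical stand-in for a state-of-the-art machine translation model, and the pairing logic is an illustration of the idea rather than the authors' implementation:

```python
# Sketch: expanding English visual-text pairs into pseudo ground-truth
# multilingual pairs via machine translation (translate() is a stand-in).
def translate(text, target_lang):
    # Hypothetical placeholder for a real MT model.
    toy_mt = {
        ("a dog runs", "fr"): "un chien court",
        ("a cat sleeps", "fr"): "un chat dort",
    }
    return toy_mt[(text, target_lang)]

def build_pseudo_pairs(visual_text_pairs, target_langs):
    """Keep each English pair and add one translated pair per target language."""
    pseudo = []
    for visual_id, caption in visual_text_pairs:
        pseudo.append((visual_id, caption, "en"))
        for lang in target_langs:
            pseudo.append((visual_id, translate(caption, lang), lang))
    return pseudo

pairs = build_pseudo_pairs(
    [("video_0", "a dog runs"), ("video_1", "a cat sleeps")], ["fr"]
)
```

Each visual input thus gets aligned captions in every language, which is what allows English and non-English queries to be trained into a common embedding space.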
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Two-Tower Vision-Language (VL) models have shown promising improvements on
various downstream VL tasks. Although the most advanced work improves
performance by building bridges between encoders, it suffers from ineffective
layer-by-layer utilization of uni-modal representations and cannot flexibly
exploit different levels of uni-modal semantic knowledge. In this work, we
propose ManagerTower, a novel VL model architecture that gathers and combines
the insights of pre-trained uni-modal experts at different levels. The managers
introduced in each cross-modal layer can adaptively aggregate uni-modal
semantic knowledge to facilitate more comprehensive cross-modal alignment and
fusion. ManagerTower outperforms previous strong baselines both with and
without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower
achieves superior performance on various downstream VL tasks, including 79.15%
accuracy on VQAv2 Test-Std, and 86.56% IR@1 and 95.64% TR@1 on Flickr30K. Code
and checkpoints are available at https://github.com/LooperXX/ManagerTower.
Comment: Accepted by ACL 2023 Main Conference, Oral.
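The manager mechanism described above — adaptively aggregating different levels of uni-modal semantic knowledge — can be sketched as a softmax-weighted sum over the layer outputs of a pre-trained uni-modal expert. The shapes and the single learned weight vector below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def manager_aggregate(layer_feats, layer_logits):
    """Aggregate uni-modal expert layers with softmax weights.

    layer_feats:  (n_layers, seq_len, dim) outputs of a frozen uni-modal encoder
    layer_logits: (n_layers,) learned scores, one per layer
    """
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()
    # Contract the layer axis: result is (seq_len, dim).
    return np.tensordot(w, layer_feats, axes=1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 5, 8))          # e.g. 12 text-encoder layers
fused = manager_aggregate(feats, np.zeros(12))   # uniform weights -> layer mean
```

With uniform logits the manager falls back to averaging all layers; training the logits lets each cross-modal layer emphasize whichever expert levels are most useful.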
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Breakthroughs in transformer-based models have revolutionized not only the
NLP field, but also vision and multimodal systems. However, although
visualization and interpretability tools have become available for NLP models,
internal mechanisms of vision and multimodal transformers remain largely
opaque. With the success of these transformers, it is increasingly critical to
understand their inner workings, as unraveling these black-boxes will lead to
more capable and trustworthy models. To contribute to this quest, we propose
VL-InterpreT, which provides novel interactive visualizations for interpreting
the attentions and hidden representations in multimodal transformers.
VL-InterpreT is a task agnostic and integrated tool that (1) tracks a variety
of statistics in attention heads throughout all layers for both vision and
language components, (2) visualizes cross-modal and intra-modal attentions
through easily readable heatmaps, and (3) plots the hidden representations of
vision and language tokens as they pass through the transformer layers. In this
paper, we demonstrate the functionalities of VL-InterpreT through the analysis
of KD-VLP, an end-to-end pretrained vision-language transformer model, in the
tasks of Visual Commonsense Reasoning (VCR) and WebQA, two visual question
answering benchmarks. Furthermore, we present a few interesting findings about
multimodal transformer behaviors that were learned through our tool.
Comment: CVPR 2022 demo track.
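The cross-modal and intra-modal attention heatmaps described in (2) can be sketched as follows, assuming a joint token sequence in which text tokens precede vision tokens; the block-splitting convention is an illustration, not VL-InterpreT's actual internals:

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights for one head (rows sum to 1)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_modal_blocks(attn, n_text):
    """Split a joint (text+vision) attention map into intra-/cross-modal blocks."""
    return {
        "text->text":     attn[:n_text, :n_text],
        "text->vision":   attn[:n_text, n_text:],
        "vision->text":   attn[n_text:, :n_text],
        "vision->vision": attn[n_text:, n_text:],
    }

rng = np.random.default_rng(0)
Q = rng.standard_normal((7, 16))   # 3 text + 4 vision tokens, head dim 16
K = rng.standard_normal((7, 16))
attn = attention_weights(Q, K)
blocks = split_modal_blocks(attn, n_text=3)
```

Each block can then be rendered as a heatmap, which is what makes cross-modal attention (e.g. text queries attending to vision keys) directly readable.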
Look, the World is Watching How We Treat Migrants! The Making of the Anti-Trafficking Legislation during the Ma Administration
Employing the spiral model, this research analyses how anti-human trafficking legislation was promulgated during the Ma Ying-jeou (Ma Yingjiu) presidency. This research found that the government of Taiwan was just as accountable for the violation of migrants' human rights as the exploitive placement agencies and abusive employers. This research argues that, given its reliance on the United States for political and security support, Taiwan has made great efforts to improve its human rights records and meet US standards for protecting human rights. The reform was a result of multilevel inputs, including US pressure and collaboration between transnational and domestic advocacy groups. A major contribution of this research is to challenge the belief that human rights protection is intrinsic to democracy. In the same light, this research also cautions against Taiwan's subscription to US norms, since the reform was achieved at the cost of stereotyping trafficking victimhood, legitimising state surveillance, and further marginalising sex workers.