2 research outputs found

Towards Multimodal Simultaneous Neural Machine Translation

Authors: Hirasawa Tosho, Imankulova Aizhan, Kaneko Masahiro, Komachi Mamoru
Publication date: 07/04/2020

Simultaneous translation involves translating a sentence before the speaker's utterance is completed in order to realize real-time understanding in multiple languages. This task is significantly harder than general full-sentence translation because of the shortage of input information during decoding. To alleviate this shortage, we propose multimodal simultaneous neural machine translation (MSNMT), which leverages visual information as an additional modality. Although the usefulness of images as an additional modality is moderate for full-sentence translation, we verified, for the first time, its importance for simultaneous translation. Our experiments with the Multi30k dataset showed that MSNMT in a simultaneous setting significantly outperforms its text-only counterpart in situations where 5 or fewer input tokens are needed to begin translation. We then verified the importance of visual information during decoding by (a) performing an adversarial evaluation of MSNMT where we studied how models behave with an incongruent input modality and (b) analyzing the image attention

arXiv.org e-Print Archive

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

Authors: Bharti Taroon, Cui Edward, Duan Nan, Gao Jianfeng, Huang Haoyang, Ni Minheng, Su Lin, Wang Lijuan, Zhang Dongdong
Publication date: 31/03/2021

We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training. Our goal is to learn universal representations that can map objects occurring in different modalities or texts expressed in different languages into a common semantic space. In addition, to explicitly encourage fine-grained alignment between images and non-English languages, we also propose Multimodal Code-switched Training (MCT) to combine monolingual pre-training and multimodal pre-training via a code-switch strategy. Experiments are performed on the multilingual image retrieval task across two benchmark datasets, MSCOCO and Multi30K. M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.

Comment: Accepted to CVPR 202

arXiv.org e-Print Archive