Towards Multimodal Simultaneous Neural Machine Translation
Simultaneous translation involves translating a sentence before the speaker's
utterance is completed in order to realize real-time understanding in multiple
languages. This task is significantly harder than general full-sentence
translation because of the shortage of input information during decoding. To
alleviate this shortage, we propose multimodal simultaneous neural machine
translation (MSNMT), which leverages visual information as an additional
modality. Although the usefulness of images as an additional modality is
moderate for full-sentence translation, we verified, for the first time, its
importance for simultaneous translation. Our experiments with the Multi30k
dataset showed that MSNMT in a simultaneous setting significantly outperforms
its text-only counterpart in situations where translation must begin after
reading 5 or fewer input tokens. We then verified the importance of visual
information during decoding by (a) performing an adversarial evaluation of
MSNMT, where we studied how models behave with incongruent input modality,
and (b) analyzing the image attention.
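
The abstract describes the method only at a high level. As a rough
illustration, the sketch below shows wait-k-style simultaneous decoding fused
with a single global image feature, which matches the low-latency setting the
abstract describes. The model, its GRU layers, and the names TinyMSNMT and
decode_wait_k are hypothetical stand-ins for illustration, not the authors'
architecture.

    # Minimal sketch (not the authors' code): wait-k simultaneous decoding
    # that fuses a global image feature into every decoder step.
    import torch
    import torch.nn as nn

    class TinyMSNMT(nn.Module):
        """Toy GRU encoder-decoder with a projected image-feature context."""
        def __init__(self, vocab=1000, dim=64, img_dim=128):
            super().__init__()
            self.src_emb = nn.Embedding(vocab, dim)
            self.tgt_emb = nn.Embedding(vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.img_proj = nn.Linear(img_dim, dim)    # project image feature
            self.decoder = nn.GRUCell(dim + dim, dim)  # token+src ctx, img ctx
            self.out = nn.Linear(dim, vocab)

        @torch.no_grad()
        def decode_wait_k(self, src_ids, img_feat, k=3, max_len=20, bos=1, eos=2):
            """wait-k policy: read k source tokens, then write one / read one."""
            img_ctx = torch.tanh(self.img_proj(img_feat))     # (1, dim)
            h = torch.zeros(1, self.decoder.hidden_size)
            y, out_ids = torch.tensor([bos]), []
            for t in range(max_len):
                # Only the first min(k + t, |src|) tokens are visible at step t.
                visible = src_ids[:, : min(k + t, src_ids.size(1))]
                enc_out, _ = self.encoder(self.src_emb(visible))
                src_ctx = enc_out[:, -1]                      # last visible state
                step_in = torch.cat([self.tgt_emb(y) + src_ctx, img_ctx], dim=-1)
                h = self.decoder(step_in, h)
                y = self.out(h).argmax(dim=-1)
                out_ids.append(int(y))
                if int(y) == eos:
                    break
            return out_ids

    if __name__ == "__main__":
        torch.manual_seed(0)
        model = TinyMSNMT()
        src = torch.randint(3, 1000, (1, 7))  # partially arriving source sentence
        img = torch.randn(1, 128)             # e.g., a pooled CNN image feature
        print(model.decode_wait_k(src, img, k=3))

Under this sketch, the adversarial evaluation in (a) would amount to passing
an incongruent image feature (e.g., one shuffled across the dataset) and
checking whether translation quality degrades.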
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
We present M3P, a Multitask Multilingual Multimodal Pre-trained model that
combines multilingual pre-training and multimodal pre-training into a unified
framework via multitask pre-training. Our goal is to learn universal
representations that can map objects occurring in different modalities or texts
expressed in different languages into a common semantic space. In addition, to
explicitly encourage fine-grained alignment between images and non-English
languages, we also propose Multimodal Code-switched Training (MCT) to combine
monolingual pre-training and multimodal pre-training via a code-switch
strategy. Experiments are performed on the multilingual image retrieval task
across two benchmark datasets, MSCOCO and Multi30K. M3P achieves comparable
results for English and new state-of-the-art results for non-English
languages.
Comment: Accepted to CVPR 2021
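
The abstract leaves the code-switch strategy implicit. The sketch below shows
one plausible reading: words in a caption are stochastically replaced with
bilingual-dictionary translations, so that image features co-occur with
non-English tokens during pre-training. The function name, the probability p,
and the toy dictionary are illustrative assumptions, not M3P's implementation.

    # Minimal sketch (not M3P's code) of building code-switched captions.
    import random

    def code_switch(caption, bilingual_dict, p=0.3, seed=None):
        """Replace each word with a dictionary translation with probability p."""
        rng = random.Random(seed)
        words = []
        for w in caption.split():
            translations = bilingual_dict.get(w.lower())
            if translations and rng.random() < p:
                words.append(rng.choice(translations))
            else:
                words.append(w)
        return " ".join(words)

    if __name__ == "__main__":
        # Hypothetical English-German entries; a real setup would use a full
        # bilingual dictionary rather than this toy one.
        en_de = {"dog": ["Hund"], "ball": ["Ball"], "red": ["rot", "roten"]}
        print(code_switch("a dog plays with a red ball", en_de, p=0.5, seed=0))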