17 research outputs found

The MeMAD Submission to the WMT18 Multimodal Translation Task

Author: Grönroos Stig-Arne
Huet Benoit
Kurimo Mikko
Laaksonen Jorma
Merialdo Bernard
Pham Phu
Sjöberg Mats
Sulubacak Umut
Tiedemann Jörg
Troncy Raphaël
Vázquez Carrillo Juan Raúl
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2018

This paper describes the MeMAD project entry to the WMT Multimodal Machine Translation Shared Task. We propose adapting the Transformer neural machine translation (NMT) architecture to a multi-modal setting. In this paper, we also describe the preliminary experiments with text-only translation systems leading us up to this choice. We have the top scoring system for both English-to-German and English-to-French, according to the automatic metrics for flickr18. Our experiments show that the effect of the visual features in our system is small. Our largest gains come from the quality of the underlying text-only NMT system. We find that appropriate use of additional data is effective.

arXiv.org e-Print Archive

Crossref

Aaltodoc Publication Archive

Helsingin yliopiston digitaalinen arkisto

Word-Region Alignment-Guided Multimodal Neural Machine Translation

Author: Chu Chenhui
Kajiwara Tomoyuki
Komachi Mamoru
Zhao Yuting
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

We propose word-region alignment-guided multimodal neural machine translation (MNMT), a novel model for MNMT that links the semantic correlation between textual and visual modalities using word-region alignment (WRA). Existing studies on MNMT have mainly focused on the effect of integrating visual and textual modalities. However, they do not leverage the semantic relevance between the two modalities. We advance the semantic correlation between textual and visual modalities in MNMT by incorporating WRA as a bridge. This proposal has been implemented on two mainstream architectures of neural machine translation (NMT): the recurrent neural network (RNN) and the transformer. Experiments on two public benchmarks, English-German and English-French translation tasks using the Multi30k dataset and English-Japanese translation tasks using the Flickr30kEnt-JP dataset, show that our model significantly improves over competitive baselines across different evaluation metrics and outperforms most existing MNMT models. For example, BLEU improves by 1.0 points for the English-German task and by 1.1 points for the English-French task on the Multi30k test2016 set, and by 0.7 points for the English-Japanese task on the Flickr30kEnt-JP test set. Further analysis demonstrates that our model achieves better translation performance by integrating WRA, which leads to better use of visual information.

Kyoto University Research Information Repository

Double Attention-based Multimodal Neural Machine Translation with Semantic Image Regions

Author: Zhao Yuting
チョウウテイ
Publication date: 25/03/2020

Tokyo Metropolitan University Institutional Repository Miyako-Dori / 首都大学東京機関リポジトリ

Institutional Repositories DataBase (IRDB)

Region-Attentive Multimodal Neural Machine Translation

Author: Chu Chenhui
Kajiwara Tomoyuki
Komachi Mamoru
Zhao Yuting
Publication venue: Elsevier BV
Publication date: 01/03/2022

We propose a multimodal neural machine translation (MNMT) method with semantic image regions called region-attentive multimodal neural machine translation (RA-NMT). Existing studies on MNMT have mainly focused on employing global visual features or equally sized grid local visual features extracted by convolutional neural networks (CNNs) to improve translation performance. However, they neglect the effect of semantic information captured inside the visual features. This study utilizes semantic image regions extracted by object detection for MNMT and integrates visual and textual features using two modality-dependent attention mechanisms. The proposed method was implemented and verified on two neural machine translation (NMT) architectures: the recurrent neural network (RNN) and the self-attention network (SAN). Experimental results on different language pairs of the Multi30k dataset show that our proposed method improves over baselines and outperforms most of the state-of-the-art MNMT methods. Further analysis demonstrates that the proposed method achieves better translation performance because it makes better use of visual features.
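The idea of two modality-dependent attention mechanisms in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scaled dot-product attention, a single decoder state as the query, text states and region features that share the query's dimensionality, and fusion by simple concatenation. The function names `attend` and `double_attention` and all values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors.

    Assumes every vector has the same length as the query.
    Returns the weighted-sum context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    context = [
        sum(w * v[i] for w, v in zip(weights, vectors))
        for i in range(len(query))
    ]
    return context, weights

def double_attention(decoder_state, text_states, region_feats):
    """Attend separately over source words and image regions, then fuse.

    Fusion by concatenation is an assumption made for this sketch.
    """
    text_ctx, text_w = attend(decoder_state, text_states)
    img_ctx, img_w = attend(decoder_state, region_feats)
    return text_ctx + img_ctx, text_w, img_w

# Toy example: 2 source words, 3 detected regions, hidden size 2.
state = [1.0, 0.0]
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]
fused, text_weights, region_weights = double_attention(state, words, regions)
```

In this toy setup each modality gets its own attention distribution, so the decoder can weight one image region heavily while spreading its attention differently over the source words.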

Kyoto University Research Information Repository

Dynamic Context-guided Capsule Network for Multimodal Machine Translation

Author: Anderson Peter
Bahdanau Dzmitry
Caglayan Ozan
Desmond
He Kaiming
Jaiswal Ayush
Klein Guillaume
Michael
Papineni Kishore
Sabour Sara
Singh Maneet
Stig-Arne
Su Jinsong
Vaswani Ashish
Wang Mingxuan
Wu Qi
Xinyi Zhang
Yang Zhengxin
Zhang Xiangwen
Zheng Zaixiang
Publication venue: Association for Computing Machinery (ACM)
Publication date: 04/09/2020

Multimodal machine translation (MMT), which mainly focuses on enhancing text-only translation with visual features, has attracted considerable attention from both the computer vision and natural language processing communities. Most current MMT models rely on attention mechanisms, global context modeling, or multimodal joint representation learning to utilize visual features. However, the attention mechanism lacks sufficient semantic interaction between modalities, while the other two provide a fixed visual context, which is unsuitable for modeling the observed variability when generating a translation. To address these issues, we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at each timestep of decoding, we first employ the conventional source-target attention to produce a timestep-specific source-side context vector. Next, DCCN takes this vector as input and uses it to guide the iterative extraction of related visual features via a context-guided dynamic routing mechanism. In particular, we represent the input image with global and regional visual features, and we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities. Finally, we obtain two multimodal context vectors, which are fused and incorporated into the decoder for the prediction of the target word. Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN. Our code is available at https://github.com/DeepLearnXMU/MM-DCCN

arXiv.org e-Print Archive

Crossref