Doubly-Attentive Decoder for Multi-modal Neural Machine Translation
We introduce a Multi-modal Neural Machine Translation model in which a
doubly-attentive decoder naturally incorporates spatial visual features
obtained using pre-trained convolutional neural networks, bridging the gap
between image description and translation. Our decoder learns to attend to
source-language words and parts of an image independently by means of two
separate attention mechanisms as it generates words in the target language. We
find that our model can efficiently exploit not just back-translated in-domain
multi-modal data but also large general-domain text-only MT corpora. We also
report state-of-the-art results on the Multi30k dataset.
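The following is a minimal PyTorch sketch of the doubly-attentive decoding step this abstract describes: two independent attention mechanisms, one over source-word annotations and one over spatial image features, whose contexts feed a recurrent state. All names, dimensions, and the GRU-cell and bilinear-scoring choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one doubly-attentive decoder step (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoublyAttentiveStep(nn.Module):
    """Attend independently to source words and image locations,
    then fuse both contexts into the recurrent decoder state."""

    def __init__(self, emb_dim, hid_dim, src_dim, img_dim):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim + src_dim + img_dim, hid_dim)
        # Separate bilinear ("general") attention projections per modality.
        self.W_src = nn.Linear(hid_dim, src_dim, bias=False)
        self.W_img = nn.Linear(hid_dim, img_dim, bias=False)

    @staticmethod
    def attend(query, keys):
        # query: (B, d); keys: (B, T, d) -> context vector: (B, d)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)  # (B, T)
        alpha = F.softmax(scores, dim=1)                         # weights
        return torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)

    def forward(self, y_emb, h, src_ann, img_ann):
        # Two independent attention mechanisms, one per modality.
        c_src = self.attend(self.W_src(h), src_ann)  # textual context
        c_img = self.attend(self.W_img(h), img_ann)  # visual context
        return self.cell(torch.cat([y_emb, c_src, c_img], dim=1), h)
```

In such a setup, src_ann would come from a bidirectional RNN encoder and img_ann from the spatial grid of a pre-trained CNN (e.g. 14x14 = 196 location vectors).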
Using images to improve machine-translating E-commerce product listings
In this paper we study the impact of using images to machine-translate user-generated e-commerce product listings. We study how a multi-modal Neural Machine Translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attentional NMT model and a Statistical Machine Translation (SMT) model. User-generated product listings often do not constitute grammatical or well-formed sentences; more often than not, they consist of the juxtaposition of short phrases or keywords. We train our models end-to-end, and also use text-only and multi-modal NMT models to re-rank n-best lists generated by an SMT model. We qualitatively evaluate our user-generated training data and analyse how adding synthetic data impacts the results. We evaluate our models quantitatively using BLEU and TER and find that (i) additional synthetic data has a generally positive impact on text-only and multi-modal NMT models, and that (ii) using a multi-modal NMT model to re-rank n-best lists improves TER significantly across different n-best list sizes.
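As a sketch of the re-ranking setup, the function below interpolates an SMT model score with the log-probability a (multi-modal) NMT model assigns to each hypothesis; the linear interpolation and its weight are assumptions for illustration, not the paper's tuned combination.

```python
# Hypothetical n-best re-ranking sketch; names and weighting are illustrative.
from typing import Callable, List, Tuple

def rerank_nbest(
    nbest: List[Tuple[str, float]],        # (hypothesis, SMT model score)
    nmt_logprob: Callable[[str], float],   # log P(hypothesis | source, image)
    weight: float = 0.5,
) -> str:
    """Return the hypothesis maximizing the interpolated score."""
    def combined(item: Tuple[str, float]) -> float:
        hyp, smt_score = item
        return (1.0 - weight) * smt_score + weight * nmt_logprob(hyp)
    return max(nbest, key=combined)[0]
```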
Probing the need for visual context in multimodal machine translation
Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models of source-side textual context. Our results show that under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.
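A minimal sketch of the kind of source degradation such probing relies on, assuming a simple independent token-masking scheme (the paper's exact degradation protocol may differ):

```python
# Hypothetical source-degradation probe: mask source tokens so the model
# must lean on the visual input. The masking scheme is an assumption.
import random

def degrade_source(tokens, drop_prob=0.5, mask="<unk>", seed=0):
    """Independently replace each token with `mask` at rate drop_prob."""
    rng = random.Random(seed)
    return [mask if rng.random() < drop_prob else tok for tok in tokens]

src = "a brown dog runs on the beach".split()
print(degrade_source(src))  # roughly half the tokens become <unk>
```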
Region-Attentive Multimodal Neural Machine Translation
We propose a multimodal neural machine translation (MNMT) method with semantic image regions called region-attentive multimodal neural machine translation (RA-NMT). Existing studies on MNMT have mainly focused on employing global visual features or equally sized grid local visual features extracted by convolutional neural networks (CNNs) to improve translation performance. However, they neglect the semantic information captured inside the visual features. This study utilizes semantic image regions extracted by object detection for MNMT and integrates visual and textual features using two modality-dependent attention mechanisms. The proposed method was implemented and verified on two neural machine translation (NMT) architectures: a recurrent neural network (RNN) and a self-attention network (SAN). Experimental results on different language pairs of the Multi30k dataset show that our proposed method improves over baselines and outperforms most state-of-the-art MNMT methods. Further analysis demonstrates that the proposed method achieves better translation performance because it makes better use of visual features.
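To illustrate where such semantic image regions could come from, here is a hypothetical extraction step using an off-the-shelf torchvision Faster R-CNN; the detector choice, score threshold, and region cap are assumptions, and the paper's pipeline may differ. The ROI-pooled feature vectors for the kept boxes would then replace grid features as the keys of the visual attention mechanism.

```python
# Hypothetical region extraction for region-attentive MNMT (torchvision).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def semantic_regions(image, score_thresh=0.7, max_regions=10):
    """image: float tensor (3, H, W) in [0, 1]; returns confident boxes."""
    pred = detector([image])[0]                  # dict: boxes, labels, scores
    keep = pred["scores"] >= score_thresh
    boxes = pred["boxes"][keep][:max_regions]    # (R, 4) region coordinates
    labels = pred["labels"][keep][:max_regions]  # (R,) COCO class ids
    return boxes, labels
```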
The role of image representations in vision to language tasks
Tasks that require modeling of both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language in a variety of ways with end-to-end neural network-based models. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision-to-language problems: the task of image captioning and the task of multimodal machine translation. Our analysis provides interesting insights into these representational properties and suggests that end-to-end approaches implicitly learn a visual-semantic subspace and exploit it to generate captions.
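One common way to probe the representational contribution of image features is to swap in controls and measure the effect on generation quality; the control set below (globally pooled, mismatched, and random features) is an assumed design for illustration, not necessarily the paper's.

```python
# Hypothetical probing controls for image representations (PyTorch).
import torch

def representation_controls(feats: torch.Tensor) -> dict:
    """feats: (B, N, D) spatial features for a batch of images."""
    return {
        "pooled": feats.mean(dim=1, keepdim=True).expand_as(feats),  # global info only
        "mismatched": feats[torch.randperm(feats.size(0))],          # wrong image
        "random": torch.randn_like(feats),                           # no image info
    }
```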