24,379 research outputs found
LIUM-CVC Submissions for WMT17 Multimodal Translation Task
This paper describes the monomodal and multimodal Neural Machine Translation
systems developed by LIUM and CVC for WMT17 Shared Task on Multimodal
Translation. We mainly explored two multimodal architectures where either
global visual features or convolutional feature maps are integrated in order to
benefit from visual context. Our final systems ranked first for both En-De and
En-Fr language pairs according to the automatic evaluation metrics METEOR and
BLEU.Comment: MMT System Description Paper for WMT1
Doubly-Attentive Decoder for Multi-modal Neural Machine Translation
We introduce a Multi-modal Neural Machine Translation model in which a
doubly-attentive decoder naturally incorporates spatial visual features
obtained using pre-trained convolutional neural networks, bridging the gap
between image description and translation. Our decoder learns to attend to
source-language words and parts of an image independently by means of two
separate attention mechanisms as it generates words in the target language. We
find that our model can efficiently exploit not just back-translated in-domain
multi-modal data but also large general-domain text-only MT corpora. We also
report state-of-the-art results on the Multi30k data set.Comment: 8 pages (11 including references), 2 figure
Latent Variable Model for Multi-modal Translation
In this work, we propose to model the interaction between visual and textual
features for multi-modal neural machine translation (MMT) through a latent
variable model. This latent variable can be seen as a multi-modal stochastic
embedding of an image and its description in a foreign language. It is used in
a target-language decoder and also to predict image features. Importantly, our
model formulation utilises visual and textual inputs during training but does
not require that images be available at test time. We show that our latent
variable MMT formulation improves considerably over strong baselines, including
a multi-task learning approach (Elliott and K\'ad\'ar, 2017) and a conditional
variational auto-encoder approach (Toyama et al., 2016). Finally, we show
improvements due to (i) predicting image features in addition to only
conditioning on them, (ii) imposing a constraint on the minimum amount of
information encoded in the latent variable, and (iii) by training on additional
target-language image descriptions (i.e. synthetic data).Comment: Paper accepted at ACL 2019. Contains 8 pages (11 including
references, 13 including appendix), 6 figure
Multimodal Grounding for Sequence-to-Sequence Speech Recognition
Humans are capable of processing speech by making use of multiple sensory
modalities. For example, the environment where a conversation takes place
generally provides semantic and/or acoustic context that helps us to resolve
ambiguities or to recall named entities. Motivated by this, there have been
many works studying the integration of visual information into the speech
recognition pipeline. Specifically, in our previous work, we propose a
multistep visual adaptive training approach which improves the accuracy of an
audio-based Automatic Speech Recognition (ASR) system. This approach, however,
is not end-to-end as it requires fine-tuning the whole model with an adaptation
layer. In this paper, we propose novel end-to-end multimodal ASR systems and
compare them to the adaptive approach by using a range of visual
representations obtained from state-of-the-art convolutional neural networks.
We show that adaptive training is effective for S2S models leading to an
absolute improvement of 1.4% in word error rate. As for the end-to-end systems,
although they perform better than baseline, the improvements are slightly less
than adaptive training, 0.8 absolute WER reduction in single-best models. Using
ensemble decoding, end-to-end models reach a WER of 15% which is the lowest
score among all systems.Comment: ICASSP 201
- …