
Neural machine translation for multimodal interaction

Abstract

Multimodal neural machine translation (MNMT) systems trained on a combination of visual and textual inputs typically produce better translations than systems trained on textual inputs alone. The task of such systems can be decomposed into two sub-tasks: learning visually grounded representations from images and translating the textual counterparts using those representations. In a multi-task learning framework, translations are generated by an attention-based encoder-decoder, while the grounded representations are obtained from convolutional neural networks (CNNs) pretrained for image classification. In this thesis, we study different computational techniques for translating the meaning of sentences from one language into another, treating the visual modality as a naturally occurring meaning representation that bridges languages. We examine the behaviour of state-of-the-art MNMT systems from the data perspective in order to understand the role of both the textual and visual inputs in such systems. We evaluate our models on Multi30k, a large-scale multilingual multimodal dataset publicly available for machine learning research. Our results in the optimal and sparse data settings show that the differences in translation performance are proportional to the amount of both visual and linguistic information, whereas in the adversarial condition the effect of the visual modality is small or negligible. The chapters of the thesis follow a progression: applying different state-of-the-art MNMT models to incorporate images in optimal data settings, creating synthetic image data in a low-resource scenario, and adding adversarial perturbations to the textual input to evaluate the real contribution of images.
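
To make the described architecture concrete, the following is a minimal sketch (in PyTorch) of a single decoder step that attends over both textual encoder states and spatial image features from a pretrained CNN. It is not the thesis implementation; all class, parameter, and dimension names (e.g. img_dim=2048 for ResNet-style features) are illustrative assumptions.

    # Sketch only: one decoding step of a doubly-attentive multimodal NMT model.
    import torch
    import torch.nn as nn

    class MultimodalAttentionDecoderStep(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hid_dim=512, img_dim=2048):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTMCell(emb_dim + 2 * hid_dim, hid_dim)
            # project text states and image features into a shared attention space
            self.txt_proj = nn.Linear(hid_dim, hid_dim)
            self.img_proj = nn.Linear(img_dim, hid_dim)
            self.query_proj = nn.Linear(hid_dim, hid_dim)
            self.txt_score = nn.Linear(hid_dim, 1)
            self.img_score = nn.Linear(hid_dim, 1)
            self.out = nn.Linear(hid_dim, vocab_size)

        def attend(self, query, keys, score_layer):
            # additive (Bahdanau-style) attention over a set of keys
            q = self.query_proj(query).unsqueeze(1)      # (B, 1, H)
            scores = score_layer(torch.tanh(keys + q))   # (B, T, 1)
            weights = torch.softmax(scores, dim=1)
            return (weights * keys).sum(dim=1)           # (B, H)

        def forward(self, prev_token, hidden, cell, txt_states, img_feats):
            # txt_states: (B, T, hid_dim) states from a recurrent text encoder
            # img_feats:  (B, R, img_dim) spatial features from a pretrained CNN
            txt_ctx = self.attend(hidden, self.txt_proj(txt_states), self.txt_score)
            img_ctx = self.attend(hidden, self.img_proj(img_feats), self.img_score)
            rnn_in = torch.cat([self.embed(prev_token), txt_ctx, img_ctx], dim=-1)
            hidden, cell = self.rnn(rnn_in, (hidden, cell))
            return self.out(hidden), hidden, cell

    # Toy forward pass with random tensors (shape check only, no training):
    B, T, R = 2, 7, 49
    step = MultimodalAttentionDecoderStep(vocab_size=1000)
    h = c = torch.zeros(B, 512)
    logits, h, c = step(torch.zeros(B, dtype=torch.long), h, c,
                        torch.randn(B, T, 512), torch.randn(B, R, 2048))

In this sketch, removing or zeroing img_feats reduces the model to a text-only attentional encoder-decoder, which is one way the relative contribution of the visual modality can be probed.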
