A Correlational Encoder Decoder Architecture for Pivot Based Sequence Generation
Interlingua-based Machine Translation (MT) aims to encode multiple languages
into a common linguistic representation and then decode sentences in multiple
target languages from this representation. In this work we explore this idea in
the context of neural encoder-decoder architectures, albeit on a smaller scale
and without MT as the end goal. Specifically, we consider the case of three
languages or modalities X, Z and Y wherein we are interested in generating
sequences in Y starting from information available in X. However, there is no
parallel training data available between X and Y, but training data is
available between X & Z and Z & Y (as is often the case in many real-world
applications). Z thus acts as a pivot/bridge. An obvious solution, which is
perhaps less elegant but works very well in practice, is to train a two-stage
model which first converts from X to Z and then from Z to Y. Instead we explore
an interlingua-inspired solution which jointly learns to do the following: (i)
encode X and Z to a common representation and (ii) decode Y from this common
representation. We evaluate our model on two tasks: (i) bridge transliteration
and (ii) bridge captioning. We report promising results in both these
applications and believe that this is a step in the right direction towards
truly interlingua-inspired encoder-decoder architectures.
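To make the joint objective concrete, here is a minimal PyTorch sketch of such a correlational encoder-decoder: separate encoders map X and Z into a shared space, a correlation term pulls parallel X-Z encodings together, and a single decoder generates Y from that space. The module names, GRU choice, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a pivot-based correlational encoder-decoder.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CorrelationalEncDec(nn.Module):
    def __init__(self, vocab_x, vocab_z, vocab_y, emb=256, hid=512):
        super().__init__()
        self.emb_x = nn.Embedding(vocab_x, emb)
        self.emb_z = nn.Embedding(vocab_z, emb)
        self.emb_y = nn.Embedding(vocab_y, emb)
        # Separate encoders for X and Z projecting into one shared space.
        self.enc_x = nn.GRU(emb, hid, batch_first=True)
        self.enc_z = nn.GRU(emb, hid, batch_first=True)
        # A single decoder generates Y from the shared representation.
        self.dec = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_y)

    def encode(self, tokens, source):
        emb, enc = ((self.emb_x, self.enc_x) if source == "x"
                    else (self.emb_z, self.enc_z))
        _, h = enc(emb(tokens))
        return h                      # (1, batch, hid) shared encoding

    def decode(self, h, y_in):
        dec_out, _ = self.dec(self.emb_y(y_in), h)
        return self.out(dec_out)      # next-token logits over Y's vocabulary

def correlation_loss(h_x, h_z):
    # Pull encodings of parallel X-Z pairs together so either language
    # can serve as the "interlingua" at decoding time.
    return nn.functional.mse_loss(h_x, h_z)
```

Training would then interleave two signals on the available parallel data: the correlation term on X-Z pairs and a cross-entropy decoding loss on Z-Y pairs; at test time Y is decoded directly from the encoding of X, even though no X-Y pairs were ever seen.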
Weakly Supervised Content Selection for Improved Image Captioning
Image captioning involves identifying semantic concepts in the scene and
describing them in fluent natural language. Recent approaches do not explicitly
model the semantic concepts and train the model only for the end goal of
caption generation. Such models lack interpretability and controllability,
primarily due to sub-optimal content selection. We address this problem by
breaking down the captioning task into two simpler, manageable and more
controllable tasks -- skeleton prediction and skeleton-based caption
generation. We approach the former as a weakly supervised task, using a simple
off-the-shelf language syntax parser and avoiding the need for additional human
annotations; the latter uses a supervised-learning approach. We investigate
three methods of conditioning the caption on the skeleton: in the encoder, in
the decoder, and in both. Our compositional model generates significantly
better-quality captions on out-of-domain test images, as judged by human
annotators.
Additionally, we demonstrate that the English skeleton transfers effectively
to other languages, including French, Italian, German, Spanish, and Hindi.
This compositional nature of captioning exhibits the potential of unpaired
image captioning, thereby reducing the dependence on expensive image-caption
pairs. Furthermore, we investigate the use of skeletons as a knob to control
certain properties of the generated image caption, such as length, content,
and gender expression.
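As a concrete illustration of the weakly supervised skeleton step, the sketch below derives a skeleton from a caption using an off-the-shelf parser (spaCy here); treating the skeleton as the content-word lemmas of nouns and verbs is an assumption for illustration, not the paper's exact definition.

```python
# Hypothetical skeleton extraction with an off-the-shelf syntax parser.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_skeleton(caption: str) -> list[str]:
    """Keep main content words (nouns/verbs); drop modifiers and function words."""
    doc = nlp(caption)
    return [tok.lemma_ for tok in doc if tok.pos_ in {"NOUN", "PROPN", "VERB"}]

print(extract_skeleton("A small brown dog is jumping over a wooden fence."))
# e.g. ['dog', 'jump', 'fence'] -- weak supervision for skeleton prediction
```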
Exploring Pre-Trained Model and Language Model for Translating Image to Bahasa
In the last decade, there have been significant developments in Image Caption Generation research to translate images into English descriptions. This task has also been conducted to produce texts in non-English languages, including Bahasa. However, references in this area are still limited, so opportunities for exploration remain wide open. This paper presents comparative research examining several state-of-the-art Deep Learning algorithms to extract image features and generate descriptions in Bahasa. We extracted image features using three pre-trained models, namely InceptionV3, Xception, and EfficientNetV2S. For the language model, we examined four architectures: LSTM, GRU, Bidirectional LSTM, and Bidirectional GRU. The dataset used was Flickr8k, translated into Bahasa. Model evaluation was conducted using BLEU and METEOR. Among the pre-trained models, EfficientNetV2S gave the significantly highest scores. Among the language models there was only a slight difference in performance, although in general the Bidirectional GRU scored higher. We also found that the step size in training affected overfitting: larger step sizes tended to provide better generalization. The best model used EfficientNetV2S and Bidirectional GRU with step size = 4096, which resulted in an average score of BLEU-1 = 0.5828 and METEOR = 0.4520.
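For orientation, a Keras sketch of the best-performing pairing reported above, EfficientNetV2S features feeding a Bidirectional GRU that predicts the next caption word, follows; the vocabulary size, caption length, layer widths, and merge strategy are assumptions, not the paper's exact configuration.

```python
# Illustrative next-word captioning model: frozen EfficientNetV2S encoder
# plus a Bidirectional GRU language model. Sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetV2S

VOCAB, MAXLEN, EMB, HID = 8000, 30, 256, 256

# Image branch: pre-trained CNN used purely as a feature extractor.
cnn = EfficientNetV2S(include_top=False, pooling="avg", weights="imagenet")
cnn.trainable = False
img_in = layers.Input(shape=(384, 384, 3))
img_feat = layers.Dense(HID, activation="relu")(cnn(img_in))

# Text branch: the partial caption so far, through a Bidirectional GRU.
txt_in = layers.Input(shape=(MAXLEN,))
x = layers.Embedding(VOCAB, EMB, mask_zero=True)(txt_in)
x = layers.Bidirectional(layers.GRU(HID))(x)

# Merge both branches and predict the next word of the caption.
merged = layers.concatenate([img_feat, x])
out = layers.Dense(VOCAB, activation="softmax")(
    layers.Dense(HID, activation="relu")(merged))

model = Model([img_in, txt_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```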
CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages
This work introduces CAPIVARA, a cost-efficient framework designed to enhance
the performance of multilingual CLIP models in low-resource languages. While
CLIP has excelled in zero-shot vision-language tasks, the resource-intensive
nature of model training remains challenging. Many datasets lack linguistic
diversity, featuring solely English descriptions for images. CAPIVARA addresses
this by augmenting text data using image captioning and machine translation to
generate multiple synthetic captions in low-resource languages. We optimize the
training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the
computational cost. Through extensive experiments, CAPIVARA emerges as the
state of the art in zero-shot tasks involving images and Portuguese texts. We
show the
potential for significant improvements in other low-resource languages,
achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a
single GPU for 2 hours. Our model and code are available at
https://github.com/hiaac-nlp/CAPIVARA
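The cost-saving recipe named in the abstract, a LiT-style locked image tower, LoRA adapters, and gradient checkpointing, can be sketched with Hugging Face transformers and peft as below. This is not the official CAPIVARA code (see the repository above); the base checkpoint and LoRA target pattern are assumptions.

```python
# Hypothetical low-cost fine-tuning setup: freeze the image encoder (LiT),
# add LoRA adapters to the text tower, enable gradient checkpointing.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed base

# LiT: lock the image encoder and tune only the text side.
for p in model.vision_model.parameters():
    p.requires_grad = False

# LoRA: train small low-rank adapters on the text tower's attention
# projections instead of its full weight matrices.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                  target_modules=r"text_model\..*\.(q_proj|v_proj)")
model = get_peft_model(model, lora)

# Gradient checkpointing trades recomputation for activation memory.
model.gradient_checkpointing_enable()
model.print_trainable_parameters()  # only the LoRA adapters remain trainable
```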
Neural machine translation for multimodal interaction
Multimodal neural machine translation (MNMT) systems trained on a combination
of visual and textual inputs typically produce better translations than systems
trained on textual inputs alone. The task of such systems can be decomposed
into two sub-tasks: learning visually grounded representations from images and
translating the textual counterparts using those representations. In a
multi-task learning framework, translations are generated by an attention-based
encoder-decoder, with grounded representations learned from pre-trained
convolutional neural networks (CNNs) for image classification.
In this thesis, I study different computational techniques to translate the
meaning of sentences from one language into another, considering the visual
modality as a naturally occurring meaning representation that bridges
languages. We examine the behaviour of state-of-the-art MNMT systems from the
data perspective in order to understand the role of both the textual and visual
inputs in such systems. We evaluate our models on Multi30k, a large-scale
multilingual multimodal dataset publicly available for machine learning
research. Our results in the optimal and sparse data settings show that
differences in translation system performance are proportional to the amount
of both visual and linguistic information, whereas in the adversarial condition
the effect of the visual modality is rather small or negligible. The chapters
of the thesis follow a progression: starting with different state-of-the-art
MMT models for incorporating images in optimal data settings, moving to the
creation of synthetic image data under low-resource scenarios, and extending
to the addition of adversarial perturbations to the textual input to evaluate
the real contribution of images.
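For concreteness, here is a bare-bones PyTorch sketch of the kind of MNMT system studied in the thesis: a text encoder-decoder whose decoder is initialised with a pooled image feature from a pre-trained CNN. Attention is omitted for brevity, and the ResNet backbone, hidden sizes, and fusion-by-initialisation choice are assumptions rather than the thesis's exact models.

```python
# Sketch of a multimodal NMT model: translation conditioned on both the
# source sentence and a pooled CNN image feature. Sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class MultimodalNMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        for p in self.cnn.parameters():   # use the CNN as a frozen extractor
            p.requires_grad = False
        self.img_proj = nn.Linear(2048, hid)
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, image, src, tgt_in):
        feat = self.cnn(image).flatten(1)           # (batch, 2048)
        h_img = torch.tanh(self.img_proj(feat))     # visually grounded state
        _, h_txt = self.encoder(self.src_emb(src))  # (1, batch, hid)
        h0 = h_txt + h_img.unsqueeze(0)             # fuse text and image
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h0)
        return self.out(dec_out)                    # target-vocabulary logits
```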