Towards Adaptable and Interactive Image Captioning with Data Augmentation and Episodic Memory
Interactive machine learning (IML) is a beneficial learning paradigm in cases
of limited data availability, as human feedback is incrementally integrated
into the training process. In this paper, we present an IML pipeline for image
captioning which allows us to incrementally adapt a pre-trained image
captioning model to a new data distribution based on user input. In order to
incorporate user input into the model, we explore the use of a combination of
simple data augmentation methods to obtain larger data batches for each newly
annotated data instance and implement continual learning methods to prevent
catastrophic forgetting from repeated updates. For our experiments, we split a
domain-specific image captioning dataset, namely VizWiz, into non-overlapping
parts to simulate an incremental input flow for continually adapting the model
to new data. We find that, while data augmentation worsens results even when
relatively small amounts of data are available, episodic memory is an effective
strategy for retaining knowledge from previously seen clusters.
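As a rough illustration of how episodic-memory replay can sit inside such an incremental update loop, here is a minimal sketch. The class name, buffer capacity, reservoir-sampling policy, and the `train_on_batch` fine-tuning hook are all illustrative assumptions, not the paper's implementation:

```python
import random

class EpisodicMemory:
    """Fixed-size buffer of previously seen (image, caption) pairs.

    Replay-based continual learning: at each incremental update, a few
    stored examples are mixed into the batch so repeated fine-tuning
    does not overwrite knowledge from earlier data clusters.
    """

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, example):
        # Reservoir sampling keeps a uniform sample over the stream.
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = example

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))


def incremental_update(model, new_examples, memory, replay_ratio=0.5):
    """One IML step: train on new annotations plus replayed examples."""
    replay = memory.sample(int(len(new_examples) * replay_ratio))
    model.train_on_batch(list(new_examples) + replay)  # hypothetical hook
    for ex in new_examples:
        memory.add(ex)
```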
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural
language explicitly grounded in entities that object detectors find in the
image. Our approach reconciles classical slot filling approaches (that are
generally better grounded in images) with modern neural captioning approaches
(that are generally more natural sounding and accurate). Our approach first
generates a sentence `template' with slot locations explicitly tied to specific
image regions. These slots are then filled in by visual concepts identified in
the regions by object detectors. The entire architecture (sentence template
generation and slot filling with object detectors) is end-to-end
differentiable. We verify the effectiveness of our proposed model on different
image captioning tasks. On standard image captioning and novel object
captioning, our model reaches state-of-the-art performance on both the COCO and Flickr30k
datasets. We also demonstrate that our model has unique advantages when the
train and test distributions of scene compositions -- and hence language priors
of associated captions -- are different. Code has been made available at:
https://github.com/jiasenlu/NeuralBabyTalk
Comment: 12 pages, 7 figures, CVPR 2018
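The template-and-slot idea can be shown with a toy sketch: the generator emits a template whose slot markers point at detected regions, and detector labels fill the slots. In the actual model both stages are learned end-to-end and differentiable; the function names and example regions below are hand-rolled purely for clarity:

```python
def fill_template(template_tokens, region_labels):
    """template_tokens mixes ordinary words with slot markers of the
    form ('SLOT', region_id); each slot is filled by the category
    label a detector assigned to that image region."""
    words = []
    for tok in template_tokens:
        if isinstance(tok, tuple) and tok[0] == "SLOT":
            words.append(region_labels[tok[1]])  # visual word from detector
        else:
            words.append(tok)  # ordinary textual word
    return " ".join(words)

# Detector output: region id -> category label (hypothetical values).
regions = {0: "puppy", 1: "cake"}
template = ["a", ("SLOT", 0), "is", "sitting", "next", "to", "a", ("SLOT", 1)]
print(fill_template(template, regions))
# -> "a puppy is sitting next to a cake"
```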
Guided Open Vocabulary Image Captioning with Constrained Beam Search
Existing image captioning models do not generalize well to out-of-domain
images containing novel scenes or objects. This limitation severely hinders the
use of these models in real world applications dealing with images in the wild.
We address this problem using a flexible approach that enables existing deep
captioning architectures to take advantage of image taggers at test time,
without re-training. Our method uses constrained beam search to force the
inclusion of selected tag words in the output, and fixed, pretrained word
embeddings to facilitate vocabulary expansion to previously unseen tag words.
Using this approach we achieve state-of-the-art results for out-of-domain
captioning on MSCOCO (and improved results for in-domain captioning). Perhaps
surprisingly, our results significantly outperform approaches that incorporate
the same tag predictions into the learning algorithm. We also show that we can
significantly improve the quality of generated ImageNet captions by leveraging
ground-truth labels.
Comment: EMNLP 2017
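A drastically simplified sketch of constraint-aware decoding follows, assuming a hypothetical `step_fn` hook that returns scored next-token continuations from the captioning model. The paper's method tracks constraint satisfaction with a finite-state machine and keeps a separate beam per state; this version merely rejects finished hypotheses that do not cover all required tag words:

```python
import heapq

def constrained_beam_search(step_fn, start, tags, beam=5, max_len=20, eos="</s>"):
    """Beam search that only accepts hypotheses containing every word
    in `tags` (a set). `step_fn(prefix)` is an assumed model hook
    returning a list of (token, logprob) continuations."""
    # Each hypothesis: (cumulative negative logprob, tokens, tags covered).
    beams = [(0.0, [start], frozenset())]
    finished = []
    for _ in range(max_len):
        candidates = []
        for neg, toks, used in beams:
            for tok, lp in step_fn(toks):
                new_used = used | ({tok} & tags)
                if tok == eos:
                    if new_used == tags:  # accept only if all tags included
                        finished.append((neg - lp, toks + [tok]))
                    continue
                candidates.append((neg - lp, toks + [tok], new_used))
        beams = heapq.nsmallest(beam, candidates, key=lambda c: c[0])
        if not beams:
            break
    return min(finished, key=lambda f: f[0])[1] if finished else None
```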
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Multimodal tasks in the fashion domain have significant potential for
e-commerce, but involve challenging vision-and-language learning problems,
e.g., retrieving a fashion item given a reference image plus text feedback from
a user. Prior works on multimodal fashion tasks have either been limited by the
data in individual benchmarks, or have leveraged generic vision-and-language
pre-training but have not taken advantage of the characteristics of fashion
data. Additionally, these works have mainly been restricted to multimodal
understanding tasks. To address these gaps, we make two key contributions.
First, we propose a novel fashion-specific pre-training framework based on
weakly-supervised triplets constructed from fashion image-text pairs. We show
the triplet-based tasks are an effective addition to standard multimodal
pre-training tasks. Second, we propose a flexible decoder-based model
architecture capable of both fashion retrieval and captioning tasks. Together,
our model design and pre-training approach are competitive on a diverse set of
fashion tasks, including cross-modal retrieval, image retrieval with text
feedback, image captioning, relative image captioning, and multimodal
categorization.
Comment: 14 pages, 4 figures. To appear at the Conference on Empirical Methods
in Natural Language Processing (EMNLP) 2022
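As a rough illustration of the kind of triplet-based objective such pre-training adds, here is a generic triplet margin loss over L2-normalized embeddings in PyTorch. The paper's actual contribution, the weakly-supervised construction of triplets from fashion image-text pairs, is not reproduced here:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on L2-normalized embeddings:
    push the anchor closer to the positive than to the negative
    by at least `margin` in cosine similarity."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    pos_sim = (a * p).sum(-1)   # cosine similarity to the positive
    neg_sim = (a * n).sum(-1)   # cosine similarity to the negative
    return F.relu(margin - pos_sim + neg_sim).mean()

# Toy usage with random tensors standing in for encoder outputs.
a, p, n = (torch.randn(8, 256) for _ in range(3))
print(triplet_loss(a, p, n).item())
```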
Domain Adaptation for Neural Networks by Parameter Augmentation
We propose a simple domain adaptation method for neural networks in a
supervised setting. Supervised domain adaptation is a way of improving the
generalization performance on the target domain by using the source domain
dataset, assuming that both of the datasets are labeled. Recently, recurrent
neural networks have been shown to be successful on a variety of NLP tasks such
as caption generation; however, existing domain adaptation techniques are
limited to (1) tuning the model parameters on the target dataset after
training on the source dataset, or (2) designing the network with dual outputs,
one for the source domain and the other for the target domain. Reformulating
the idea of the domain adaptation technique proposed by Daume (2007), we
propose a simple domain adaptation method, which can be applied to neural
networks trained with a cross-entropy loss. On captioning datasets, we show
performance improvements over other domain adaptation methods.
Comment: 9 pages. To appear in the first ACL Workshop on Representation
Learning for NLP
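For reference, Daume's (2007) "frustratingly easy" feature augmentation, the idea the paper reformulates for neural networks, triples each feature vector into shared, source-only, and target-only blocks. A sketch of the original linear-model version (the paper's neural reformulation differs):

```python
import numpy as np

def daume_augment(x, domain):
    """Daume (2007) feature augmentation: source examples become
    [x, x, 0] and target examples [x, 0, x], so a linear model can
    learn shared, source-specific, and target-specific weights."""
    zeros = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zeros], axis=-1)
    return np.concatenate([x, zeros, x], axis=-1)

x = np.ones(4)
print(daume_augment(x, "source"))  # [1 1 1 1 1 1 1 1 0 0 0 0]
print(daume_augment(x, "target"))  # [1 1 1 1 0 0 0 0 1 1 1 1]
```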
Generating image captions with external encyclopedic knowledge
Accurately reporting what objects are depicted in an image is largely a
solved problem in automatic caption generation. The next big challenge on the
way to truly humanlike captioning is being able to incorporate the context of
the image and related real world knowledge. We tackle this challenge by
creating an end-to-end caption generation system that makes extensive use of
image-specific encyclopedic data. Our approach includes a novel way of using
image location to identify relevant open-domain facts in an external knowledge
base, with their subsequent integration into the captioning pipeline at both
the encoding and decoding stages. Our system is trained and tested on a new
dataset with naturally produced knowledge-rich captions, and achieves
significant improvements over multiple baselines. We empirically demonstrate
that our approach is effective for generating contextualized captions with
encyclopedic knowledge that is both factually accurate and relevant to the
image.
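A toy sketch of the location-based retrieval step: filter an assumed list of geotagged facts by great-circle distance to the image's coordinates and keep the nearest few. The paper's retrieval over an open-domain knowledge base, and its integration at the encoding and decoding stages, are considerably richer than this:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def nearby_facts(image_latlon, knowledge_base, radius_km=1.0, k=5):
    """Return up to k fact strings whose subject entity lies near the
    image. `knowledge_base` is assumed to be a list of dicts with
    'latlon' and 'text' keys (an illustrative schema, not the paper's)."""
    lat, lon = image_latlon
    scored = [
        (haversine_km(lat, lon, *fact["latlon"]), fact["text"])
        for fact in knowledge_base
    ]
    return [text for dist, text in sorted(scored)[:k] if dist <= radius_km]
```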