Tag: Automated Image Captioning
Many websites remain non-ADA-compliant, containing images that lack accompanying textual descriptions. This leaves sight-impaired individuals unable to fully enjoy the rich wonders of the web. To address this inequity, our research aims to create an autonomous system capable of generating semantically accurate descriptions of images. This problem involves two tasks: recognizing an image and linguistically describing it. Our solution uses state-of-the-art deep learning: a convolutional neural network that learns to extract an image's salient features, and a recurrent neural network that learns to generate structured, coherent sentences. These two networks are merged into a single model that takes arbitrary images as input and outputs relevant captions. The model's accuracy is quantified using language metrics such as the Bilingual Evaluation Understudy (BLEU), originally designed to rate machine translation systems. After training, we hope to validate our approach by deploying our model on local, online social media feeds.
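Since the abstract names BLEU as its evaluation metric, here is a minimal sketch of scoring one generated caption against reference captions using NLTK's BLEU implementation; the captions themselves are hypothetical placeholders, not taken from the paper.

```python
# Minimal sketch: BLEU score of one generated caption against references,
# via NLTK. All captions below are hypothetical placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on grass".split(),
]
candidate = "a dog is running on the grass".split()

# Smoothing avoids zero scores when a higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```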
Image Captioning with Unseen Objects
Image caption generation is a long-standing and challenging problem at the intersection of computer vision and natural language processing. A number of recently proposed approaches utilize a fully supervised object recognition model within the captioning approach. Such models, however, tend to generate sentences which only consist of objects predicted by the recognition models, excluding instances of classes without labelled training examples. In this paper, we propose a new, challenging scenario that targets the image captioning problem in a fully zero-shot learning setting, where the goal is to generate captions of test images containing objects that are not seen during training. The proposed approach jointly uses a novel zero-shot object detection model and a template-based sentence generator. Our experiments show promising results on the COCO dataset.
Comment: To appear in British Machine Vision Conference (BMVC) 2019
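The paper's templates are not given in the abstract; as a purely illustrative sketch, assuming the zero-shot detector returns (class name, confidence) pairs, a template-based generator might fill a caption like this:

```python
# Hypothetical template-based sentence generator. The detector output
# format and the template are assumptions, not the paper's actual design.
def generate_caption(detections, threshold=0.5):
    """Slot detected object names into a fixed caption template."""
    objects = [name for name, conf in detections if conf >= threshold]
    if not objects:
        return "A photo."
    if len(objects) == 1:
        return f"A photo of a {objects[0]}."
    return f"A photo of a {', a '.join(objects[:-1])} and a {objects[-1]}."

print(generate_caption([("zebra", 0.91), ("car", 0.73)]))
# -> "A photo of a zebra and a car."
```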
ADVISE: Symbolism and External Knowledge for Decoding Advertisements
In order to convey the most content in their limited space, advertisements embed references to outside knowledge via symbolism. For example, a motorcycle stands for adventure (a positive property the ad wants associated with the product being sold), and a gun stands for danger (a negative property to dissuade viewers from undesirable behaviors). We show how to use symbolic references to better understand the meaning of an ad. We further show how anchoring ad understanding in general-purpose object recognition and image captioning improves results. We formulate the ad understanding task as matching the ad image to human-generated statements that describe the action the ad prompts, and the rationale it provides for taking this action. Our proposed method outperforms the state of the art on this task, and on an alternative formulation of question answering on ads. We show additional applications of our learned representations for matching ads to slogans and clustering ads according to their topic, without extra training.
Comment: To appear in Proceedings of the European Conference on Computer Vision (ECCV 2018)
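The matching formulation can be illustrated with a minimal sketch: rank candidate statements by cosine similarity to the ad image in a shared embedding space. The embeddings below are random stand-ins; ADVISE's actual encoders and symbol/knowledge features are not reproduced here.

```python
# Minimal sketch of image-to-statement matching by cosine similarity.
# Random vectors stand in for learned embeddings (an assumption).
import numpy as np

rng = np.random.default_rng(0)
image_emb = rng.normal(size=256)            # embedding of one ad image
statement_embs = rng.normal(size=(5, 256))  # embeddings of 5 candidate statements

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(image_emb, s) for s in statement_embs])
best = int(np.argmax(scores))
print(f"best-matching statement index: {best}, score: {scores[best]:.3f}")
```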
Fooling Vision and Language Models Despite Localization and Attention Mechanism
Adversarial attacks are known to succeed on classifiers, but it has been an open question whether more complex vision systems are vulnerable. In this paper, we study adversarial examples for vision and language models, which incorporate natural language understanding and complex structures such as attention, localization, and modular architectures. In particular, we investigate attacks on a dense captioning model and on two visual question answering (VQA) models. Our evaluation shows that we can generate adversarial examples with a high success rate (i.e., > 90%) for these models. Our work sheds new light on adversarial attacks on vision systems that have a language component, and shows that attention, bounding-box localization, and compositional internal structures are vulnerable to adversarial attacks. These observations will inform future work towards building effective defenses.
Comment: CVPR 2018
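The paper's specific attack on captioning and VQA models is not detailed in the abstract; as a generic illustration of gradient-based adversarial perturbation, here is a minimal FGSM-style sketch against any differentiable model, with all names being placeholders.

```python
# Minimal FGSM-style sketch of a gradient-based adversarial perturbation.
# This is NOT the paper's method; it only illustrates the core idea.
import torch

def fgsm(model, x, target, loss_fn, eps=0.03):
    """Perturb input x to increase the loss on its true target (untargeted)."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), target)
    loss.backward()
    # Step in the direction of the loss gradient's sign; keep pixels valid.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

# Hypothetical usage: x_adv = fgsm(captioner, image, caption_tokens, loss_fn)
```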