ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
We propose a novel attention-based deep learning architecture for the visual
question answering (VQA) task. Given an image and a related natural-language
question, VQA generates a natural-language answer to the question.
Generating the correct answers requires the model's attention to focus on the
regions corresponding to the question, because different questions inquire
about the attributes of different image regions. We introduce an
attention-based configurable convolutional neural network (ABC-CNN) to learn such
question-guided attention. ABC-CNN determines an attention map for an
image-question pair by convolving the image feature map with configurable
convolutional kernels derived from the question's semantics. We evaluate the
ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR,
and the VQA dataset. The ABC-CNN model achieves significant improvements over
state-of-the-art methods on these datasets. The question-guided attention
generated by ABC-CNN is also shown to reflect the regions that are highly
relevant to the questions.
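A minimal sketch of the question-guided attention idea described above, assuming a PyTorch-style setup in which an encoded question vector is projected into the weights of a convolutional kernel that is then applied to the image feature map; all module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Illustrative sketch: derive a conv kernel from the question embedding
    and convolve it with the image feature map to obtain an attention map."""

    def __init__(self, q_dim=512, feat_channels=256, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Project the question vector into a single-output conv kernel.
        self.kernel_proj = nn.Linear(q_dim, feat_channels * kernel_size * kernel_size)

    def forward(self, feat_map, q_emb):
        # feat_map: (B, C, H, W) image features; q_emb: (B, q_dim) question encoding.
        B, C, H, W = feat_map.shape
        kernels = self.kernel_proj(q_emb).view(B, 1, C, self.kernel_size, self.kernel_size)
        attn_maps = []
        for b in range(B):  # per-example kernels, so convolve each item separately
            attn_maps.append(F.conv2d(feat_map[b:b + 1], kernels[b],
                                      padding=self.kernel_size // 2))
        attn = torch.cat(attn_maps, dim=0)                       # (B, 1, H, W)
        attn = F.softmax(attn.view(B, -1), dim=1).view(B, 1, H, W)
        return attn * feat_map                                   # attention-weighted features

# Example: a 14x14 feature map attended under a question embedding.
feats = torch.randn(2, 256, 14, 14)
q = torch.randn(2, 512)
print(QuestionGuidedAttention()(feats, q).shape)  # torch.Size([2, 256, 14, 14])
```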
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Recognising objects according to a pre-defined, fixed set of class labels has
been well studied in computer vision. However, there are many practical
applications where the subjects that may be of interest are not known
beforehand, or are not so easily delineated. In many of these cases, natural
language dialog is a natural way to specify the subject of interest, and the
task of achieving this capability (a.k.a. Referring Expression Comprehension) has
recently attracted attention. To this end we propose a unified framework, the
ParalleL AttentioN (PLAN) network, to discover the object in an image that is
being referred to in natural-language expressions of variable length, from
short phrase queries to long multi-round dialogs. The PLAN network has two
attention mechanisms that relate parts of the expressions to both the global
visual content and also directly to object candidates. Furthermore, the
attention mechanisms are recurrent, making the referring process visualizable
and explainable. The attended information from these dual sources is combined
to reason about the referred object. These two attention mechanisms can be
trained in parallel, and we find that the combined system outperforms the
state of the art on several benchmark datasets with language inputs of
different lengths, such as RefCOCO, RefCOCO+, and GuessWhat?!.
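A minimal sketch of the dual-attention idea, assuming a PyTorch-style setup: one attention stream attends over the global image feature grid and another over object-candidate features, both conditioned on a recurrent encoding of the expression, and the two attended summaries are fused to score each candidate. Names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelAttentionSketch(nn.Module):
    """Two attention streams, over (a) the global feature grid and (b) object
    candidates, conditioned on an expression encoding, fused to score candidates."""

    def __init__(self, dim=512):
        super().__init__()
        self.expr_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.global_attn = nn.Linear(2 * dim, 1)   # scores each grid cell
        self.cand_attn = nn.Linear(2 * dim, 1)     # scores each candidate box
        self.score = nn.Linear(2 * dim, 1)         # final per-candidate score

    def forward(self, expr_emb, grid_feats, cand_feats):
        # expr_emb: (B, T, D) word embeddings; grid_feats: (B, G, D); cand_feats: (B, K, D)
        _, (h, _) = self.expr_rnn(expr_emb)
        q = h[-1]                                              # (B, D) expression encoding

        def attend(keys, proj):
            qk = torch.cat([q.unsqueeze(1).expand(-1, keys.size(1), -1), keys], dim=-1)
            w = F.softmax(proj(qk).squeeze(-1), dim=1)         # attention weights (B, N)
            return torch.bmm(w.unsqueeze(1), keys).squeeze(1)  # attended summary (B, D)

        g_ctx = attend(grid_feats, self.global_attn)           # global visual context
        c_ctx = attend(cand_feats, self.cand_attn)             # candidate-level context
        fused = g_ctx + c_ctx                                  # combine the dual sources

        per_cand = torch.cat([fused.unsqueeze(1).expand(-1, cand_feats.size(1), -1),
                              cand_feats], dim=-1)
        return self.score(per_cand).squeeze(-1)                # (B, K) candidate scores

# Example: score 5 candidate boxes for a 7-word expression over a 7x7 grid.
model = ParallelAttentionSketch()
scores = model(torch.randn(2, 7, 512), torch.randn(2, 49, 512), torch.randn(2, 5, 512))
print(scores.shape)  # torch.Size([2, 5])
```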
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
As general purpose vision models get increasingly effective at a wide set of
tasks, it is imperative that they be consistent across the tasks they support.
Inconsistent AI models are considered brittle and untrustworthy by human users
and are more challenging to incorporate into larger systems that take
dependencies on their outputs. Measuring consistency between very heterogeneous
tasks that might include outputs in different modalities is challenging since
it is difficult to determine if the predictions are consistent with one
another. As a solution, we introduce a benchmark dataset, COCOCON, in which we
create contrast sets by modifying test instances for multiple tasks in small
but semantically meaningful ways that change the gold label, and we outline
metrics for measuring whether a model is consistent by how it ranks the original
and perturbed instances across tasks. We find that state-of-the-art systems suffer from a
surprisingly high degree of inconsistent behavior across tasks, especially for
more heterogeneous tasks. Finally, we propose using a rank correlation-based
auxiliary objective computed over large automatically created cross-task
contrast sets to improve the multi-task consistency of large unified models,
while retaining their original accuracy on downstream tasks. Project website
available at https://adymaharana.github.io/cococon/.
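A minimal sketch of one way such a ranking-based consistency check could work, assuming each task head exposes a scalar score (e.g., a log-likelihood) for an instance. The per-pair agreement statistic and the Spearman correlation of score margins across task pairs are illustrative choices, not the paper's exact metric.

```python
import numpy as np
from scipy.stats import spearmanr

def cross_task_consistency(orig_scores, pert_scores):
    """orig_scores, pert_scores: arrays of shape (n_tasks, n_pairs) holding each
    task head's score for the original and the perturbed side of a contrast pair.
    Returns (a) how unanimously tasks rank the original above the perturbed
    instance and (b) the mean Spearman correlation of score margins across tasks."""
    prefers_orig = orig_scores > pert_scores          # (n_tasks, n_pairs) booleans
    frac = prefers_orig.mean(axis=0)                  # per-pair fraction preferring original
    agreement = float(np.mean(np.maximum(frac, 1 - frac)))

    margins = orig_scores - pert_scores               # per-task margin on each pair
    corrs = []
    for i in range(margins.shape[0]):
        for j in range(i + 1, margins.shape[0]):
            rho, _ = spearmanr(margins[i], margins[j])
            corrs.append(rho)
    return agreement, float(np.mean(corrs))

# Example with 3 task heads and 4 contrast pairs (scores are made up).
orig = np.array([[0.9, 0.8, 0.7, 0.6],
                 [0.5, 0.9, 0.4, 0.8],
                 [0.7, 0.6, 0.8, 0.5]])
pert = np.array([[0.4, 0.9, 0.5, 0.3],
                 [0.6, 0.7, 0.2, 0.9],
                 [0.3, 0.7, 0.6, 0.4]])
print(cross_task_consistency(orig, pert))
```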
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
We introduce the Qwen-VL series, a set of large-scale vision-language models
designed to perceive and understand both text and images. Comprising Qwen-VL
and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like
image captioning, question answering, visual localization, and flexible
interaction. The evaluation covers a wide range of tasks including zero-shot
captioning, visual or document visual question answering, and grounding. We
demonstrate that Qwen-VL outperforms existing Large Vision-Language Models
(LVLMs). We present their architecture, training, capabilities, and
performance, highlighting their contributions to advancing multimodal
artificial intelligence. Code, demo and models are available at
https://github.com/QwenLM/Qwen-VL.