6 research outputs found
Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense
Visual commonsense understanding requires Vision-Language (VL) models not only
to understand the image and the text but also to cross-reference between them
to fully comprehend the described visual scene. Recently,
various approaches have been developed and have achieved high performance on
visual commonsense benchmarks. However, it is unclear whether the models really
understand the visual scene and underlying commonsense knowledge due to limited
evaluation data resources. To provide an in-depth analysis, we present a
Multimodal Evaluation (ME) pipeline to automatically generate question-answer
pairs to test models' understanding of the visual scene, text, and related
knowledge. We then take a step further to show that training with the ME data
boosts the model's performance in standard VCR evaluation. Lastly, our in-depth
analysis and comparison reveal interesting findings: (1) semantically low-level
information can assist the learning of high-level information but not the
opposite; (2) visual information is generally underutilized compared with text.
Comment: Accepted to EMNLP 2022 (Long Paper)
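As a rough illustration of what such an automatic evaluation pipeline might look like, here is a hypothetical Python sketch that turns structured scene annotations into probe question-answer pairs at different semantic levels. The templates and field names below are illustrative assumptions, not the paper's actual generation rules.

```python
# Hypothetical sketch: generate probe QA pairs from VCR-style scene annotations.
# Low-level probes target object identity; high-level probes target relations
# that require commonsense to interpret. Field names are illustrative only.

def generate_probe_qa(scene):
    """Yield (question, answer, level) probes from a dict of scene annotations."""
    # Low-level probes: object identity visible in the image.
    for obj in scene.get("objects", []):
        yield (f"Is there a {obj['label']} in the image?", "yes", "low")
    # High-level probes: relations/actions that require commonsense knowledge.
    for rel in scene.get("relations", []):
        question = f"What is {rel['subject']} doing with respect to {rel['object']}?"
        yield (question, rel["predicate"], "high")

# Example usage with a toy annotation:
scene = {
    "objects": [{"label": "dog"}, {"label": "frisbee"}],
    "relations": [{"subject": "the dog", "predicate": "chasing", "object": "the frisbee"}],
}
for question, answer, level in generate_probe_qa(scene):
    print(level, "|", question, "->", answer)
```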
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
The field of vision-and-language (VL) understanding has made unprecedented
progress with end-to-end large pre-trained VL models (VLMs). However, they
still fall short on zero-shot reasoning tasks that require multi-step
inference. To enable such reasoning, previous works resort to a
divide-and-conquer pipeline. In this paper, we argue that previous efforts have
several inherent shortcomings: 1) They rely on domain-specific sub-question
decomposing models. 2) They force models to predict the final answer even if
the sub-questions or sub-answers provide insufficient information. We address
these limitations via IdealGPT, a framework that iteratively decomposes VL
reasoning using large language models (LLMs). Specifically, IdealGPT utilizes
an LLM to generate sub-questions, a VLM to provide corresponding sub-answers,
and another LLM to reason to achieve the final answer. These three modules
perform the divide-and-conquer procedure iteratively until the model is
confident about the final answer to the main question. We evaluate IdealGPT on
multiple challenging VL reasoning tasks under a zero-shot setting. In
particular, our IdealGPT outperforms the best existing GPT-4-like models by an
absolute 10% on VCR and 15% on SNLI-VE. Code is available at
https://github.com/Hxyou/IdealGPT
Comment: 13 pages, 5 figures
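To make the iterative divide-and-conquer procedure concrete, here is a minimal Python sketch of the loop as described in the abstract: an LLM proposes sub-questions, a VLM answers them against the image, and a second LLM call reasons over the accumulated evidence until it is confident. The callables and their signatures are illustrative assumptions, not the API of the released code at https://github.com/Hxyou/IdealGPT.

```python
from typing import Callable, List, Tuple

Evidence = List[Tuple[str, str]]  # accumulated (sub-question, sub-answer) pairs

def idealgpt_loop(
    main_question: str,
    propose_subquestions: Callable[[str, Evidence], List[str]],  # LLM call (assumed)
    answer_subquestion: Callable[[str], str],                    # VLM call on the image (assumed)
    reason: Callable[[str, Evidence], Tuple[str, bool]],         # LLM call -> (answer, confident) (assumed)
    max_rounds: int = 4,
) -> str:
    """Iterate decompose -> answer -> reason until the reasoner LLM is confident."""
    evidence: Evidence = []
    answer = ""
    for _ in range(max_rounds):
        # 1) An LLM proposes sub-questions, conditioned on evidence gathered so far.
        for sq in propose_subquestions(main_question, evidence):
            # 2) A VLM grounds each sub-question in the image.
            evidence.append((sq, answer_subquestion(sq)))
        # 3) A second LLM call reasons over all sub-answers collected so far.
        answer, confident = reason(main_question, evidence)
        if confident:
            break
    return answer
```

The loop terminates either when the reasoner signals confidence or after a fixed number of rounds, matching the abstract's description of iterating "until the model is confident about the final answer."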
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
Contrastive language-image pretraining (CLIP) links vision and language
modalities into a unified embedding space, yielding tremendous potential
for vision-language (VL) tasks. While early concurrent works have begun to
study this potential on a subset of tasks, important questions remain: 1) What
is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in
low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches
without impacting inference or pretraining complexity? In this work, we seek to
answer these questions through two key contributions. First, we introduce an
evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual
Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of
data availability constraints and conditions of domain shift. Second, we
propose an approach, named CLIP Targeted Distillation (CLIP-TD), to
intelligently distill knowledge from CLIP into existing architectures using a
dynamically weighted objective applied to adaptively selected tokens per
instance. Experiments demonstrate that our proposed CLIP-TD leads to
exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to
71.3%) conditions of VCR, while simultaneously improving performance under
standard fully-supervised conditions (up to 2%), achieving state-of-the-art
performance on VCR compared to other single models that are pretrained with
image-text data only. On SNLI-VE, CLIP-TD produces significant gains in
low-shot conditions (up to 6.6%) as well as fully-supervised conditions (up to
3%). On VQA, CLIP-TD provides improvements in low-shot (up to 9%) and
fully-supervised (up to 1.3%) settings. Finally, CLIP-TD outperforms concurrent works
utilizing CLIP for finetuning, as well as baseline naive distillation
approaches. Code will be made available.
Comment: This paper has been substantially revised and re-submitted to another
conference under the title "Multimodal Adaptive Distillation for Leveraging
Unimodal Encoders for Vision-Language Tasks",
https://doi.org/10.48550/arXiv.2204.1049
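To illustrate the kind of distillation objective described above (a dynamically weighted loss applied to adaptively selected tokens per instance), here is a minimal PyTorch sketch. It assumes the student VL model and a frozen CLIP teacher expose per-token features of the same dimensionality; the function and tensor names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def targeted_distillation_loss(
    student_feats: torch.Tensor,   # (batch, seq_len, dim) per-token features from the student
    teacher_feats: torch.Tensor,   # (batch, seq_len, dim) per-token features from frozen CLIP
    token_weights: torch.Tensor,   # (batch, seq_len) dynamic per-token weights in [0, 1]
) -> torch.Tensor:
    """Weighted feature-matching loss: only the selected tokens contribute strongly."""
    # Cosine distance between student and teacher embeddings for each token.
    per_token = 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1)
    # Adaptively weight each token, then normalize by the total weight mass.
    weighted = per_token * token_weights
    return weighted.sum() / token_weights.sum().clamp_min(1e-6)
```

In this sketch the distillation term would be added to the task loss (e.g., VCR answer classification) during training only, which is consistent with the abstract's claim that CLIP's knowledge is transferred without impacting inference complexity.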