In this work, we explore the synergy between pre-trained vision-and-language models (VLMs) and large language models (LLMs) for visual commonsense reasoning (VCR). We categorize VCR problems into two types: visual commonsense understanding (VCU) and visual commonsense inference (VCI). On VCU, which requires perceiving the literal visual content, pre-trained VLMs exhibit strong cross-dataset generalization. On VCI, in contrast, where the goal is to infer conclusions beyond the literal image content, VLMs struggle.
We find that a simple baseline, in which VLMs pass their perception results (image captions) to LLMs, already improves performance on VCI. However, the VLMs' perception here is passive and often misses crucial contextual information, leading LLMs to incorrect or uncertain reasoning. To mitigate this issue, we propose a collaborative approach in which LLMs, when uncertain about their reasoning, actively direct VLMs to focus on and gather the visual information relevant to candidate commonsense inferences.
In our method, named ViCor, pre-trained LLMs act as problem classifiers that identify the problem type, as VLM commanders that invoke VLMs differently depending on that type, and as visual commonsense reasoners that answer the question, while the VLMs perform visual recognition and understanding. We evaluate our framework on two VCR benchmark datasets, where it outperforms all other methods that do not require in-domain supervised fine-tuning.
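
For concreteness, the collaboration loop can be sketched as follows. This is an illustrative outline only: the `llm` and `vlm` callables, the prompts, and the `UNCERTAIN:` protocol are hypothetical stand-ins introduced for exposition, not our exact prompts or interface.

```python
# Illustrative sketch of the ViCor loop; `llm` and `vlm` are
# hypothetical callables wrapping any pre-trained LLM / VLM.

def vicor(question: str, image, llm, vlm, max_rounds: int = 3) -> str:
    # LLM as problem classifier: literal understanding (VCU) vs.
    # inference beyond the image content (VCI).
    category = llm(f"Classify this question as VCU or VCI: {question}")

    if "VCU" in category:
        # Literal visual understanding: the VLM answers directly.
        return vlm(image, question)

    # Passive perception: start from a generic caption.
    evidence = [vlm(image, "Describe this image.")]

    for _ in range(max_rounds):
        # LLM as visual commonsense reasoner over the current evidence.
        reply = llm(
            f"Question: {question}\nVisual evidence: {evidence}\n"
            "Answer the question, or reply 'UNCERTAIN: <visual query>' "
            "if key visual information is missing."
        )
        if not reply.startswith("UNCERTAIN:"):
            return reply
        # LLM as VLM commander: direct the VLM toward the missing
        # visual element and add the result to the evidence pool.
        visual_query = reply[len("UNCERTAIN:"):].strip()
        evidence.append(vlm(image, visual_query))

    # Round budget exhausted: return the LLM's best guess.
    return llm(f"Question: {question}\nEvidence: {evidence}\nBest answer:")
```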