In this work, we explore the synergy between pre-trained vision-and-language models (VLMs) and large language models (LLMs) for visual commonsense reasoning (VCR). We categorize VCR problems into two types: visual commonsense understanding (VCU) and visual commonsense inference (VCI). On VCU, which requires perceiving the literal visual content, pre-trained VLMs exhibit strong cross-dataset generalization. On VCI, in contrast, where the goal is to infer conclusions beyond the literal image content, VLMs struggle.
We find that a simple baseline, in which VLMs pass their perception results (image captions) to LLMs, already improves performance on VCI. However, the VLMs' perception here is passive and often misses crucial contextual information, leading LLMs to incorrect or uncertain reasoning. To mitigate this issue, we propose a collaborative approach in which LLMs, when uncertain about their reasoning, actively direct VLMs to focus on and gather the visual information relevant to candidate commonsense inferences.
In our method, named ViCor, pre-trained LLMs act as problem classifiers that identify the problem type, as VLM commanders that invoke VLMs differently depending on that type, and as visual commonsense reasoners that answer the question, while the VLMs perform visual recognition and understanding. We evaluate our framework on two VCR benchmark datasets, where it outperforms all other methods that do not require in-domain supervised fine-tuning.
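
For concreteness, the collaboration loop can be sketched as follows. This is an illustrative outline only: the `llm` and `vlm` callables, the prompts, and the `UNCERTAIN:` protocol are hypothetical stand-ins introduced for exposition, not our exact prompts or interface.

```python
# Illustrative sketch of the ViCor loop; `llm` and `vlm` are
# hypothetical callables wrapping any pre-trained LLM / VLM.

def vicor(question: str, image, llm, vlm, max_rounds: int = 3) -> str:
    # LLM as problem classifier: literal understanding (VCU) vs.
    # inference beyond the image content (VCI).
    category = llm(f"Classify this question as VCU or VCI: {question}")

    if "VCU" in category:
        # Literal visual understanding: the VLM answers directly.
        return vlm(image, question)

    # Passive perception: start from a generic caption.
    evidence = [vlm(image, "Describe this image.")]

    for _ in range(max_rounds):
        # LLM as visual commonsense reasoner over the current evidence.
        reply = llm(
            f"Question: {question}\nVisual evidence: {evidence}\n"
            "Answer the question, or reply 'UNCERTAIN: <visual query>' "
            "if key visual information is missing."
        )
        if not reply.startswith("UNCERTAIN:"):
            return reply
        # LLM as VLM commander: direct the VLM toward the missing
        # visual element and add the result to the evidence pool.
        visual_query = reply[len("UNCERTAIN:"):].strip()
        evidence.append(vlm(image, visual_query))

    # Round budget exhausted: return the LLM's best guess.
    return llm(f"Question: {question}\nEvidence: {evidence}\nBest answer:")
```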