Visual reasoning requires multimodal perception and commonsense cognition of
the world. Recently, multiple vision-language models (VLMs) have been proposed
with excellent commonsense reasoning ability in various domains. However, how
to harness the collective power of these complementary VLMs is rarely explored.
Existing methods such as ensembling still struggle to aggregate these models with
the desired higher-order communication. In this work, we propose Cola, a novel
paradigm that coordinates multiple VLMs for visual reasoning. Our key insight
is that a large language model (LLM) can efficiently coordinate multiple VLMs
by facilitating natural language communication that leverages their distinct
and complementary capabilities. Extensive experiments demonstrate that our
instruction tuning variant, Cola-FT, achieves state-of-the-art performance on
visual question answering (VQA), outside knowledge VQA, visual entailment, and
visual spatial reasoning tasks. Moreover, we show that our in-context learning
variant, Cola-Zero, exhibits competitive performance in zero-shot and few-shot
settings, without finetuning. Through systematic ablation studies and
visualizations, we validate that a coordinator LLM indeed comprehends the
instruction prompts as well as the separate functionalities of VLMs; it then
coordinates them to enable impressive visual reasoning capabilities.
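
As a rough illustration of the coordination paradigm described above, the following minimal Python sketch shows how a coordinator LLM could aggregate the natural-language outputs of two VLMs. This is not the authors' released code; vlm_a, vlm_b, and llm are hypothetical callables standing in for the vision-language models and the coordinator language model.

    # Minimal sketch of the coordination pattern, assuming placeholder models.
    def coordinate(image, question, vlm_a, vlm_b, llm):
        """Query each VLM independently, then let the LLM aggregate their outputs."""
        # Each VLM describes the image and proposes an answer in natural language.
        caption_a, answer_a = vlm_a(image, question)
        caption_b, answer_b = vlm_b(image, question)

        # The coordinator LLM sees only the natural-language outputs, not the image,
        # and is prompted to weigh the two (possibly conflicting) responses.
        prompt = (
            f"Question: {question}\n"
            f"VLM-A caption: {caption_a} VLM-A answer: {answer_a}\n"
            f"VLM-B caption: {caption_b} VLM-B answer: {answer_b}\n"
            "Considering both models, the final answer is:"
        )
        return llm(prompt)

In this reading, Cola-FT corresponds to instruction-tuning the coordinator LLM on such prompts, while Cola-Zero uses the same prompt format with in-context examples and no finetuning.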