AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
In this paper, we propose AVIS, an autonomous information-seeking visual
question answering framework. Our method leverages a Large Language Model
(LLM) to dynamically plan the use of external tools and to inspect their
outputs, thereby acquiring the knowledge needed to answer the posed
questions. Responding to visual questions that require
external knowledge, such as "What event is commemorated by the building
depicted in this image?", is a complex task. This task presents a combinatorial
search space that demands a sequence of actions, including invoking APIs,
analyzing their responses, and making informed decisions. We conduct a user
study to collect a variety of instances of human decision-making when faced
with this task. This data is then used to design a system composed of three
components: an LLM-powered planner that dynamically determines which tool to
use next, an LLM-powered reasoner that analyzes and extracts key information
from the tool outputs, and a working memory component that retains the acquired
information throughout the process. The collected user behavior serves as a
guide for our system in two key ways. First, we create a transition graph by
analyzing the sequence of decisions made by users. This graph delineates
distinct states and confines the set of actions available at each state.
Second, we use examples of user decision-making to provide our LLM-powered
planner and reasoner with relevant contextual instances, enhancing their
capacity to make informed decisions. We show that AVIS achieves
state-of-the-art results on knowledge-intensive visual question answering
benchmarks such as Infoseek and OK-VQA.
Comment: Published at NeurIPS 2023
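The planner/reasoner/working-memory loop described in this abstract can be sketched roughly as below. This is a minimal sketch, not the paper's implementation: the state names, the contents of the `TRANSITIONS` graph, and the `call_llm`/`run_tool` stubs are hypothetical placeholders standing in for the paper's actual tools and prompts.

```python
# Hypothetical sketch of an AVIS-style planner / reasoner / working-memory loop.
# All tool names and helper functions below are illustrative placeholders.

# Transition graph distilled from the user study: each state lists the
# actions (tools) available from it, confining the planner's choices.
TRANSITIONS = {
    "start": ["image_search", "object_detection"],
    "image_search": ["web_search", "llm_qa"],
    "object_detection": ["image_search", "llm_qa"],
    "web_search": ["llm_qa"],
    "llm_qa": [],  # terminal state: attempt a final answer
}

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (planner or reasoner prompt)."""
    return "llm_qa"  # canned output so the sketch runs end to end

def run_tool(tool: str, question: str, memory: list[str]) -> str:
    """Placeholder for invoking an external tool or API."""
    return f"[output of {tool}]"

def avis(question: str, max_steps: int = 8) -> str:
    state, memory = "start", []  # working memory of extracted facts
    for _ in range(max_steps):
        allowed = TRANSITIONS[state]
        if not allowed:
            break
        # Planner: choose the next tool, constrained to the graph's edges.
        tool = call_llm(
            f"Question: {question}\nMemory: {memory}\n"
            f"Choose one of {allowed}."
        )
        tool = tool if tool in allowed else allowed[0]
        output = run_tool(tool, question, memory)
        # Reasoner: distill the tool output into a fact worth retaining.
        memory.append(call_llm(f"Extract key information: {output}"))
        state = tool
    return call_llm(f"Answer {question!r} using: {memory}")
```

In this reading, the transition graph plays the role the abstract describes: it prunes the combinatorial action space by only exposing the tools that human participants actually invoked from each state.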
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
The field of vision-and-language (VL) understanding has made unprecedented
progress with end-to-end large pre-trained VL models (VLMs). However, they
still fall short in zero-shot reasoning tasks that require multi-step
inference. To enable such reasoning, previous works resort to a
divide-and-conquer pipeline. In this paper, we argue that these efforts have
several inherent shortcomings: 1) They rely on domain-specific sub-question
decomposing models. 2) They force models to predict the final answer even if
the sub-questions or sub-answers provide insufficient information. We address
these limitations via IdealGPT, a framework that iteratively decomposes VL
reasoning using large language models (LLMs). Specifically, IdealGPT utilizes
an LLM to generate sub-questions, a VLM to provide corresponding sub-answers,
and another LLM to reason over the sub-answers and reach the final answer. These three modules
perform the divide-and-conquer procedure iteratively until the model is
confident about the final answer to the main question. We evaluate IdealGPT on
multiple challenging VL reasoning tasks under a zero-shot setting. In
particular, our IdealGPT outperforms the best existing GPT-4-like models by an
absolute 10% on VCR and 15% on SNLI-VE. Code is available at
https://github.com/Hxyou/IdealGPT
Comment: 13 pages, 5 figures
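The iterative divide-and-conquer loop this abstract describes can be sketched as follows. This is a minimal sketch under stated assumptions: the `llm` and `vlm` stubs stand in for real model calls, and the "CONFIDENT"/"UNSURE" reply protocol is a hypothetical stand-in for the paper's actual confidence check.

```python
# Hypothetical sketch of an IdealGPT-style iterative divide-and-conquer loop.
# `llm` and `vlm` are illustrative placeholders, not the paper's models.

def llm(prompt: str) -> str:
    """Placeholder for a large language model call."""
    return "CONFIDENT: yes"  # canned output so the sketch terminates

def vlm(image, question: str) -> str:
    """Placeholder for a vision-language model answering about the image."""
    return "[sub-answer]"

def idealgpt(image, question: str, max_rounds: int = 3) -> str:
    evidence = []  # accumulated (sub-question, sub-answer) pairs
    for _ in range(max_rounds):
        # 1) Divide: an LLM proposes sub-questions about the image.
        subqs = llm(
            f"Decompose into sub-questions: {question}\n"
            f"Already answered: {evidence}"
        ).splitlines()
        # 2) Conquer: a VLM answers each sub-question from the image.
        evidence += [(q, vlm(image, q)) for q in subqs if q.strip()]
        # 3) Reason: another LLM tries to settle the main question.
        verdict = llm(
            f"Question: {question}\nEvidence: {evidence}\n"
            "Reply 'CONFIDENT: <answer>' or 'UNSURE'."
        )
        if verdict.startswith("CONFIDENT:"):
            return verdict.removeprefix("CONFIDENT:").strip()
        # Otherwise iterate: ask finer-grained sub-questions next round.
    return llm(f"Best-effort answer for {question!r} given {evidence}")
```

The loop only commits to an answer once the reasoning LLM signals confidence, which matches the abstract's claim that decomposition repeats until the model is confident about the main question.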