Recent LLM-driven visual agents mainly focus on solving image-based tasks,
which limits their ability to understand dynamic scenes and keeps them far from
real-life applications such as guiding students through laboratory experiments
and identifying their mistakes. Hence, this paper explores DoraemonGPT, a
comprehensive and conceptually elegant system driven by LLMs to understand
dynamic scenes. Since the video modality better reflects the ever-changing
nature of real-world scenarios, we instantiate DoraemonGPT as a video agent.
Given a video and a question/task, DoraemonGPT begins by
converting the input video into a symbolic memory that stores task-related
attributes. This structured representation allows for spatial-temporal querying
and reasoning by well-designed sub-task tools, resulting in concise
intermediate results. Recognizing that LLMs have limited internal knowledge
when it comes to specialized domains (e.g., analyzing the scientific principles
underlying experiments), we incorporate plug-and-play tools to access external
knowledge and address tasks across different domains. Moreover, a novel
LLM-driven planner based on Monte Carlo Tree Search is introduced to explore
the large planning space for scheduling various tools. The planner iteratively
finds feasible solutions by backpropagating each result's reward, and multiple
solutions can be summarized into an improved final answer. We extensively
evaluate DoraemonGPT's effectiveness on three benchmarks and several
in-the-wild scenarios. The code will be released at
https://github.com/z-x-yang/DoraemonGPT.
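
To make the symbolic memory idea concrete, here is a minimal Python sketch. The table schema (frame_time, instance_id, category, action, caption), the example records, and the `when_tool` helper are all hypothetical illustrations, not DoraemonGPT's actual memory or tools; the sketch only shows how task-related attributes stored in a table support spatial-temporal querying by a sub-task tool.

```python
# Minimal sketch of a symbolic video memory (hypothetical schema and data).
# The real memory and sub-task tools are richer; this only illustrates
# storing task-related attributes in a table that supports SQL-style
# spatial-temporal querying.
import sqlite3

# Hypothetical per-instance records extracted from a video by perception models.
records = [
    # (frame_time_s, instance_id, category, action, caption)
    (1.0, 1, "person", "pouring liquid", "a student pours liquid into a beaker"),
    (4.0, 1, "person", "stirring",       "the student stirs the mixture"),
    (4.0, 2, "beaker", "held",           "a beaker containing a blue liquid"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE space_time_memory "
    "(frame_time REAL, instance_id INTEGER, category TEXT, action TEXT, caption TEXT)"
)
conn.executemany("INSERT INTO space_time_memory VALUES (?, ?, ?, ?, ?)", records)

def when_tool(keyword: str) -> list[tuple]:
    """A toy 'when' sub-task tool: find times whose action matches a keyword."""
    return conn.execute(
        "SELECT frame_time, caption FROM space_time_memory WHERE action LIKE ?",
        (f"%{keyword}%",),
    ).fetchall()

print(when_tool("stir"))  # -> [(4.0, 'the student stirs the mixture')]
```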
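
The Monte Carlo Tree Search planning loop can likewise be sketched under stated assumptions: the candidate tool names and the toy reward below are hypothetical stand-ins for LLM-proposed expansions and LLM-based answer scoring, so this is a runnable toy of selection, expansion, and reward backpropagation over tool sequences, not DoraemonGPT's planner.

```python
# A minimal, hypothetical sketch of an MCTS-style planner over tool sequences.
# In the paper's setting an LLM would propose expansions and judge results;
# here a fixed tool set and a toy reward stand in so the loop runs end to end.
import math, random

TOOLS = ["when", "why", "how", "what", "answer"]   # hypothetical tool names

class Node:
    def __init__(self, plan, parent=None):
        self.plan, self.parent = plan, parent       # plan: tuple of tool names
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """Upper-confidence bound used for child selection."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def rollout_reward(plan):
    # Toy reward: prefer short plans ending with "answer" (a stand-in for
    # scoring whether the produced answer is plausible).
    return (1.0 if plan and plan[-1] == "answer" else 0.0) - 0.05 * len(plan)

def mcts(iterations=200, max_depth=4):
    root, solutions = Node(()), []
    for _ in range(iterations):
        node = root
        # Selection: descend by UCB while children exist.
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: add one child per candidate tool until max depth.
        if len(node.plan) < max_depth:
            node.children = [Node(node.plan + (t,), node) for t in TOOLS]
            node = random.choice(node.children)
        # Simulation, then backpropagation of the result's reward to the root.
        reward = rollout_reward(node.plan)
        if reward > 0:
            solutions.append((reward, node.plan))
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Multiple feasible solutions are kept; they could then be summarized
    # into a single improved answer.
    return sorted(set(solutions), reverse=True)[:3]

print(mcts())
```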