Agent AI: Surveying the Horizons of Multimodal Interaction
Multi-modal AI systems will likely become a ubiquitous presence in our
everyday lives. A promising approach to making these systems more interactive
is to embody them as agents within physical and virtual environments. At
present, systems leverage existing foundation models as the basic building
blocks for the creation of embodied agents. Embedding agents within such
environments facilitates the ability of models to process and interpret visual
and contextual data, which is critical for the creation of more sophisticated
and context-aware AI systems. For example, a system that can perceive user
actions, human behavior, environmental objects, audio expressions, and the
collective sentiment of a scene can be used to inform and direct agent
responses within the given environment. To accelerate research on agent-based
multimodal intelligence, we define "Agent AI" as a class of interactive systems
that can perceive visual stimuli, language inputs, and other
environmentally-grounded data, and can produce meaningful embodied actions. In
particular, we explore systems that aim to improve agents based on
next-embodied action prediction by incorporating external knowledge,
multi-sensory inputs, and human feedback. We argue that by developing agentic
AI systems in grounded environments, one can also mitigate the hallucinations
of large foundation models and their tendency to generate environmentally
incorrect outputs. The emerging field of Agent AI subsumes the broader embodied
and agentic aspects of multimodal interactions. Beyond agents acting and
interacting in the physical world, we envision a future where people can easily
create any virtual reality or simulated scene and interact with agents embodied
within the virtual environment.
Large Language Model-based Human-Agent Collaboration for Complex Task Solving
In recent developments within the research community, the integration of
Large Language Models (LLMs) in creating fully autonomous agents has garnered
significant interest. Despite this, LLM-based agents frequently demonstrate
notable shortcomings in adjusting to dynamic environments and fully grasping
human needs. In this work, we introduce the problem of LLM-based human-agent
collaboration for complex task-solving, exploring their synergistic potential.
In addition, we propose a Reinforcement Learning-based Human-Agent
Collaboration method, ReHAC. This approach includes a policy model designed to
determine the most opportune stages for human intervention within the
task-solving process. We construct a human-agent collaboration dataset to train
this policy model in an offline reinforcement learning environment. Our
validation tests confirm the model's effectiveness. The results demonstrate
that the synergistic efforts of humans and LLM-based agents significantly
improve performance in complex tasks, primarily through well-planned, limited
human intervention. Datasets and code are available at:
https://github.com/XueyangFeng/ReHAC
Communicative Agents for Software Development
Software engineering is a domain characterized by intricate decision-making
processes, often relying on nuanced intuition and consultation. Recent
advancements in deep learning have started to revolutionize software
engineering practices through elaborate designs implemented at various stages
of software development. In this paper, we present an innovative paradigm that
leverages large language models (LLMs) throughout the entire software
development process, streamlining and unifying key processes through natural
language communication, thereby eliminating the need for specialized models at
each phase. At the core of this paradigm lies ChatDev, a virtual chat-powered
software development company that mirrors the established waterfall model,
meticulously dividing the development process into four distinct chronological
stages: designing, coding, testing, and documenting. Each stage engages a team
of agents, such as programmers, code reviewers, and test engineers, fostering
collaborative dialogue and facilitating a seamless workflow. The chat chain
acts as a facilitator, breaking down each stage into atomic subtasks. This
enables dual roles, allowing for proposing and validating solutions through
context-aware communication, leading to efficient resolution of specific
subtasks. The instrumental analysis of ChatDev highlights its remarkable
efficacy in software generation, enabling the completion of the entire software
development process in under seven minutes at a cost of less than one dollar.
It not only identifies and alleviates potential vulnerabilities but also
rectifies potential hallucinations while maintaining commendable efficiency and
cost-effectiveness. The potential of ChatDev unveils fresh possibilities for
integrating LLMs into the realm of software development.
Comment: 25 pages, 9 figures, 2 tables
Large Multimodal Agents: A Survey
Large language models (LLMs) have achieved superior performance in powering
text-based AI agents, endowing them with decision-making and reasoning
abilities akin to humans. Concurrently, there is an emerging research trend
focused on extending these LLM-powered AI agents into the multimodal domain.
This extension enables AI agents to interpret and respond to diverse multimodal
user queries, thereby handling more intricate and nuanced tasks. In this paper,
we conduct a systematic review of LLM-driven multimodal agents, which we refer
to as large multimodal agents (LMAs for short). First, we introduce the
essential components involved in developing LMAs and categorize the current
body of research into four distinct types. Subsequently, we review the
collaborative frameworks integrating multiple LMAs, enhancing collective
efficacy. One of the critical challenges in this field is the diverse
evaluation methods used across existing studies, hindering effective comparison
among different LMAs. Therefore, we compile these evaluation methodologies and
establish a comprehensive framework to bridge the gaps. This framework aims to
standardize evaluations, facilitating more meaningful comparisons. Concluding
our review, we highlight the extensive applications of LMAs and propose
possible future research directions. Our discussion aims to provide valuable
insights and guidelines for future research in this rapidly evolving field. An
up-to-date resource list is available at
https://github.com/jun0wanan/awesome-large-multimodal-agents.
Comment: 15 pages, 4 figures