Large Language Models for Robotics: A Survey
The human ability to learn, generalize, and control complex manipulation
tasks through multi-modality feedback suggests a unique capability, which we
refer to as dexterity intelligence. Understanding and assessing this
intelligence is a complex task. Amidst the swift progress and extensive
proliferation of large language models (LLMs), their applications in the field
of robotics have garnered increasing attention. LLMs possess the ability to
process and generate natural language, facilitating efficient interaction and
collaboration with robots. Researchers and engineers in the field of robotics
have recognized the immense potential of LLMs in enhancing robot intelligence,
human-robot interaction, and autonomy. Therefore, this comprehensive review
aims to summarize the applications of LLMs in robotics, delving into their
impact and contributions to key areas such as robot control, perception,
decision-making, and path planning. We first provide an overview of the
background and development of LLMs for robotics, followed by a description of
the benefits of LLMs for robotics and recent advancements in robotics models
based on LLMs. We then delve into the various techniques used in these models,
including those employed in perception, decision-making, control, and
interaction. Finally, we explore the applications of LLMs in robotics and some
potential challenges they may face in the near future. Embodied intelligence is
the future of intelligent science, and LLM-based robotics is one of the
promising but challenging paths to achieve this.
Comment: Preprint. 4 figures, 3 tables
On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective
Our situated environment is full of uncertainty and highly dynamic, thus
hindering the widespread adoption of machine-led Intelligent Decision-Making
(IDM) in real-world scenarios. This means IDM should have the capability of
continuously learning new skills and efficiently generalizing across wider
applications. IDM benefits from any new approaches and theoretical
breakthroughs that exhibit Artificial General Intelligence (AGI) breaking the
barriers between tasks and applications. Recent research has thoroughly examined
the Transformer neural architecture as a backbone foundation model and its
generalization to various tasks, including computer vision, natural language
processing, and reinforcement learning. We therefore argue that a foundation
decision model (FDM) can be established by formulating various decision-making
tasks as a sequence decoding task using the Transformer architecture; this
would be a promising solution to advance the applications of IDM in more
complex real world tasks. In this paper, we elaborate on how a foundation
decision model improves the efficiency and generalization of IDM. We also
discuss potential applications of an FDM in multi-agent game AI, production
scheduling, and robotics tasks. Finally, through a case study, we demonstrate
our realization of the FDM, DigitalBrain (DB1) with 1.2 billion parameters,
which achieves human-level performance over 453 tasks, including text
generation, image captioning, video game playing, robotic control, and traveling
salesman problems. As a foundation decision model, DB1 would be a baby step
towards more autonomous and efficient real-world IDM applications.
Comment: 26 pages, 4 figures
On Building Generalizable Learning Agents
It has been a long-standing goal in Artificial Intelligence (AI) to build machines that can solve the tasks that humans can. Thanks to the recent rapid progress in data-driven methods, which train agents to solve tasks by learning from massive training data, there have been many successes in applying such learning approaches to handle and even solve a number of extremely challenging tasks, including image classification, language generation, robotics control, and several multi-player games. The key factor for all these data-driven successes is that the trained agents can generalize to test scenarios that are unseen during training. This generalization capability is the foundation for building any practical AI system. This thesis studies generalization, the fundamental challenge in AI, and proposes solutions to improve the generalization performance of learning agents in a variety of problems. We start by providing a formal formulation of the generalization problem in the context of reinforcement learning and proposing 4 principles within this formulation to guide the design of training techniques for improved generalization. We validate the effectiveness of our proposed principles by considering 4 different domains, from simple to complex, and developing domain-specific techniques following these principles. In particular, we begin with the simplest domain, i.e., path-finding on graphs (Part I), then consider visual navigation in a 3D world (Part II) and competition in complex multi-agent games (Part III), and lastly tackle some natural language processing tasks (Part IV). Empirical evidence demonstrates that the proposed principles can generally lead to much improved generalization performance in a wide range of problems.
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
A crucial ability of mobile intelligent agents is to integrate the evidence
from multiple sensory inputs in an environment and to make a sequence of
actions to reach their goals. In this paper, we attempt to approach the problem
of Audio-Visual Embodied Navigation, the task of planning the shortest path
from a random starting location in a scene to the sound source in an indoor
environment, given only raw egocentric visual and audio sensory data. To
accomplish this task, the agent is required to learn from various modalities,
i.e., relating the audio signal to the visual environment. Here we describe an
approach to audio-visual embodied navigation that takes advantage of both
visual and audio pieces of evidence. Our solution is based on three key ideas:
a visual perception mapper module that constructs its spatial memory of the
environment, a sound perception module that infers the relative location of the
sound source from the agent, and a dynamic path planner that plans a sequence
of actions based on the audio-visual observations and the spatial memory of the
environment to navigate toward the goal. Experimental results on a newly
collected Visual-Audio-Room dataset using the simulated multi-modal environment
demonstrate the effectiveness of our approach over several competitive
baselines.
Comment: Accepted by ICRA 2020. Project page: http://avn.csail.mit.ed