DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment
Large language models encode a vast amount of semantic knowledge and possess
remarkable understanding and reasoning capabilities. Previous research has
explored how to ground language models in robotic tasks to ensure that the
sequences generated by the language model are both logically correct and
practically executable. However, low-level execution may deviate from the
high-level plan due to environmental perturbations or imperfect controller
design. In this paper, we propose DoReMi, a novel language model grounding
framework that enables immediate Detection and Recovery from Misalignments
between plan and execution. Specifically, LLMs are leveraged for both planning
and generating constraints for planned steps. These constraints can indicate
plan-execution misalignments and we use a vision question answering (VQA) model
to check constraints during low-level skill execution. If certain misalignment
occurs, our method will call the language model to re-plan in order to recover
from misalignments. Experiments on various complex tasks including robot arms
and humanoid robots demonstrate that our method can lead to higher task success
rates and shorter task completion times. Videos of DoReMi are available at
https://sites.google.com/view/doremi-paper.
Comment: 21 pages, 13 figures
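The detect-and-recover loop described above can be sketched as follows. All three helpers (`plan_with_constraints`, `vqa_check`, `execute_step`) are hypothetical stand-ins for the paper's LLM planner, VQA checker, and low-level skills, not its actual interfaces:

```python
# Illustrative sketch of a DoReMi-style detect-and-recover loop.
# All three helper functions are hypothetical stubs, not the paper's interfaces.

def plan_with_constraints(task, history):
    """LLM stub: plan remaining steps, each paired with a checkable constraint."""
    return [("pick up the cup", "the gripper is holding the cup"),
            ("place the cup on the table", "the cup is on the table")]

def vqa_check(observation, constraint):
    """VQA stub: would ask whether the constraint holds in the camera image."""
    return True

def execute_step(step):
    """Low-level skill stub: yields observations during execution."""
    yield {"image": None}

def run(task):
    history = []
    plan = plan_with_constraints(task, history)
    i = 0
    while i < len(plan):
        step, constraint = plan[i]
        for obs in execute_step(step):
            if not vqa_check(obs, constraint):
                # Misalignment detected: record the failure and re-plan.
                history.append((step, "failed: " + constraint))
                plan = plan_with_constraints(task, history)
                i = 0
                break
        else:
            history.append((step, "ok"))
            i += 1
    return history

history = run("set the table")
```

The key structural point is that constraints are checked during skill execution, not only at step boundaries, so a re-plan can be triggered as soon as a misalignment appears.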
Extension of the Control Concept for a Mobile Overhead Manipulator to Whole-Body Impedance Control
At present, robots constitute a central component of contemporary factories. The application of traditional ground-based systems, however, may lead to congested floors with minimal space left for new robots or human workers. Overhead manipulators, on the other hand, aim to occupy the unutilized ceiling space in order to manipulate the workspace located below them. The SwarmRail system is an example of such an overhead manipulator. This concept deploys mobile units driving across a passive rail structure above the ground. Additionally, equipping the mobile units with robotic arms on their bottom side enables this design to provide continuous overhead manipulation while in motion. Although a first demonstrator confirmed the functional capability of the system, the current hardware suffers from complications while traversing rail crossings. Due to unevenness between consecutive rails, these crossing points cause the robot's wheels to collide with the rail segment the robot is driving towards. Additionally, the robot experiences an undesired sudden altitude change.
In this thesis, we aim to implement a hierarchical whole-body impedance tracking controller for the robots employed within the SwarmRail system. Our controller combines a kinematically controlled mobile unit with the impedance-based control of a robotic arm through an admittance interface. The focus of this thesis is on the controller's robustness against the previously mentioned external disturbances. The performance of this controller is validated in a simulation that incorporates the aforementioned complications. Our findings suggest that the control strategy presented in this thesis provides a foundation for the development of a controller applicable to the physical demonstrator.
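A Cartesian impedance law of the kind such controllers build on renders the end effector as a spring-damper around a desired pose. The gains and dimensions below are illustrative values for the sketch, not the thesis' tuning:

```python
import numpy as np

# Minimal Cartesian impedance law: the commanded wrench comes from a virtual
# spring-damper between the actual and desired end-effector motion.
# Gains are illustrative, not taken from the thesis.

def impedance_wrench(x, x_des, xd, xd_des, K, D):
    """Commanded force from stiffness K and damping D on the pose/velocity error."""
    return K @ (x_des - x) + D @ (xd_des - xd)

K = np.diag([500.0, 500.0, 500.0])   # translational stiffness [N/m]
D = np.diag([40.0, 40.0, 40.0])      # damping [N s/m]
x = np.array([0.0, 0.0, 1.0])        # actual end-effector position [m]
x_des = np.array([0.0, 0.0, 1.1])    # desired position, 10 cm higher

f = impedance_wrench(x, x_des, np.zeros(3), np.zeros(3), K, D)
# 10 cm error at 500 N/m yields a 50 N restoring force along z
```

Because the commanded force degrades gracefully with tracking error, such a law tolerates disturbances like the sudden altitude changes at rail crossings better than a stiff position controller would.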
A Survey of Imitation Learning: Algorithms, Recent Developments, and Challenges
In recent years, the development of robotics and artificial intelligence (AI)
systems has been nothing short of remarkable. As these systems continue to
evolve, they are being utilized in increasingly complex and unstructured
environments, such as autonomous driving, aerial robotics, and natural language
processing. As a consequence, programming their behaviors manually or defining
their behavior through reward functions (as done in reinforcement learning
(RL)) has become exceedingly difficult. This is because such environments
require a high degree of flexibility and adaptability, making it challenging to
specify an optimal set of rules or reward signals that can account for all
possible situations. In such environments, learning from an expert's behavior
through imitation is often more appealing. This is where imitation learning
(IL) comes into play - a process where desired behavior is learned by imitating
an expert's behavior, which is provided through demonstrations.
This paper aims to provide an introduction to IL and an overview of its
underlying assumptions and approaches. It also offers a detailed description of
recent advances and emerging areas of research in the field. Additionally, the
paper discusses how researchers have addressed common challenges associated
with IL and provides potential directions for future research. Overall, the
goal of the paper is to provide a comprehensive guide to the growing field of
IL in robotics and AI.
Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
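The supervised-learning core of imitation learning is easy to make concrete. Below is behavioral cloning in its simplest form, fitting a linear policy to synthetic expert demonstrations by least squares; the data and the linear policy class are illustrative only:

```python
import numpy as np

# Behavioral cloning sketch: treat expert (state, action) pairs as a
# supervised dataset and fit policy parameters by regression.
# The "expert" here is a synthetic linear map so the example is self-contained.

rng = np.random.default_rng(0)
states = rng.normal(size=(200, 4))        # expert-visited states
W_expert = rng.normal(size=(4, 2))        # unknown expert "policy" parameters
actions = states @ W_expert               # expert demonstrations

# Fit W_hat by minimizing ||states @ W - actions||^2 (ordinary least squares).
W_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)

policy = lambda s: s @ W_hat              # learned policy
```

In practice the policy is a neural network, and compounding errors under distribution shift motivate the interactive variants (e.g. DAgger-style data aggregation) that surveys of IL typically cover alongside plain behavioral cloning.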
Reconstruction and Synthesis of Human-Scene Interaction
In this thesis, we argue that the 3D scene is vital for understanding, reconstructing, and synthesizing human motion. We present several approaches which take the scene into consideration in reconstructing and synthesizing Human-Scene Interaction (HSI). We first observe that state-of-the-art pose estimation methods ignore the 3D scene and hence reconstruct poses that are inconsistent with the scene. We address this by proposing a pose estimation method that takes the 3D scene explicitly into account. We call our method PROX for Proximal Relationships with Object eXclusion. We leverage the data generated using PROX and build a method to automatically place 3D scans of people with clothing in scenes. The core novelty of our method is encoding the proximal relationships between the human and the scene in a novel HSI model, called POSA for Pose with prOximitieS and contActs. POSA is limited to static HSI, however. We propose a real-time method for synthesizing dynamic HSI, which we call SAMP for Scene-Aware Motion Prediction. SAMP enables virtual humans to navigate cluttered indoor scenes and naturally interact with objects. Data-driven kinematic models, like SAMP, can produce high-quality motion when applied in environments similar to those shown in the dataset. However, when applied to new scenarios, kinematic models can struggle to generate realistic behaviors that respect scene constraints. In contrast, we present InterPhys, which uses adversarial imitation learning and reinforcement learning to train physically-simulated characters that perform scene interaction tasks in a physical and life-like manner.
METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
Unsupervised pre-training strategies have proven to be highly effective in
natural language processing and computer vision. Likewise, unsupervised
reinforcement learning (RL) holds the promise of discovering a variety of
potentially useful behaviors that can accelerate the learning of a wide array
of downstream tasks. Previous unsupervised RL approaches have mainly focused on
pure exploration and mutual information skill learning. However, despite the
previous attempts, making unsupervised RL truly scalable still remains a major
open challenge: pure exploration approaches might struggle in complex
environments with large state spaces, where covering every possible transition
is infeasible, and mutual information skill learning approaches might
completely fail to explore the environment due to the lack of incentives. To
make unsupervised RL scalable to complex, high-dimensional environments, we
propose a novel unsupervised RL objective, which we call Metric-Aware
Abstraction (METRA). Our main idea is, instead of directly covering the entire
state space, to only cover a compact latent space that is metrically
connected to the state space by temporal distances. By learning to move in
every direction in the latent space, METRA obtains a tractable set of diverse
behaviors that approximately cover the state space, being scalable to
high-dimensional environments. Through our experiments in five locomotion and
manipulation environments, we demonstrate that METRA can discover a variety of
useful behaviors even in complex, pixel-based environments, being the first
unsupervised RL method that discovers diverse locomotion behaviors in
pixel-based Quadruped and Humanoid. Our code and videos are available at
https://seohong.me/projects/metra
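The core idea lends itself to a compact sketch: a latent map phi is rewarded for moving along a skill vector z in latent space, while temporally adjacent states are constrained to stay within unit latent distance (the temporal-distance connection). The linear `phi` below is a toy stand-in, not the paper's learned network:

```python
import numpy as np

# Toy illustration of the METRA-style objective: maximize latent displacement
# along a skill vector z, subject to adjacent states staying within unit
# distance in latent space. `phi` is a hypothetical linear map, for
# illustration only.

def metra_reward(phi, s, s_next, z):
    """Per-transition reward: latent displacement projected onto skill z."""
    return (phi(s_next) - phi(s)) @ z

def temporal_penalty(phi, s, s_next):
    """Violation of ||phi(s) - phi(s')|| <= 1 for temporally adjacent states."""
    return max(0.0, np.linalg.norm(phi(s) - phi(s_next)) - 1.0)

A = np.eye(3)[:2]                       # toy linear phi: R^3 -> R^2
phi = lambda s: A @ s
z = np.array([1.0, 0.0])                # unit skill vector in latent space

s, s_next = np.zeros(3), np.array([0.5, 0.0, 0.0])
r = metra_reward(phi, s, s_next, z)     # rewarded for moving along z
pen = temporal_penalty(phi, s, s_next)  # zero: transition respects the constraint
```

Because the constraint ties latent distances to temporal distances rather than raw state distances, covering the compact latent space yields behaviors that spread out in the environment even when the raw state space (e.g. pixels) is enormous.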
Probabilistic Inference for Model Based Control
Robotic systems are essential for enhancing productivity, automation, and performing hazardous tasks. Addressing the unpredictability of physical systems, this thesis advances robotic planning and control under uncertainty, introducing learning-based methods for managing uncertain parameters and adapting to changing environments in real-time.
Our first contribution is a framework using Bayesian statistics for likelihood-free inference of model parameters. This allows employing complex simulators for designing efficient, robust controllers. The method, integrating the unscented transform with a variant of information theoretical model predictive control, shows better performance in trajectory evaluation compared to Monte Carlo sampling, easing the computational load in various control and robotics tasks.
Next, we reframe robotic planning and control as a Bayesian inference problem, focusing on the posterior distribution of actions and model parameters. An implicit variational inference algorithm, performing Stein Variational Gradient Descent, estimates distributions over model parameters and control inputs in real-time. This Bayesian approach effectively handles complex multi-modal posterior distributions, vital for dynamic and realistic robot navigation.
Finally, we tackle diversity in high-dimensional spaces. Our approach mitigates underestimation of uncertainty in posterior distributions, which leads to locally optimal solutions. Using the theory of rough paths, we develop an algorithm for parallel trajectory optimisation, enhancing solution diversity and avoiding mode collapse. This method extends our variational inference approach for trajectory estimation, employing diversity-enhancing kernels and leveraging the path signature representation of trajectories. Empirical tests, ranging from 2-D navigation to robotic manipulators in cluttered environments, affirm our method's efficiency, outperforming existing alternatives.
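The Stein Variational Gradient Descent machinery underlying the second contribution can be illustrated on a toy 1-D target. The kernel bandwidth and step size below are arbitrary choices for the sketch, not the thesis' settings:

```python
import numpy as np

# One SVGD update: particles follow the target's score, smoothed by a kernel,
# plus a kernel-gradient term that pushes particles apart (avoiding collapse).

def svgd_step(x, grad_logp, h=1.0, eps=0.1):
    """SVGD update for 1-D particles x under target score function grad_logp."""
    diff = x[:, None] - x[None, :]          # diff[j, i] = x_j - x_i
    k = np.exp(-diff**2 / (2 * h))          # RBF kernel k(x_j, x_i)
    grad_k = -diff / h * k                  # d k(x_j, x_i) / d x_j (repulsive term)
    phi = (k * grad_logp(x)[:, None] + grad_k).mean(axis=0)
    return x + eps * phi

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, size=50)            # particles start far from the target
for _ in range(500):
    x = svgd_step(x, lambda t: -t)          # target N(0, 1), so score(t) = -t
# particles now approximate the target: mean near 0, with nonzero spread
```

The repulsive kernel-gradient term is what lets the particle set represent a full distribution rather than collapsing onto a single mode, which is the property the thesis extends with diversity-enhancing kernels over path signatures.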
Extending the motion planning framework—MoveIt with advanced manipulation functions for industrial applications
MoveIt is the primary software library for motion planning and mobile manipulation in ROS, and it incorporates the latest advances in motion planning, control and perception. However, it is still quite recent, and some important functions needed to build more advanced manipulation applications, required to robotize many manufacturing processes, have not been developed yet. MoveIt is open-source software, and it relies on contributions from its community to keep improving and adding new features. Therefore, in this paper, its current state is analyzed to identify its main gaps and provide a solution to them. In particular, three gaps in MoveIt are addressed: automatic tool changing at runtime, the generation of trajectories with full control over the end-effector path and speed, and the generation of dual-arm trajectories using different synchronization policies. These functions have been tested with a Motoman SDA10F dual-arm robot, demonstrating their validity in different scenarios. All the developed solutions are generic and robot-agnostic, and they are openly available to extend the capabilities of MoveIt.
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models, that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
Comment: Accepted for EUROGRAPHICS 202
ACE: Adversarial Correspondence Embedding for Cross Morphology Motion Retargeting from Human to Nonhuman Characters
Motion retargeting is a promising approach for generating natural and
compelling animations for nonhuman characters. However, it is challenging to
translate human movements into semantically equivalent motions for target
characters with different morphologies due to the ambiguous nature of the
problem. This work presents a novel learning-based motion retargeting
framework, Adversarial Correspondence Embedding (ACE), to retarget human
motions onto target characters with different body dimensions and structures.
Our framework is designed to produce natural and feasible robot motions by
leveraging generative-adversarial networks (GANs) while preserving high-level
motion semantics by introducing an additional feature loss. In addition, we
pretrain a robot motion prior that can be controlled in a latent embedding
space and seek to establish a compact correspondence. We demonstrate that the
proposed framework can produce retargeted motions for three different
characters -- a quadrupedal robot with a manipulator, a crab character, and a
wheeled manipulator. We further validate the design choices of our framework by
conducting baseline comparisons and a user study. We also showcase sim-to-real
transfer of the retargeted motions by deploying them on a real Spot robot.