ENTL: Embodied Navigation Trajectory Learner
We propose Embodied Navigation Trajectory Learner (ENTL), a method for
extracting long sequence representations for embodied navigation. Our approach
unifies world modeling, localization and imitation learning into a single
sequence prediction task. We train our model using vector-quantized predictions
of future states conditioned on current states and actions. ENTL's generic
architecture enables sharing of the spatio-temporal sequence encoder
for multiple challenging embodied tasks. We achieve competitive performance on
navigation tasks using significantly less data than strong baselines while
performing auxiliary tasks such as localization and future frame prediction (a
proxy for world modeling). A key property of our approach is that the model is
pre-trained without any explicit reward signal, which makes the resulting model
generalizable to multiple tasks and environments.
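As a rough illustration of the sequence-prediction formulation described above, the sketch below trains a causal transformer over interleaved state/action tokens, where states are vector-quantized codes. The module names, vocabulary sizes, and token layout are assumptions for illustration only, not the ENTL code.

```python
# Hedged sketch: next-token prediction over interleaved state/action tokens,
# with states represented by vector-quantized codes. Illustrative only.
import torch
import torch.nn as nn

STATE_VOCAB, ACTION_VOCAB, D = 1024, 8, 256  # assumed sizes

class TrajectoryLearner(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(STATE_VOCAB + ACTION_VOCAB, D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D, STATE_VOCAB)  # predict next-state VQ codes

    def forward(self, tokens):
        # tokens: (B, T) interleaved [s_0, a_0, s_1, a_1, ...] token ids
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()  # causal mask
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)

# Training step: predict the token at position t+1 from everything up to t.
# In a full setup the loss would be applied only at state positions.
model = TrajectoryLearner()
tokens = torch.randint(0, STATE_VOCAB, (2, 16))  # placeholder batch
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, STATE_VOCAB), tokens[:, 1:].reshape(-1))
```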
Leveraging self-supervision for visual embodied navigation with neuralized potential fields
A fundamental task in robotics is to navigate between two locations. Particularly, real-world navigation can require long-horizon planning using high-dimensional RGB images, which poses a substantial challenge for end-to-end learning-based approaches. Current semi-parametric methods instead achieve long-horizon navigation by combining learned modules with a topological memory of the environment, often represented as a graph over previously collected images. However, using these graphs in practice typically involves tuning various pruning heuristics to prevent spurious edges, limit runtime memory usage, and allow reasonably fast graph queries.
In this work, we show how end-to-end approaches trained through Self-Supervised Learning (SSL) can excel in long-horizon navigation tasks. We initially present Duckie-Former (DF), an end-to-end approach for visual servoing in road-like environments. Using a Vision Transformer (ViT) pretrained with a self-supervised method, we derive a potential-fields-like navigation strategy based on a coarse image segmentation model. DF is assessed in the navigation tasks of lane-following and obstacle avoidance. Subsequently, we introduce our second approach called One-4-All (O4A). O4A leverages SSL and manifold learning to create a graph-free, end-to-end navigation pipeline whose goal is specified as an image. Navigation is achieved by greedily minimizing a potential function defined continuously over the O4A latent space. O4A is evaluated in complex indoor environments.
Both systems are trained offline on non-expert exploration sequences of RGB data and controls, and do not require any depth or pose measurements. Assessment is performed in simulated and real-world environments using a differential-drive robot.
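As a loose illustration of the greedy potential minimization in latent space mentioned above, the sketch below encodes the current and goal observations, predicts the latent reached by each candidate action, and picks the action with the lowest potential. The encoder, forward model, and Euclidean potential are placeholder choices, not the O4A implementation.

```python
# Hedged sketch of greedy potential-field navigation in a learned latent
# space. All modules are placeholders chosen for illustration.
import torch
import torch.nn as nn

class LatentNavigator(nn.Module):
    def __init__(self, obs_dim=512, act_dim=4, latent_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)            # phi(o)
        self.forward_model = nn.Linear(latent_dim + act_dim, latent_dim)

    def potential(self, z, z_goal):
        # Placeholder potential: Euclidean distance in latent space.
        return torch.norm(z - z_goal, dim=-1)

    def act(self, obs, goal_obs, actions):
        z, z_goal = self.encoder(obs), self.encoder(goal_obs)
        # Predict the latent reached by each candidate action, then pick the
        # action that greedily minimizes the potential toward the goal.
        z_next = self.forward_model(
            torch.cat([z.expand(len(actions), -1), actions], dim=-1))
        return int(self.potential(z_next, z_goal).argmin())

nav = LatentNavigator()
obs, goal = torch.randn(1, 512), torch.randn(1, 512)
candidate_actions = torch.eye(4)   # one-hot discrete actions
print(nav.act(obs, goal, candidate_actions))
```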
Target-Driven Structured Transformer Planner for Vision-Language Navigation
Vision-language navigation is the task of directing an embodied agent to
navigate in 3D scenes with natural language instructions. For the agent,
inferring the long-term navigation target from visual-linguistic clues is
crucial for reliable path planning, which, however, has rarely been studied
before in literature. In this article, we propose a Target-Driven Structured
Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware
navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism
for explicit estimation of the long-term target (even located in unexplored
environments). In addition, we design a Structured Transformer Planner which
elegantly incorporates the explored room layout into a neural attention
architecture for structured and global planning. Experimental results
demonstrate that our TD-STP substantially improves previous best methods'
success rate by 2% and 5% on the test sets of the R2R and REVERIE benchmarks,
respectively. Our code is available at https://github.com/YushengZhao/TD-STP
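The sketch below illustrates, very loosely, the idea of planning with attention over explored room-layout tokens together with a learned target token. All module names and shapes are illustrative assumptions, not the TD-STP architecture; see the linked repository for the actual implementation.

```python
# Illustrative-only sketch: score candidate room nodes by attending over
# layout tokens, a language feature, and a learned "target" token.
import torch
import torch.nn as nn

class StructuredPlanner(nn.Module):
    def __init__(self, d=256, num_heads=8):
        super().__init__()
        self.target_token = nn.Parameter(torch.zeros(1, 1, d))   # imagined target
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.score = nn.Linear(d, 1)                              # score room nodes

    def forward(self, room_tokens, instruction_feat):
        # room_tokens: (B, R, d) embeddings of explored room/layout nodes
        # instruction_feat: (B, 1, d) pooled language feature
        tokens = torch.cat(
            [self.target_token.expand(room_tokens.size(0), -1, -1),
             instruction_feat, room_tokens], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)    # global structured attention
        return self.score(out[:, 2:]).squeeze(-1)     # one score per room node

planner = StructuredPlanner()
scores = planner(torch.randn(2, 5, 256), torch.randn(2, 1, 256))
print(scores.shape)  # (2, 5)
```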
TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models
The full potential of large pretrained models remains largely untapped in
control domains like robotics. This is mainly because of the scarcity of data
and the computational challenges associated with training or fine-tuning these
large models for such applications. Prior work mainly emphasizes effective
pretraining of large models for decision-making, with little exploration into
how to perform data-efficient continual adaptation of these models for new
tasks. Recognizing these constraints, we introduce TAIL (Task-specific Adapters
for Imitation Learning), a framework for efficient adaptation to new control
tasks. Inspired by recent advancements in parameter-efficient fine-tuning in
language domains, we explore efficient fine-tuning techniques -- e.g.,
Bottleneck Adapters, P-Tuning, and Low-Rank Adaptation (LoRA) -- in TAIL to
adapt large pretrained models for new tasks with limited demonstration data.
Our extensive experiments in large-scale language-conditioned manipulation
tasks comparing prevalent parameter-efficient fine-tuning techniques and
adaptation baselines suggest that TAIL with LoRA can achieve the best
post-adaptation performance with only 1% of the trainable parameters of full
fine-tuning, while avoiding catastrophic forgetting and preserving adaptation
plasticity in continual learning settings.
Comment: 21 pages, 8 figures, 8 tables.
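As a concrete reference for the LoRA-style adaptation mentioned above, the sketch below wraps a frozen linear layer with a trainable low-rank update. This is a generic LoRA module with illustrative sizes and rank, not the TAIL codebase.

```python
# Generic LoRA adapter sketch: frozen pretrained weights plus a trainable
# low-rank update, the kind of module plugged in for each new task.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as identity
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank task-specific update.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank matrices are updated per task
```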
Exploiting Scene-specific Features for Object Goal Navigation
Can the intrinsic relation between an object and the room in which it is
usually located help agents in the Visual Navigation Task? We study this
question in the context of Object Navigation, a problem in which an agent has
to reach an object of a specific class while moving in a complex domestic
environment. In this paper, we introduce a new reduced dataset that speeds up
the training of navigation models, a notoriously complex task. Our proposed
dataset permits training models that do not exploit online-built maps in
reasonable time, even without huge computational resources. This reduced
dataset therefore provides a meaningful benchmark and can be used to identify
promising models that could then be tried on bigger and
more challenging datasets. Subsequently, we propose the SMTSC model, an
attention-based model that exploits the correlation between scenes and the
objects they contain, and we quantitatively show that this intuition holds.
Comment: Accepted at ACVR2020 ECCV2020 Workshop.
EgoEnv: Human-centric environment representations from egocentric video
First-person video highlights a camera-wearer's activities in the context of
their persistent environment. However, current video understanding approaches
reason over visual features from short video clips that are detached from the
underlying physical space and capture only what is immediately visible. To
facilitate human-centric environment understanding, we present an approach that
links egocentric video and the environment by learning representations that are
predictive of the camera-wearer's (potentially unseen) local surroundings. We
train such models using videos from agents in simulated 3D environments where
the environment is fully observable, and test them on human-captured real-world
videos from unseen environments. On two human-centric video tasks, we show that
models equipped with our environment-aware features consistently outperform
their counterparts with traditional clip features. Moreover, despite being
trained exclusively on simulated videos, our approach successfully handles
real-world videos from HouseTours and Ego4D, and achieves state-of-the-art
results on the Ego4D NLQ challenge. Project page:
https://vision.cs.utexas.edu/projects/ego-env/
Comment: Published in NeurIPS 2023 (Oral).
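A minimal sketch of the training setup described above, assuming precomputed clip features and a vector encoding of the local surroundings as the prediction target; both are placeholder choices rather than the authors' actual features or losses.

```python
# Hedged sketch: regress an encoding of the (possibly unseen) local
# surroundings from egocentric clip features, supervised in a fully
# observable simulator. Dimensions and targets are illustrative.
import torch
import torch.nn as nn

clip_dim, env_dim = 768, 128   # assumed clip-feature and surroundings dims
model = nn.Sequential(nn.Linear(clip_dim, 512), nn.ReLU(),
                      nn.Linear(512, env_dim))

clip_feats = torch.randn(32, clip_dim)   # features of short egocentric clips
env_targets = torch.randn(32, env_dim)   # placeholder ground-truth encoding of
                                         # local surroundings from the simulator
loss = nn.functional.mse_loss(model(clip_feats), env_targets)
loss.backward()
```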
Large Language Models for Robotics: A Survey
The human ability to learn, generalize, and control complex manipulation
tasks through multi-modality feedback suggests a unique capability, which we
refer to as dexterity intelligence. Understanding and assessing this
intelligence is a complex task. Amidst the swift progress and extensive
proliferation of large language models (LLMs), their applications in the field
of robotics have garnered increasing attention. LLMs possess the ability to
process and generate natural language, facilitating efficient interaction and
collaboration with robots. Researchers and engineers in the field of robotics
have recognized the immense potential of LLMs in enhancing robot intelligence,
human-robot interaction, and autonomy. Therefore, this comprehensive review
aims to summarize the applications of LLMs in robotics, delving into their
impact and contributions to key areas such as robot control, perception,
decision-making, and path planning. We first provide an overview of the
background and development of LLMs for robotics, followed by a description of
the benefits of LLMs for robotics and recent advancements in robotics models
based on LLMs. We then delve into the various techniques used in these models,
including those employed in perception, decision-making, control, and
interaction. Finally, we explore the applications of LLMs in robotics and some
potential challenges they may face in the near future. Embodied intelligence is
the future of intelligent science, and LLM-based robotics is one of the
promising but challenging paths toward achieving it.
Comment: Preprint. 4 figures, 3 tables.
End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon
Most recent work in goal oriented visual navigation resorts to large-scale
machine learning in simulated environments. The main challenge lies in learning
compact representations generalizable to unseen environments and in learning
high-capacity perception modules capable of reasoning on high-dimensional
input. The latter is particularly difficult when the goal is not given as a
category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception
module needs to learn a comparison strategy that requires solving an underlying
visual correspondence problem. This has been shown to be difficult from reward
alone or with standard auxiliary tasks. We address this problem through a
sequence of two pretext tasks, which serve as a prior for what we argue is one
of the main bottlenecks in perception: extremely wide-baseline relative pose
estimation and visibility prediction in complex scenes. The first pretext task,
cross-view completion, is a proxy for the underlying visual correspondence
problem, while the second task addresses goal detection and finding directly.
We propose a new dual encoder with a large-capacity binocular ViT model and
show that correspondence solutions naturally emerge from the training signals.
Experiments show significant improvements and SOTA performance on the two
benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics
and height differ between observation and goal.
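The sketch below illustrates a cross-view completion pretext task of the general kind referred to above: patches of the observation view are masked and reconstructed while attending to the goal view. The generic transformer blocks, masking ratio, and shapes are assumptions, not the paper's binocular ViT.

```python
# Hedged sketch of cross-view completion: reconstruct masked patches of one
# view while jointly encoding a second view, so that correspondences between
# the two views can be exploited. Illustrative shapes only.
import torch
import torch.nn as nn

D, P = 256, 196   # embedding dim, patches per view (assumed)
enc_layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
mask_token = torch.zeros(D)
decoder_head = nn.Linear(D, 3 * 16 * 16)   # reconstruct RGB patch pixels

obs_patches = torch.randn(2, P, D)    # current observation view (patch embeddings)
goal_patches = torch.randn(2, P, D)   # goal/exemplar view

masked = obs_patches.clone()
drop = torch.rand(2, P) < 0.75        # mask 75% of observation patches
masked[drop] = mask_token

# Joint encoding of both views lets the reconstruction rely on
# correspondences between the observation and the goal image.
tokens = torch.cat([masked, goal_patches], dim=1)
recon = decoder_head(encoder(tokens)[:, :P])   # predict masked obs patches
```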
Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear
Developing embodied agents in simulation has been a key research topic in
recent years. Exciting new tasks, algorithms, and benchmarks have been
developed in various simulators. However, most of them assume deaf agents in
silent environments, while we humans perceive the world with multiple senses.
We introduce Sonicverse, a multisensory simulation platform with integrated
audio-visual simulation for training household agents that can both see and
hear. Sonicverse models realistic continuous audio rendering in 3D environments
in real-time. Together with a new audio-visual VR interface that allows humans
to interact with agents with audio, Sonicverse enables a series of embodied AI
tasks that need audio-visual perception. For semantic audio-visual navigation
in particular, we also propose a new multi-task learning model that achieves
state-of-the-art performance. In addition, we demonstrate Sonicverse's realism
via sim-to-real transfer, which has not been achieved by other simulators: an
agent trained in Sonicverse can successfully perform audio-visual navigation in
real-world environments. Sonicverse is available at:
https://github.com/StanfordVL/Sonicverse.
Comment: In ICRA 2023. Project page: https://ai.stanford.edu/~rhgao/sonicverse/.
Code: https://github.com/StanfordVL/sonicverse. Gao and Li contributed equally
to this work and are in alphabetical order.
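As a generic illustration of a multi-task audio-visual navigation model of the kind mentioned above, the sketch below fuses visual and audio features and trains a policy head alongside an auxiliary sound-localization head. The architecture and heads are assumptions for illustration, not the Sonicverse model.

```python
# Generic multi-task audio-visual sketch: shared fusion, a navigation policy
# head, and an auxiliary head trained jointly. Illustrative only.
import torch
import torch.nn as nn

class AudioVisualNavigator(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, num_actions=4):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(vis_dim + aud_dim, 256), nn.ReLU())
        self.policy = nn.Linear(256, num_actions)   # main navigation task
        self.aux_goal = nn.Linear(256, 2)           # auxiliary: sound source (x, y)

    def forward(self, vis, aud):
        h = self.fuse(torch.cat([vis, aud], dim=-1))
        return self.policy(h), self.aux_goal(h)

model = AudioVisualNavigator()
logits, goal_xy = model(torch.randn(8, 512), torch.randn(8, 128))
# Joint training would sum the policy loss with the auxiliary localization loss.
```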
Modeling Dynamic Environments with Scene Graph Memory
Embodied AI agents that search for objects in large environments such as
households often need to make efficient decisions by predicting object
locations based on partial information. We pose this as a new type of link
prediction problem: link prediction on partially observable dynamic graphs. Our
graph is a representation of a scene in which rooms and objects are nodes, and
their relationships are encoded in the edges; only parts of the changing graph
are known to the agent at each timestep. This partial observability poses a
challenge to existing link prediction approaches, which we address. We propose
a novel state representation -- Scene Graph Memory (SGM) -- which captures the
agent's accumulated set of observations, as well as a neural net architecture
called a Node Edge Predictor (NEP) that extracts information from the SGM to
search efficiently. We evaluate our method in the Dynamic House Simulator, a
new benchmark that creates diverse dynamic graphs following the semantic
patterns typically seen at homes, and show that NEP can be trained to predict
the locations of objects in a variety of environments with diverse object
movement dynamics, outperforming baselines both in terms of new scene
adaptability and overall accuracy. The codebase and more can be found at
https://www.scenegraphmemory.com
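As an illustration of the memory-plus-predictor structure described above, the sketch below keeps a dictionary of (object, room) observations with timestamps and scores candidate object locations with a small network. Names, features, and the scorer are placeholders; the actual SGM/NEP implementation is in the linked codebase.

```python
# Hedged sketch of a scene-graph-style memory plus a link-prediction query.
import torch
import torch.nn as nn

class SceneGraphMemory:
    """Accumulates (object, room) observations with the step they were last seen."""
    def __init__(self):
        self.edges = {}                    # (obj, room) -> last observed step

    def observe(self, obj, room, step):
        self.edges[(obj, room)] = step

class NodeEdgePredictor(nn.Module):
    """Scores how likely an object is currently located in a given room."""
    def __init__(self, d=32):
        super().__init__()
        self.obj_emb = nn.Embedding(100, d)
        self.room_emb = nn.Embedding(20, d)
        self.scorer = nn.Sequential(nn.Linear(2 * d + 1, 64), nn.ReLU(),
                                    nn.Linear(64, 1))

    def forward(self, obj_id, room_id, steps_since_seen):
        x = torch.cat([self.obj_emb(obj_id), self.room_emb(room_id),
                       steps_since_seen.unsqueeze(-1)], dim=-1)
        return self.scorer(x).squeeze(-1)   # link score per (object, room) pair

memory = SceneGraphMemory()
memory.observe(obj=3, room=1, step=10)      # e.g., "mug last seen in kitchen"
nep = NodeEdgePredictor()
score = nep(torch.tensor([3]), torch.tensor([1]), torch.tensor([5.0]))
```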