A deep learning method based on language models for processing natural language Russian commands in human robot interaction
This paper describes the process of transforming complex natural-language commands in Russian into a formalized RDF graph format for interaction with the robotic platform.
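As a rough illustration of the kind of output such a pipeline might produce, the sketch below encodes one parsed command as an RDF graph with rdflib; the namespace, predicate names, and the parse itself are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch only: the namespace, predicates, and the parsed
# command below are assumptions, not the schema used in the paper.
from rdflib import Graph, Literal, Namespace, RDF

CMD = Namespace("http://example.org/robot-command#")  # hypothetical namespace

def command_to_rdf(action: str, obj: str, location: str) -> Graph:
    """Encode one parsed natural-language command as an RDF graph."""
    g = Graph()
    g.bind("cmd", CMD)
    node = CMD["cmd-1"]
    g.add((node, RDF.type, CMD.Command))
    g.add((node, CMD.action, Literal(action)))
    g.add((node, CMD.object, Literal(obj)))
    g.add((node, CMD.location, Literal(location)))
    return g

# e.g., an English gloss of a parsed command "take the cup to the kitchen"
print(command_to_rdf("deliver", "cup", "kitchen").serialize(format="turtle"))
```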
A System for Generalized 3D Multi-Object Search
Searching for objects is a fundamental skill for robots. As such, we expect
object search to eventually become an off-the-shelf capability for robots,
similar to e.g., object detection and SLAM. In contrast, however, no system for
3D object search exists that generalizes across real robots and environments.
In this paper, building upon a recent theoretical framework that exploited the
octree structure for representing belief in 3D, we present GenMOS (Generalized
Multi-Object Search), the first general-purpose system for multi-object search
(MOS) in a 3D region that is robot-independent and environment-agnostic. GenMOS
takes as input point cloud observations of the local region, object detection
results, and localization of the robot's view pose, and outputs a 6D viewpoint
to move to through online planning. In particular, GenMOS uses point cloud
observations in three ways: (1) to simulate occlusion; (2) to inform occupancy
and initialize octree belief; and (3) to sample a belief-dependent graph of
view positions that avoid obstacles. We evaluate our system both in simulation
and on two real robot platforms. Our system enables, for example, a Boston
Dynamics Spot robot to find a toy cat hidden underneath a couch in under one
minute. We further integrate 3D local search with 2D global search to handle
larger areas, demonstrating the resulting system in a 25m² lobby area.
Comment: 8 pages, 9 figures, 1 table. IEEE Conference on Robotics and Automation (ICRA) 2023
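The octree-belief idea can be pictured with a simplified flat-voxel stand-in: maintain a probability per region of space, raise it where the detector fires, and lower it in regions the point cloud shows to be observed and empty. The sketch below is a minimal illustration, not the GenMOS implementation; the detector noise rates are assumptions.

```python
# Minimal flat-voxel stand-in for the octree belief described above; a real
# octree keeps unvisited space aggregated at coarser resolution levels.
from collections import defaultdict

class VoxelBelief:
    def __init__(self, voxels):
        p = 1.0 / len(voxels)
        self.b = defaultdict(float, {v: p for v in voxels})

    def update(self, observed_voxels, detected_voxel=None,
               p_hit=0.9, p_miss=0.1):  # assumed detector noise rates
        # Bayesian update: boost belief where the detector fired, suppress
        # it in voxels observed to be empty, then renormalize.
        for v in observed_voxels:
            self.b[v] *= p_hit if v == detected_voxel else p_miss
        z = sum(self.b.values())
        for v in self.b:
            self.b[v] /= z
```

View positions sampled for planning would then favor vantage points that cover high-belief voxels while avoiding the obstacles implied by the point cloud.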
Language-Conditioned Observation Models for Visual Object Search
Object search is a challenging task because when given complex language
descriptions (e.g., "find the white cup on the table"), the robot must move its
camera through the environment and recognize the described object. Previous
works map language descriptions to a set of fixed object detectors with
predetermined noise models, but these approaches are challenging to scale
because new detectors need to be made for each object. In this work, we bridge
the gap in realistic object search by posing the search problem as a partially
observable Markov decision process (POMDP) where the object detector and visual
sensor noise in the observation model are determined by a single Deep Neural
Network conditioned on complex language descriptions. We incorporate the neural
network's outputs into our language-conditioned observation model (LCOM) to
represent dynamically changing sensor noise. With an LCOM, any language
description of an object can be used to generate an appropriate object detector
and noise model, and training an LCOM only requires readily available
supervised image-caption datasets. We empirically evaluate our method by
comparing against a state-of-the-art object search algorithm in simulation, and
demonstrate that planning with our observation model yields a significantly
higher average task completion rate (from 0.46 to 0.66) and more efficient and
quicker object search than with a fixed-noise model. We demonstrate our method
on a Boston Dynamics Spot robot, enabling it to handle complex natural language
object descriptions and efficiently find objects in a room-scale environment.
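A sketch of the interface this implies, with the deep network replaced by a stub: one language-conditioned model returns both a detection and its estimated noise rates, and those rates parameterize the POMDP observation likelihood. The names, rates, and structure below are assumptions, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class DetectorOutput:
    detected: bool         # did the language-conditioned detector fire?
    true_pos_rate: float   # estimated P(detection | object visible)
    false_pos_rate: float  # estimated P(detection | object not visible)

def conditioned_detector(image, description: str) -> DetectorOutput:
    # Stand-in for the deep network conditioned on the description.
    return DetectorOutput(detected=False, true_pos_rate=0.8, false_pos_rate=0.05)

def observation_likelihood(out: DetectorOutput, object_visible: bool) -> float:
    """P(observation | state), as consumed by the POMDP belief update."""
    if object_visible:
        return out.true_pos_rate if out.detected else 1.0 - out.true_pos_rate
    return out.false_pos_rate if out.detected else 1.0 - out.false_pos_rate
```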
Learning to Terminate in Object Navigation
This paper tackles the critical challenge of object navigation in autonomous
navigation systems, particularly focusing on the problem of target approach and
episode termination in environments with long optimal episode length in Deep
Reinforcement Learning (DRL) based methods. While effective in environment
exploration and object localization, conventional DRL methods often struggle
with optimal path planning and termination recognition due to a lack of depth
information. To overcome these limitations, we propose a novel approach, namely
the Depth-Inference Termination Agent (DITA), which incorporates a supervised
model called the Judge Model to implicitly infer object-wise depth and decide
termination jointly with reinforcement learning. We train our judge model in
parallel with the reinforcement learning agent and supervise it efficiently
via the reward signal. Our evaluation shows that the method achieves superior
performance: a 9.3% gain in success rate over our baseline method across all
room types and a 51.2% improvement in long-episode environments, while
maintaining slightly better Success Weighted by Path Length (SPL). Code,
resources, and visualizations are available at:
https://github.com/HuskyKingdom/DITA_acml2023
Comment: 16 pages
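One way to picture the judge model's role, sketched under assumed interfaces (this is not the DITA code): a supervised termination classifier gates the RL policy's actions.

```python
def act(policy, judge, observation, stop_threshold: float = 0.5):
    """Let the RL policy act, but terminate the episode when the judge
    model (trained in parallel, supervised by the reward signal) is
    confident the target is within stopping range."""
    if judge(observation) > stop_threshold:  # implicit depth-based decision
        return "STOP"
    return policy(observation)
```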
Exploiting Scene-specific Features for Object Goal Navigation
Can the intrinsic relation between an object and the room in which it is
usually located help agents in the Visual Navigation Task? We study this
question in the context of Object Navigation, a problem in which an agent has
to reach an object of a specific class while moving in a complex domestic
environment. In this paper, we introduce a new reduced dataset that speeds up
the training of navigation models, a notoriously complex task. Our proposed
dataset permits the training of models that do not exploit online-built maps in
reasonable times even without the use of huge computational resources.
Therefore, this reduced dataset guarantees a significant benchmark and it can
be used to identify promising models that could be then tried on bigger and
more challenging datasets. Subsequently, we propose the SMTSC model, an
attention-based model capable of exploiting the correlation between scenes and
objects contained in them, quantitatively demonstrating the validity of the idea.
Comment: Accepted at ACVR2020 (ECCV2020 Workshop)
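The scene-object correlation being exploited can be illustrated with a toy prior table; the numbers below are made-up assumptions, and the actual model learns this correlation with attention rather than a lookup.

```python
# Toy illustration of a scene-object prior; all values are assumptions.
SCENE_OBJECT_PRIOR = {
    "kitchen": {"mug": 0.60, "toaster": 0.30, "pillow": 0.01},
    "bedroom": {"mug": 0.05, "toaster": 0.01, "pillow": 0.70},
}

def object_prior(scene: str, target: str, default: float = 0.1) -> float:
    """Prior probability that the target object occurs in this scene type."""
    return SCENE_OBJECT_PRIOR.get(scene, {}).get(target, default)
```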
Benchmarking Augmentation Methods for Learning Robust Navigation Agents: the Winning Entry of the 2021 iGibson Challenge
Recent advances in deep reinforcement learning and scalable photorealistic
simulation have led to increasingly mature embodied AI for various visual
tasks, including navigation. However, while impressive progress has been made
for teaching embodied agents to navigate static environments, much less
progress has been made on more dynamic environments that may include moving
pedestrians or movable obstacles. In this study, we aim to benchmark different
augmentation techniques for improving the agent's performance in these
challenging environments. We show that adding several dynamic obstacles into
the scene during training confers significant improvements in test-time
generalization, achieving much higher success rates than baseline agents. We
find that this approach can also be combined with image augmentation methods to
achieve even higher success rates. Additionally, we show that this approach is
also more robust to sim-to-sim transfer than image augmentation methods.
Finally, we demonstrate the effectiveness of this dynamic obstacle augmentation
approach by using it to train an agent for the 2021 iGibson Challenge at CVPR,
where it achieved 1st place for Interactive Navigation. Video link:
https://www.youtube.com/watch?v=HxUX2HeOSE
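A sketch of what dynamic obstacle augmentation might look like at episode reset, under an assumed generic simulator API; `sample_walkable_point`, `spawn_obstacle`, and `get_observation` are hypothetical helpers, not the iGibson interface.

```python
def reset_with_obstacle_augmentation(sim, n_obstacles: int = 8):
    """Scatter movable obstacles over random walkable points at the start
    of each training episode to diversify the scenes the agent sees."""
    sim.reset()
    for _ in range(n_obstacles):
        position = sim.sample_walkable_point()      # hypothetical helper
        sim.spawn_obstacle(position, movable=True)  # hypothetical helper
    return sim.get_observation()                    # hypothetical helper
```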
D-VAT: End-to-End Visual Active Tracking for Micro Aerial Vehicles
Visual active tracking is a growing research topic in robotics due to its key
role in applications such as human assistance, disaster recovery, and
surveillance. In contrast to passive tracking, active tracking approaches
combine vision and control capabilities to detect and actively track the
target. Most of the work in this area focuses on ground robots, while the very
few contributions on aerial platforms still pose important design constraints
that limit their applicability. To overcome these limitations, in this paper we
propose D-VAT, a novel end-to-end visual active tracking methodology based on
deep reinforcement learning that is tailored to micro aerial vehicle platforms.
The D-VAT agent computes the vehicle thrust and angular velocity commands
needed to track the target by directly processing monocular camera
measurements. We show that the proposed approach allows for precise and
collision-free tracking operations, outperforming different state-of-the-art
baselines in simulated environments that differ significantly from those
encountered during training.
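The input/output interface described above can be sketched as follows; the tiny CNN is a placeholder assumption, not the D-VAT architecture.

```python
import torch
import torch.nn as nn

class TrackingPolicy(nn.Module):
    """Maps a monocular frame to [collective thrust, body rates wx, wy, wz]."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(               # placeholder CNN
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(4)                # 4 control outputs

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(image))

command = TrackingPolicy()(torch.zeros(1, 3, 84, 84))  # dummy camera frame
```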
Success Weighted by Completion Time: A Dynamics-Aware Evaluation Criteria for Embodied Navigation
We present Success weighted by Completion Time (SCT), a new metric for
evaluating navigation performance for mobile robots. Several related works on
navigation have used Success weighted by Path Length (SPL) as the primary
method of evaluating the path an agent makes to a goal location, but SPL is
limited in its ability to properly evaluate agents with complex dynamics. In
contrast, SCT explicitly takes the agent's dynamics model into consideration,
and aims to accurately capture how well the agent has approximated the fastest
navigation behavior afforded by its dynamics. While several embodied navigation
works use point-turn dynamics, we focus on unicycle-cart dynamics for our
agent, which better exemplifies the dynamics model of popular mobile robotics
platforms (e.g., LoCoBot, TurtleBot, Fetch, etc.). We also present
RRT*-Unicycle, an algorithm for unicycle dynamics that estimates the fastest
collision-free path and completion time from a starting pose to a goal location
in an environment containing obstacles. We experiment with deep reinforcement
learning and reward shaping to train and compare the navigation performance of
agents with different dynamics models. In evaluating these agents, we show that
in contrast to SPL, SCT is able to capture the advantages in navigation speed a
unicycle model has over a simpler point-turn model of dynamics. Lastly, we show
that we can successfully deploy our trained models and algorithms outside of
simulation in the real world. We embody our agents in a real robot to navigate
an apartment, and show that they can generalize in a zero-shot manner.
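By analogy with SPL, the metric plausibly takes the form below; the notation is assumed here: N episodes, success indicator S_i, the agent's completion time c_i, and the fastest completion time afforded by the dynamics (e.g., as estimated by RRT*-Unicycle).

```latex
\[
  \mathrm{SCT} \;=\; \frac{1}{N} \sum_{i=1}^{N} S_i \,
  \frac{\hat{c}_i}{\max\!\left(c_i,\ \hat{c}_i\right)}
\]
```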