CityLearn: Diverse Real-World Environments for Sample-Efficient Navigation Policy Learning
Visual navigation tasks in real-world environments often require both
self-motion and place recognition feedback. While deep reinforcement learning
has shown success in solving these perception and decision-making problems in
an end-to-end manner, these algorithms require large amounts of experience to
learn navigation policies from high-dimensional data, which is generally
impractical for real robots due to sample complexity. In this paper, we address
these problems with two main contributions. First, we leverage place recognition
and deep learning techniques combined with goal-destination feedback to
generate compact, bimodal image representations that can then be used to
effectively learn control policies from a small amount of experience. Second,
we present an interactive framework, CityLearn, that for the first time enables
training and deployment of navigation algorithms across city-sized, realistic
environments with extreme visual appearance changes. CityLearn features more
than 10 benchmark datasets, often used in visual place recognition and
autonomous driving research, including over 100 recorded traversals across 60
cities around the world. We evaluate our approach on two CityLearn
environments, training our navigation policy on a single traversal. Results
show our method can be over two orders of magnitude faster than when learning
from raw images, and can also generalize across extreme visual changes,
including day-to-night and summer-to-winter transitions.

Comment: Preprint version of article accepted to ICRA 202
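The abstract above describes compressing high-dimensional images into a compact, bimodal observation (a place-recognition embedding combined with goal-destination feedback) that a small policy network can learn from with far less experience. A minimal PyTorch sketch of that general idea; the class name, feature sizes, and discrete action set are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class CompactNavPolicy(nn.Module):
        """Policy over a compact, bimodal observation: a place-recognition
        embedding concatenated with goal-destination feedback.
        (Hypothetical sketch; names and sizes are not from the paper.)"""
        def __init__(self, place_dim=64, goal_dim=2, n_actions=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(place_dim + goal_dim, 128),
                nn.ReLU(),
                nn.Linear(128, n_actions),
            )

        def forward(self, place_embedding, goal_feedback):
            # Bimodal observation: visual place code + goal signal.
            obs = torch.cat([place_embedding, goal_feedback], dim=-1)
            return self.net(obs)  # action logits for an RL agent

    # Usage: the embedding would come from a place-recognition encoder
    # applied to a camera frame; random tensors stand in here.
    policy = CompactNavPolicy()
    logits = policy(torch.randn(1, 64), torch.randn(1, 2))
    action = torch.distributions.Categorical(logits=logits).sample()

Because the observation is only a few tens of dimensions rather than a full image, the policy network stays small, which is what would make the reported sample-efficiency gains plausible.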
VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation
Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate
through realistic 3D outdoor environments based on natural language
instructions. The performance of existing VLN methods is limited by
insufficient diversity in navigation environments and limited training data. To
address these issues, we propose VLN-Video, which utilizes the diverse outdoor
environments present in driving videos in multiple cities in the U.S. augmented
with automatically generated navigation instructions and actions to improve
outdoor VLN performance. VLN-Video combines the best of intuitive classical
approaches and modern deep learning techniques, using template infilling to
generate grounded navigation instructions and an image-rotation-similarity-based
navigation action predictor to obtain VLN-style data from driving videos for
pretraining deep learning VLN models. We pre-train the model
on the Touchdown dataset and our video-augmented dataset created from driving
videos with three proxy tasks: Masked Language Modeling, Instruction and
Trajectory Matching, and Next Action Prediction, so as to learn
temporally-aware and visually-aligned instruction representations. The learned
instruction representation is adapted to the state-of-the-art navigator when
fine-tuning on the Touchdown dataset. Empirical results demonstrate that
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in
task completion rate, achieving a new state-of-the-art on the Touchdown
dataset.

Comment: AAAI 202
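The abstract above mentions an image-rotation-similarity-based action predictor that labels consecutive driving-video frames with navigation actions. A rough sketch of that general idea, assuming panorama-like frames where a heading change shows up as a horizontal image shift; the function, shift size, and action names are illustrative assumptions, not the paper's actual method.

    import numpy as np

    def rotation_similarity_action(frame_t, frame_t1, shift_px=40):
        """Assign a coarse navigation action to the transition between two
        consecutive frames by testing which horizontal shift (a stand-in
        for a heading rotation) of the current frame best matches the
        next frame. Hypothetical sketch only."""
        def ncc(a, b):
            # Normalized cross-correlation between two images.
            a = a.ravel().astype(np.float64) - a.mean()
            b = b.ravel().astype(np.float64) - b.mean()
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

        candidates = {
            "forward": frame_t,                                 # no rotation
            "turn_left": np.roll(frame_t, shift_px, axis=1),    # scene shifts right
            "turn_right": np.roll(frame_t, -shift_px, axis=1),  # scene shifts left
        }
        scores = {action: ncc(img, frame_t1) for action, img in candidates.items()}
        return max(scores, key=scores.get), scores

    # Usage on dummy grayscale frames (H x W arrays).
    f0 = np.random.rand(120, 320)
    f1 = np.roll(f0, 40, axis=1)  # simulate a left turn
    print(rotation_similarity_action(f0, f1))

Pseudo-labels like these, paired with template-filled instructions, are what would let ordinary driving videos be converted into VLN-style instruction-trajectory-action data for pretraining.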