Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding
Understanding intrinsic patterns and predicting spatiotemporal
characteristics of cities require a comprehensive representation of urban
neighborhoods. Existing works relied on either inter- or intra-region
connectivities to generate neighborhood representations but failed to fully
utilize the informative yet heterogeneous data within neighborhoods. In this
work, we propose Urban2Vec, an unsupervised multi-modal framework which
incorporates both street view imagery and point-of-interest (POI) data to learn
neighborhood embeddings. Specifically, we use a convolutional neural network to
extract visual features from street view images while preserving geospatial
similarity. Furthermore, we model each POI as a bag-of-words containing its
category, rating, and review information. Analogous to document embedding in
natural language processing, we establish semantic similarity between a
neighborhood (the "document") and the words from its surrounding POIs in the vector
space. By jointly encoding visual, textual, and geospatial information into the
neighborhood representation, Urban2Vec can achieve performance better than
baseline models and comparable to fully supervised methods in downstream
prediction tasks. Extensive experiments on three U.S. metropolitan areas also
demonstrate the model's interpretability, generalization capability, and value
in neighborhood similarity analysis.
Comment: To appear in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).
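As an illustration of the POI-as-bag-of-words idea described in the abstract above, the sketch below treats each neighborhood as a "document" whose words are POI category, rating, and review tokens, and learns neighborhood embeddings with gensim's Doc2Vec. This is a simplified stand-in, not the actual Urban2Vec objective (which also encodes street view imagery with a CNN and uses geospatial similarity); the neighborhood IDs and tokens are made up for illustration.

```python
# Minimal sketch: neighborhoods as "documents" of POI tokens (not the
# actual Urban2Vec training objective, which also uses street view imagery).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical data: each neighborhood maps to tokens drawn from the
# categories, ratings, and review words of its surrounding POIs.
neighborhood_pois = {
    "nbhd_001": ["coffee_shop", "rating_4", "cozy", "espresso", "park"],
    "nbhd_002": ["gas_station", "rating_3", "convenient", "highway"],
    "nbhd_003": ["museum", "rating_5", "art", "coffee_shop", "quiet"],
}

documents = [
    TaggedDocument(words=tokens, tags=[nbhd_id])
    for nbhd_id, tokens in neighborhood_pois.items()
]

# Train document embeddings; vector_size/window/epochs are illustrative.
model = Doc2Vec(documents, vector_size=32, window=3, min_count=1, epochs=50)

# Neighborhood embedding and nearest neighbors in the shared vector space.
vec = model.dv["nbhd_001"]
print(model.dv.most_similar("nbhd_001"))
```

A retrieved neighbor list like this is what enables the neighborhood similarity analysis mentioned in the abstract.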
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
Remote sensing imagery, despite its broad applications in helping achieve
Sustainable Development Goals and tackle climate change, has not yet benefited
from the recent advancements of versatile, task-agnostic vision language models
(VLMs). A key reason is that the large-scale, semantically diverse image-text
dataset required for developing VLMs is still absent for remote sensing images.
Unlike natural images, remote sensing images and their associated text
descriptions cannot be efficiently collected from the public Internet at scale.
In this work, we bridge this gap by using geo-coordinates to automatically
connect open, unlabeled remote sensing images with rich semantics covered in
OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language
dataset for remote sensing images, comprising 2.6 million image-text pairs
covering 29K distinct semantic tags. With continual pre-training on this
dataset, we obtain a VLM that surpasses baseline models with a 6.2% average
accuracy gain in zero-shot scene classification across seven benchmark
datasets. It also demonstrates the ability of zero-shot transfer for
fine-grained object attribute classification and cross-modal retrieval. We hope
this dataset can support the advancement of VLMs for various multi-modal tasks
in remote sensing, such as open-vocabulary classification, retrieval,
captioning, and text-to-image synthesis.
Comment: Accepted by AAAI 2024.
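To make the geo-coordinate linking described above concrete, here is a small hypothetical sketch of the pairing step: given an image's footprint (center plus approximate ground coverage), collect OpenStreetMap features whose coordinates fall inside it and compose a caption from their tags. The footprint geometry, tag filtering, and caption template are simplified assumptions, not the dataset's actual construction pipeline.

```python
# Hypothetical sketch of linking a remote sensing image to OSM semantics
# by geo-coordinates; not the actual SkyScript construction pipeline.
import math
from dataclasses import dataclass

@dataclass
class OSMFeature:
    lat: float
    lon: float
    tags: dict  # e.g., {"building": "stadium", "sport": "soccer"}

def within_footprint(center_lat, center_lon, feat, half_size_m=320.0):
    """Approximate check that a feature lies inside a square image footprint."""
    dlat_m = (feat.lat - center_lat) * 111_320.0
    dlon_m = (feat.lon - center_lon) * 111_320.0 * math.cos(math.radians(center_lat))
    return abs(dlat_m) <= half_size_m and abs(dlon_m) <= half_size_m

def caption_from_features(features):
    """Compose a simple text description from the tags of matched features."""
    phrases = sorted({f"{k.replace('_', ' ')}: {v}" for f in features for k, v in f.tags.items()})
    return "a satellite image of " + ", ".join(phrases) if phrases else ""

# Toy example: one image center and two candidate OSM features.
feats = [
    OSMFeature(37.4275, -122.1697, {"building": "university", "amenity": "library"}),
    OSMFeature(37.9000, -122.5000, {"natural": "water"}),
]
matched = [f for f in feats if within_footprint(37.4270, -122.1690, f)]
print(caption_from_features(matched))
```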
RL-ViGen: A Reinforcement Learning Benchmark for Visual Generalization
Visual Reinforcement Learning (Visual RL), coupled with high-dimensional
observations, has consistently confronted the long-standing challenge of
generalization. Despite the focus on algorithms aimed at resolving visual
generalization problems, we argue that the devil is in the existing benchmarks
as they are restricted to isolated tasks and generalization categories,
undermining a comprehensive evaluation of agents' visual generalization
capabilities. To bridge this gap, we introduce RL-ViGen: a novel Reinforcement
Learning Benchmark for Visual Generalization, which contains diverse tasks and
a wide spectrum of generalization types, thereby facilitating the derivation of
more reliable conclusions. Furthermore, RL-ViGen incorporates the latest
visual RL generalization algorithms into a unified framework, under which the
experimental results indicate that no single existing algorithm has prevailed
universally across tasks. Our aspiration is that RL-ViGen will serve as a
catalyst in this area, and lay a foundation for the future creation of
universal visual generalization RL agents suitable for real-world scenarios.
Access to our code and implemented algorithms is provided at
https://gemcollector.github.io/RL-ViGen/
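As a rough illustration of the "diverse tasks x generalization types" evaluation described above, the sketch below scores a policy over every pairing of task and visual shift. The environment class, shift names, and scoring are placeholders invented for illustration, not the RL-ViGen API.

```python
# Hypothetical sketch of a unified evaluation grid over visual
# generalization categories; names are placeholders, not RL-ViGen's API.
import random
from statistics import mean

GENERALIZATION_TYPES = ["background", "lighting", "camera_view", "texture", "cross_embodiment"]

class DummyEnv:
    """Stand-in environment that returns random episode returns for illustration."""
    def __init__(self, task, shift):
        self.task, self.shift = task, shift
    def rollout(self, policy, episodes=10):
        return [sum(random.random() for _ in range(100)) for _ in range(episodes)]

def evaluate(policy, tasks):
    """Score a policy on every (task, generalization type) pair."""
    scores = {}
    for task in tasks:
        for shift in GENERALIZATION_TYPES:
            env = DummyEnv(task, shift)  # placeholder for a benchmark env factory
            scores[(task, shift)] = mean(env.rollout(policy))
    return scores

print(evaluate(policy=None, tasks=["robot_reach", "drive"]))
```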
GenSim: Generating Robotic Simulation Tasks via Large Language Models
Collecting large amounts of real-world interaction data to train general
robotic policies is often prohibitively expensive, thus motivating the use of
simulation data. However, existing methods for data generation have generally
focused on scene-level diversity (e.g., object instances and poses) rather than
task-level diversity, due to the human effort required to come up with and
verify novel tasks. This has made it challenging for policies trained on
simulation data to demonstrate significant task-level generalization. In this
paper, we propose to automatically generate rich simulation environments and
expert demonstrations by exploiting the grounding and coding abilities of large
language models (LLMs). Our approach, dubbed GenSim, has two modes: goal-directed
generation, wherein a target task is given to the LLM and the LLM proposes a
task curriculum to solve the target task, and exploratory generation, wherein
the LLM bootstraps from previous tasks and iteratively proposes novel tasks
that would be helpful in solving more complex tasks. We use GPT4 to expand the
existing benchmark by ten times to over 100 tasks, on which we conduct
supervised finetuning and evaluate several LLMs including finetuned GPTs and
Code Llama on code generation for robotic simulation tasks. Furthermore, we
observe that LLM-generated simulation programs can enhance task-level
generalization significantly when used for multitask policy training. We
further find that with minimal sim-to-real adaptation, the multitask policies
pretrained on GPT4-generated simulation tasks exhibit stronger transfer to
unseen long-horizon tasks in the real world and outperform baselines by 25%.
See the project website (https://liruiw.github.io/gensim) for code, demos, and
videos.
Comment: See our project website (https://liruiw.github.io/gensim), demo and datasets (https://huggingface.co/spaces/Gen-Sim/Gen-Sim), and code (https://github.com/liruiw/GenSim) for more details.
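The two generation modes described in the abstract above can be sketched as two simple loops around an LLM call. Everything here, including query_llm, verify_task, and the prompts, is a hypothetical placeholder standing in for the actual GenSim pipeline (which also produces expert demonstrations and checks tasks in simulation).

```python
# Hypothetical sketch of GenSim-style goal-directed and exploratory task
# generation; query_llm and verify_task are placeholders, not GenSim's code.
def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a chat-completion API)."""
    return "def toy_task():\n    pass  # canned response standing in for generated task code"

def verify_task(task_code: str) -> bool:
    """Placeholder check: does the generated task compile, run, and yield valid demos?"""
    return True

def goal_directed_generation(target_task: str, library: list[str]) -> list[str]:
    # Ask the LLM for a curriculum of intermediate tasks leading to the target.
    curriculum = query_llm(
        f"Existing tasks: {library}\nPropose a task curriculum (as code) to solve: {target_task}"
    ).split("\n\n")
    return [code for code in curriculum if verify_task(code)]

def exploratory_generation(library: list[str], rounds: int = 5) -> list[str]:
    # Bootstrap from previous tasks and iteratively propose novel ones.
    for _ in range(rounds):
        new_task = query_llm(f"Existing tasks: {library}\nPropose one novel, useful task as code.")
        if verify_task(new_task):
            library.append(new_task)
    return library

library = ["stack-blocks", "open-drawer"]
print(goal_directed_generation("build-a-tower", library))
```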
LiCROM: Linear-Subspace Continuous Reduced Order Modeling with Neural Fields
Linear reduced-order modeling (ROM) simplifies complex simulations by
approximating the behavior of a system using a simplified kinematic
representation. Typically, ROM is trained on input simulations created with a
specific spatial discretization, and then serves to accelerate simulations with
the same discretization. This discretization-dependence is restrictive.
Becoming independent of a specific discretization would provide flexibility
to mix and match mesh resolutions, connectivity, and type (tetrahedral,
hexahedral) in training data; to accelerate simulations with novel
discretizations unseen during training; and to accelerate adaptive simulations
that temporally or parametrically change the discretization.
We present a flexible, discretization-independent approach to reduced-order
modeling. Like traditional ROM, we represent the configuration as a linear
combination of displacement fields. Unlike traditional ROM, our displacement
fields are continuous maps from every point on the reference domain to a
corresponding displacement vector; these maps are represented as implicit
neural fields.
With linear continuous ROM (LiCROM), our training set can include multiple
geometries undergoing multiple loading conditions, independent of their
discretization. This opens the door to novel applications of reduced order
modeling. We can now accelerate simulations that modify the geometry at
runtime, for instance via cutting, hole punching, and even swapping the entire
mesh. We can also accelerate simulations of geometries unseen during training.
We demonstrate one-shot generalization, training on a single geometry and
subsequently simulating various unseen geometries.
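The abstract above describes the configuration as a linear combination of continuous displacement fields represented by neural fields. A minimal PyTorch sketch of that idea follows: an MLP maps a reference-domain point X to r basis displacement vectors, and the reduced coordinates q weight them. The network size, reduced dimension, and sampling are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a discretization-independent linear subspace:
# displacement u(X) = B(X) @ q, with B a neural field over the reference
# domain. Sizes and architecture are illustrative, not LiCROM's actual setup.
import torch
import torch.nn as nn

class NeuralDisplacementBasis(nn.Module):
    def __init__(self, dim=3, reduced_dim=8, hidden=64):
        super().__init__()
        self.reduced_dim, self.dim = reduced_dim, dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, reduced_dim * dim),  # r basis vectors per query point
        )

    def forward(self, X, q):
        # X: (N, dim) reference points (any sampling/discretization), q: (reduced_dim,)
        B = self.mlp(X).view(-1, self.reduced_dim, self.dim)  # (N, r, dim)
        return torch.einsum("nrd,r->nd", B, q)                # displacements u(X)

basis = NeuralDisplacementBasis()
X = torch.rand(1000, 3)   # points sampled anywhere on the reference domain
q = torch.zeros(8)        # reduced coordinates evolved by the simulator
u = basis(X, q)           # (1000, 3) displacement field evaluated at X
print(u.shape)
```

Because the basis is queried pointwise, the same trained subspace can be evaluated on any mesh resolution, connectivity, or element type, which is the discretization independence the abstract emphasizes.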
H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation
Human hands possess remarkable dexterity and have long served as a source of
inspiration for robotic manipulation. In this work, we propose a human
hand-informed visual representation learning framework to
solve difficult dexterous manipulation tasks (H-InDex)
with reinforcement learning. Our framework consists of three stages: (i)
pre-training representations with 3D human hand pose estimation, (ii) offline
adapting representations with self-supervised keypoint detection, and (iii)
reinforcement learning with exponential moving average BatchNorm. The last two
stages modify only a small fraction of the pre-trained representation's
parameters, ensuring the knowledge from pre-training is maintained to the full
extent. We empirically study 12 challenging dexterous manipulation tasks and
find that H-InDex largely surpasses strong baseline methods and the recent
visual foundation models for motor control. Code is available at
https://yanjieze.com/H-InDex.
Comment: NeurIPS 2023. Code and videos: https://yanjieze.com/H-InDex
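As a rough illustration of stage (iii) described above, the sketch below freezes a pre-trained visual encoder's weights and lets only the BatchNorm running statistics adapt during RL through their usual exponential-moving-average update (PyTorch's momentum). The ResNet backbone and momentum value are illustrative choices, not necessarily the paper's exact configuration.

```python
# Rough sketch of adapting only BatchNorm statistics (via their EMA updates)
# while freezing all pre-trained weights; backbone and momentum are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18

encoder = resnet18(weights=None)  # stand-in for a hand-pose pre-trained encoder

# Freeze every learnable parameter so gradients never change the representation.
for p in encoder.parameters():
    p.requires_grad_(False)

# Keep BatchNorm layers in training mode so running_mean/running_var keep
# updating as exponential moving averages of the RL observations.
encoder.eval()
for m in encoder.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.train()
        m.momentum = 0.01  # slow EMA; illustrative value

obs = torch.rand(8, 3, 224, 224)   # batch of visual observations from the RL env
with torch.no_grad():
    features = encoder(obs)        # BN statistics adapt, weights stay fixed
print(features.shape)
```

This keeps nearly all pre-trained knowledge intact while letting the representation track the visual statistics of the manipulation environment, matching the small-update property the abstract highlights.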