
    Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding

    Understanding intrinsic patterns and predicting spatiotemporal characteristics of cities require a comprehensive representation of urban neighborhoods. Existing works relied on either inter- or intra-region connectivities to generate neighborhood representations but failed to fully utilize the informative yet heterogeneous data within neighborhoods. In this work, we propose Urban2Vec, an unsupervised multi-modal framework which incorporates both street view imagery and point-of-interest (POI) data to learn neighborhood embeddings. Specifically, we use a convolutional neural network to extract visual features from street view images while preserving geospatial similarity. Furthermore, we model each POI as a bag-of-words containing its category, rating, and review information. Analogous to document embedding in natural language processing, we establish the semantic similarity between a neighborhood (the "document") and the words from its surrounding POIs in the vector space. By jointly encoding visual, textual, and geospatial information into the neighborhood representation, Urban2Vec can achieve performance better than baseline models and comparable to fully supervised methods in downstream prediction tasks. Extensive experiments on three U.S. metropolitan areas also demonstrate the model's interpretability, generalization capability, and value in neighborhood similarity analysis. Comment: To appear in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).
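
    The framework described above combines two training signals: a geospatially aware image loss and a doc2vec-style loss over POI "words". The sketch below (PyTorch) is a minimal illustration of that idea, not the authors' implementation; the network sizes, the triplet formulation, and the negative-sampling loss are all assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Urban2VecSketch(nn.Module):
            def __init__(self, num_neighborhoods, vocab_size, dim=64):
                super().__init__()
                # one learnable vector per neighborhood ("document")
                self.neigh = nn.Embedding(num_neighborhoods, dim)
                # embeddings for POI words (categories, rating buckets, review tokens)
                self.word = nn.Embedding(vocab_size, dim)
                # small stand-in CNN for the street-view feature extractor
                self.cnn = nn.Sequential(
                    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
                )
                self.triplet = nn.TripletMarginLoss(margin=1.0)

            def image_loss(self, neigh_ids, own_imgs, other_imgs):
                # anchor: neighborhood vector; positive: its own street-view image;
                # negative: an image from a distant neighborhood
                return self.triplet(self.neigh(neigh_ids), self.cnn(own_imgs), self.cnn(other_imgs))

            def poi_loss(self, neigh_ids, pos_words, neg_words):
                # skip-gram-style loss between a neighborhood and POI words
                v = self.neigh(neigh_ids)                 # (B, D)
                pos = (v * self.word(pos_words)).sum(-1)  # (B,)
                neg = (v * self.word(neg_words)).sum(-1)  # (B,)
                return -(F.logsigmoid(pos) + F.logsigmoid(-neg)).mean()

        model = Urban2VecSketch(num_neighborhoods=100, vocab_size=5000)
        ids = torch.randint(0, 100, (8,))
        loss = model.image_loss(ids, torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)) \
             + model.poi_loss(ids, torch.randint(0, 5000, (8,)), torch.randint(0, 5000, (8,)))
        loss.backward()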

    SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

    Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis. Comment: Accepted by AAAI 2024.
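
    The key construction step described above is linking an unlabeled image to OpenStreetMap semantics through its geo-coordinates. The following self-contained sketch illustrates that pairing idea with toy data structures; the dataclasses, the point-in-footprint test, and the caption template are illustrative assumptions rather than the SkyScript pipeline.

        from dataclasses import dataclass

        @dataclass
        class OSMFeature:
            lat: float
            lon: float
            tags: dict  # e.g. {"leisure": "golf_course"}

        @dataclass
        class ImageFootprint:
            path: str
            min_lat: float
            min_lon: float
            max_lat: float
            max_lon: float

        def features_in_footprint(fp, features):
            # keep OSM features whose coordinates fall inside the image footprint
            return [f for f in features
                    if fp.min_lat <= f.lat <= fp.max_lat and fp.min_lon <= f.lon <= fp.max_lon]

        def make_caption(features):
            # turn the semantic tags of the matched features into a short text description
            phrases = sorted({f"{k.replace('_', ' ')}: {v.replace('_', ' ')}"
                              for f in features for k, v in f.tags.items()})
            return "a satellite image containing " + "; ".join(phrases) if phrases else ""

        fp = ImageFootprint("tile_001.tif", 37.42, -122.10, 37.44, -122.08)
        osm = [OSMFeature(37.43, -122.09, {"leisure": "golf_course"}),
               OSMFeature(37.431, -122.085, {"highway": "residential"})]
        print(make_caption(features_in_footprint(fp, osm)))
        # -> a satellite image containing highway: residential; leisure: golf course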

    RL-ViGen: A Reinforcement Learning Benchmark for Visual Generalization

    Visual Reinforcement Learning (Visual RL), coupled with high-dimensional observations, has consistently confronted the long-standing challenge of generalization. Despite the focus on algorithms aimed at resolving visual generalization problems, we argue that the devil is in the existing benchmarks as they are restricted to isolated tasks and generalization categories, undermining a comprehensive evaluation of agents' visual generalization capabilities. To bridge this gap, we introduce RL-ViGen: a novel Reinforcement Learning Benchmark for Visual Generalization, which contains diverse tasks and a wide spectrum of generalization types, thereby facilitating the derivation of more reliable conclusions. Furthermore, RL-ViGen incorporates the latest visual RL algorithms for generalization into a unified framework, under which the experimental results indicate that no single existing algorithm has prevailed universally across tasks. Our aspiration is that RL-ViGen will serve as a catalyst in this area and lay a foundation for the future creation of universal visual generalization RL agents suitable for real-world scenarios. Access to our code and implemented algorithms is provided at https://gemcollector.github.io/RL-ViGen/.
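
    As a toy illustration of the evaluation protocol argued for above (score the same agent across several visual generalization settings rather than a single one), here is a short self-contained sketch; the environment, settings, and placeholder policy are invented stand-ins, not the RL-ViGen API.

        import random
        from statistics import mean

        class ToyVisualEnv:
            # minimal episodic environment whose reward is perturbed per visual setting
            def __init__(self, setting):
                self.setting = setting  # e.g. "train", "color_shift", "camera_view", "lighting"

            def rollout(self, agent, steps=100):
                total = 0.0
                for _ in range(steps):
                    obs = [random.random() for _ in range(8)]
                    action = agent(obs)
                    # mimic a generalization gap outside the training setting
                    total += 1.0 - abs(action - 0.5) - (0.0 if self.setting == "train" else 0.1)
                return total

        def evaluate(agent, settings, episodes=5):
            # mean episode return per generalization setting
            return {s: mean(ToyVisualEnv(s).rollout(agent) for _ in range(episodes))
                    for s in settings}

        agent = lambda obs: sum(obs) / len(obs)  # placeholder policy
        print(evaluate(agent, ["train", "color_shift", "camera_view", "lighting"]))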

    GenSim: Generating Robotic Simulation Tasks via Large Language Models

    Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model's (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs, including finetuned GPTs and Code Llama, on code generation for robotic simulation tasks. Furthermore, we observe that LLM-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos. Comment: See our project website (https://liruiw.github.io/gensim), demo and datasets (https://huggingface.co/spaces/Gen-Sim/Gen-Sim), and code (https://github.com/liruiw/GenSim) for more details.
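
    The two generation modes described above can be pictured as two prompt loops around any text-in/text-out model plus a verification step. The schematic sketch below is an assumption-level illustration, not the GenSim code; llm stands for any such callable and verify_task stands in for GenSim's checks that generated simulation code runs and is solvable.

        def verify_task(code):
            # placeholder: in practice, run the generated task code in simulation
            # and check that a scripted expert can complete it
            return bool(code.strip())

        def goal_directed_generation(llm, target_task, existing_tasks, rounds=3):
            # ask the model for a curriculum of intermediate tasks leading to a target task
            curriculum = []
            for _ in range(rounds):
                prompt = (f"Known simulation tasks:\n{existing_tasks + curriculum}\n"
                          f"Target task: {target_task}\n"
                          "Propose the next intermediate task as runnable simulation code.")
                code = llm(prompt)
                if verify_task(code):
                    curriculum.append(code)
            return curriculum

        def exploratory_generation(llm, existing_tasks, rounds=3):
            # bootstrap novel, more complex tasks from previously accepted ones
            tasks = list(existing_tasks)
            for _ in range(rounds):
                prompt = (f"Known simulation tasks:\n{tasks}\n"
                          "Propose one novel, more complex task as runnable simulation code.")
                code = llm(prompt)
                if verify_task(code):
                    tasks.append(code)
            return tasks

        # usage with a stub "LLM" that just echoes a canned task
        print(exploratory_generation(lambda prompt: "def task(): pass", ["stack_blocks"]))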

    LiCROM: Linear-Subspace Continuous Reduced Order Modeling with Neural Fields

    Linear reduced-order modeling (ROM) simplifies complex simulations by approximating the behavior of a system using a simplified kinematic representation. Typically, ROM is trained on input simulations created with a specific spatial discretization, and then serves to accelerate simulations with the same discretization. This discretization-dependence is restrictive. Becoming independent of a specific discretization would provide flexibility to mix and match mesh resolutions, connectivity, and type (tetrahedral, hexahedral) in training data; to accelerate simulations with novel discretizations unseen during training; and to accelerate adaptive simulations that temporally or parametrically change the discretization. We present a flexible, discretization-independent approach to reduced-order modeling. Like traditional ROM, we represent the configuration as a linear combination of displacement fields. Unlike traditional ROM, our displacement fields are continuous maps from every point on the reference domain to a corresponding displacement vector; these maps are represented as implicit neural fields. With linear continuous ROM (LiCROM), our training set can include multiple geometries undergoing multiple loading conditions, independent of their discretization. This opens the door to novel applications of reduced order modeling. We can now accelerate simulations that modify the geometry at runtime, for instance via cutting, hole punching, and even swapping the entire mesh. We can also accelerate simulations of geometries unseen during training. We demonstrate one-shot generalization, training on a single geometry and subsequently simulating various unseen geometries.
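
    The core representation described above, a linear combination of continuous displacement fields encoded by a neural field, can be sketched in a few lines. The architecture below (PyTorch) is an assumed illustration of u(X) = B(X) q, where B is a coordinate MLP returning r basis vectors at any reference point X and q are the reduced coordinates; it is not the authors' implementation.

        import torch
        import torch.nn as nn

        class DisplacementBasisField(nn.Module):
            def __init__(self, dim=3, subspace=10, hidden=128):
                super().__init__()
                self.dim, self.subspace = dim, subspace
                self.net = nn.Sequential(
                    nn.Linear(dim, hidden), nn.ELU(),
                    nn.Linear(hidden, hidden), nn.ELU(),
                    nn.Linear(hidden, dim * subspace),  # r basis vectors per query point
                )

            def forward(self, X, q):
                # X: (N, dim) reference points from any discretization; q: (subspace,) reduced state
                B = self.net(X).view(-1, self.dim, self.subspace)  # (N, dim, r)
                return torch.einsum("ndr,r->nd", B, q)             # (N, dim) displacements

        field = DisplacementBasisField()
        X = torch.rand(1000, 3)   # sample points; no mesh or fixed connectivity required
        q = torch.zeros(10)       # reduced coordinates advanced by the ROM time integrator
        u = field(X, q)           # continuous, discretization-independent displacement field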

    H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation

    Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human Hand-Informed visual representation learning framework to solve difficult Dexterous manipulation tasks (H-InDex) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adapting representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages modify only 0.36% of the pre-trained representation's parameters in total, ensuring the knowledge from pre-training is maintained to the full extent. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex largely surpasses strong baseline methods and the recent visual foundation models for motor control. Code is available at https://yanjieze.com/H-InDex. Comment: NeurIPS 2023. Code and videos: https://yanjieze.com/H-InDex.
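
    Stage (iii) above, reinforcement learning with exponential moving average BatchNorm, can be read as: freeze the pre-trained encoder except for its BatchNorm layers, whose running statistics keep adapting slowly during RL. The sketch below (PyTorch) is one interpretation under that assumption, with a toy encoder and policy head; it is not the released H-InDex code.

        import torch
        import torch.nn as nn

        def freeze_except_batchnorm(encoder, ema_momentum=0.01):
            for module in encoder.modules():
                if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
                    module.momentum = ema_momentum   # slow EMA update of running stats
                    module.train()                   # keep updating running mean/var
                else:
                    module.eval()
            for p in encoder.parameters():
                p.requires_grad_(False)              # no gradient updates to the backbone
            return encoder

        # toy encoder standing in for the pre-trained visual representation
        encoder = freeze_except_batchnorm(nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        ))
        policy_head = nn.Linear(32, 4)               # e.g. a 4-dim action space
        obs = torch.randn(16, 3, 84, 84)
        action_logits = policy_head(encoder(obs))    # only the policy head receives gradients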