Learning Navigational Visual Representations with Semantic Map Supervision
Being able to perceive the semantics and the spatial structure of the
environment is essential for visual navigation of a household robot. However,
most existing works adapt to the indoor navigation domain with visual
backbones pre-trained either on independent images for classification or with
self-supervised learning methods, neglecting the spatial relationships that
are essential to learning navigation. Inspired by the way humans naturally
build semantically and spatially meaningful cognitive maps during navigation,
in this paper we propose a novel navigation-specific visual representation
learning method that contrasts the agent's egocentric views with semantic maps
(Ego-Map). We use a vision transformer as the backbone encoder and train the
model with data collected
from the large-scale Habitat-Matterport3D environments. Ego-Map learning
transfers the compact and rich information from a map, such as objects,
structure and transition, to the agent's egocentric representations for
navigation. Experiments show that agents using our learned representations on
object-goal navigation outperform recent visual pre-training methods. Moreover,
our representations significantly improve vision-and-language navigation in
continuous environments for both high-level and low-level action spaces,
achieving new state-of-the-art results of 47% SR and 41% SPL on the test
server.
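To make the contrastive idea concrete, here is a minimal sketch of an egocentric-view/semantic-map InfoNCE objective. It illustrates the general technique the abstract names, assuming paired, already-projected embeddings; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def egomap_contrastive_loss(ego_feats, map_feats, temperature=0.07):
    """InfoNCE between paired egocentric-view and semantic-map embeddings.

    ego_feats, map_feats: (B, D) projected features of an agent's egocentric
    view and the corresponding local semantic map; row i of each tensor is
    assumed to form a positive pair (an illustrative assumption).
    """
    ego = F.normalize(ego_feats, dim=-1)
    smap = F.normalize(map_feats, dim=-1)
    logits = ego @ smap.t() / temperature              # (B, B) similarities
    targets = torch.arange(ego.size(0), device=ego.device)
    # Symmetric cross-entropy: match each view to its map and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```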
Embodied learning for visual recognition
The field of visual recognition in recent years has come to rely on large, expensively curated, and manually labeled "bags of disembodied images". In the wake of this, my focus has been on understanding and exploiting alternate "free" sources of supervision available to visual learning agents that are situated within real environments. For example, even simply moving from orderless image collections to continuous visual observations offers opportunities to understand the dynamics and other physical properties of the visual world. Further, embodied agents may be able to move around their environment and/or effect changes within it; these abilities offer new means to acquire useful supervision. In this dissertation, I present my work along this and related directions.
ALP: Action-Aware Embodied Learning for Perception
Current methods in training and benchmarking vision models exhibit an
over-reliance on passive, curated datasets. Although models trained on these
datasets have shown strong performance in a wide variety of tasks such as
classification, detection, and segmentation, they are fundamentally unable to
generalize to an ever-evolving world due to constant out-of-distribution shifts
of input data. Therefore, instead of training on fixed datasets, can we
approach learning in a more human-centric and adaptive manner? In this paper,
we introduce Action-aware Embodied Learning for Perception (ALP), an embodied
learning framework that incorporates
action information into representation learning through a combination of
optimizing policy gradients through reinforcement learning and inverse dynamics
prediction objectives. Our method actively explores complex 3D environments to
both learn generalizable, task-agnostic representations and collect
downstream training data. We show that ALP outperforms existing baselines in
object detection and semantic segmentation. In addition, we show that by
training on actively collected data more relevant to the environment and task,
our method generalizes more robustly to downstream tasks compared to models
pre-trained on fixed datasets such as ImageNet.
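The abstract pairs policy gradients with an inverse dynamics objective. Below is a minimal sketch of the inverse dynamics piece, assuming a shared visual encoder produces per-frame features; the layer sizes, action count, and head design are illustrative assumptions, not ALP's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamicsHead(nn.Module):
    """Predicts the discrete action taken between consecutive frames.

    feat_dim and num_actions are hypothetical; the head consumes per-frame
    features from a shared visual encoder, and its loss is combined with the
    RL policy-gradient objective during training.
    """
    def __init__(self, feat_dim=512, num_actions=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_actions),
        )

    def forward(self, feat_t, feat_t1, actions):
        # Concatenate features of frames t and t+1, classify the action.
        logits = self.mlp(torch.cat([feat_t, feat_t1], dim=-1))
        return F.cross_entropy(logits, actions)
```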
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation
Current computer vision models, unlike the human visual system, cannot yet
achieve general-purpose visual understanding. Existing efforts to create a
general vision model are limited in the scope of assessed tasks and offer no
overarching framework to perform them holistically. We present a new
comprehensive benchmark, General-purpose Visual Understanding Evaluation
(G-VUE), covering the full spectrum of visual cognitive abilities with four
functional domains: Perceive, Ground, Reason, and Act. The four domains are
embodied in 11 carefully curated tasks, from 3D reconstruction
to visual reasoning and manipulation. Along with the benchmark, we provide a
general encoder-decoder framework to allow for the evaluation of arbitrary
visual representation on all 11 tasks. We evaluate various pre-trained visual
representations with our framework and observe that (1) Transformer-based
visual backbones generally outperform CNN-based backbones on G-VUE, and (2)
visual representations from vision-language pre-training are superior to those
from vision-only pre-training across visual tasks. With G-VUE, we provide a
holistic evaluation standard to motivate research toward building
general-purpose visual systems by obtaining more general-purpose visual
representations.
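As a rough illustration of the encoder-decoder evaluation framework described above, the sketch below probes an arbitrary frozen visual backbone with per-task decoders; the class name, freezing policy, and interfaces are assumptions, not G-VUE's released code.

```python
import torch
import torch.nn as nn

class FrozenBackboneProbe(nn.Module):
    """Evaluate an arbitrary visual backbone with per-task decoders.

    `backbone` maps images to a shared feature vector; `task_decoders` maps
    task names to lightweight output heads. Freezing the backbone isolates
    the quality of the representation itself.
    """
    def __init__(self, backbone, task_decoders):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.decoders = nn.ModuleDict(task_decoders)

    def forward(self, images, task):
        with torch.no_grad():
            feats = self.backbone(images)
        return self.decoders[task](feats)
```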
LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following
End-to-end Transformers have demonstrated an impressive success rate for
Embodied Instruction Following when the environment has been seen in training.
However, they tend to struggle when deployed in an unseen environment. This
lack of generalizability is due to the agent's insensitivity to subtle changes
in natural language instructions. To mitigate this issue, we propose explicitly
aligning the agent's hidden states with the instructions via contrastive
learning. Nevertheless, the semantic gap between high-level language
instructions and the agent's low-level action space remains an obstacle.
Therefore, we further introduce a novel concept of meta-actions to bridge the
gap. Meta-actions are ubiquitous action patterns that can be parsed from the
original action sequence. These patterns represent higher-level semantics that
are intuitively aligned closer to the instructions. When meta-actions are
applied as additional training signals, the agent generalizes better to unseen
environments. Compared to a strong multi-modal Transformer baseline, we achieve
a significant 4.5% absolute gain in success rate on unseen environments of the
ALFRED Embodied Instruction Following benchmark. Additional analysis shows that the
contrastive objective and meta-actions are complementary in achieving the best
results, and the resulting agent better aligns its states with corresponding
instructions, making it more suitable for real-world embodied agents. The code
is available at: https://github.com/joeyy5588/LACMA.
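A toy illustration of the meta-action idea: parsing a low-level action sequence into coarser repeated patterns. The run-length grouping below is a deliberately simple stand-in; LACMA's actual meta-action inventory is richer than this.

```python
from itertools import groupby

def parse_meta_actions(actions):
    """Collapse runs of a repeated primitive into (action, count) pairs.

    A deliberately simple parser: e.g. five consecutive MoveAhead steps
    become one ("MoveAhead", 5) meta-action, a higher-level unit that sits
    closer to instruction phrases like "go down the hallway".
    """
    return [(action, len(list(run))) for action, run in groupby(actions)]

# Example:
# parse_meta_actions(["MoveAhead"] * 5 + ["RotateLeft"] + ["MoveAhead"] * 3)
# -> [("MoveAhead", 5), ("RotateLeft", 1), ("MoveAhead", 3)]
```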
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes
Training models to apply common-sense linguistic knowledge and visual
concepts from 2D images to 3D scene understanding is a promising direction that
researchers have only recently started to explore. However, it remains
understudied whether 2D distilled knowledge can provide useful representations
for downstream 3D vision-language tasks such as 3D question answering. In this
paper, we propose a novel 3D pre-training Vision-Language method, namely
Multi-CLIP, that enables a model to learn language-grounded and transferable 3D
scene point cloud representations. We leverage the representational power of
the CLIP model by maximizing the agreement between the encoded 3D scene
features and the corresponding 2D multi-view image and text embeddings in the
CLIP space via a contrastive objective. To validate our approach, we consider
the challenging downstream tasks of 3D Visual Question Answering (3D-VQA) and
3D Situated Question Answering (3D-SQA). To this end, we develop novel
multi-modal transformer-based architectures and demonstrate how our
pre-training method benefits their performance. Quantitative and qualitative
experimental results show that Multi-CLIP outperforms state-of-the-art works
across the downstream tasks of 3D-VQA and 3D-SQA and leads to a well-structured
3D scene feature space.
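A minimal sketch of the contrastive alignment the abstract describes: projected 3D scene embeddings are pulled toward frozen CLIP image and text embeddings of the same scene. The symmetric two-term form and temperature are assumptions, not Multi-CLIP's exact loss.

```python
import torch
import torch.nn.functional as F

def clip_space_alignment_loss(scene_feats, clip_img_feats, clip_txt_feats,
                              tau=0.07):
    """Pull 3D scene embeddings toward frozen CLIP embeddings of the scene.

    scene_feats: (B, D) encoded point-cloud scenes projected into CLIP's
    space; clip_img_feats / clip_txt_feats: (B, D) CLIP embeddings of the
    matching multi-view images and captions. Row i across all three tensors
    describes the same scene.
    """
    s = F.normalize(scene_feats, dim=-1)
    targets = torch.arange(s.size(0), device=s.device)
    loss = 0.0
    for ref in (clip_img_feats, clip_txt_feats):
        logits = s @ F.normalize(ref, dim=-1).t() / tau
        loss = loss + F.cross_entropy(logits, targets)
    return loss / 2
```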
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
We introduce Point-Bind, a 3D multi-modality model that aligns point clouds
with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint
embedding space between 3D and multi-modalities, enabling many promising
applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D
open-world understanding. On top of this, we further present Point-LLM, the
first 3D large language model (LLM) to follow 3D multi-modal instructions.
Using parameter-efficient fine-tuning techniques, Point-LLM injects the
semantics of Point-Bind into pre-trained LLMs such as LLaMA; it requires no 3D
instruction data yet exhibits superior 3D and multi-modal question-answering
capacity. We hope our work sheds light on extending 3D point clouds to
multi-modal applications. Code is available at
https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
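As a toy illustration of what a joint embedding space enables (for instance the "3D embedding arithmetic" mentioned above), the sketch below composes query embeddings from several aligned modalities and retrieves the closest 3D asset; all names and interfaces here are hypothetical.

```python
import torch
import torch.nn.functional as F

def compose_and_retrieve(query_embs, shape_embs):
    """Compose queries from any aligned modalities and retrieve a 3D asset.

    query_embs: list of (D,) embeddings (from point-cloud, image, audio, or
    text encoders that share one joint space); summing composes the query.
    shape_embs: (N, D) bank of candidate 3D shape embeddings.
    """
    q = F.normalize(torch.stack(query_embs).sum(dim=0), dim=-1)
    bank = F.normalize(shape_embs, dim=-1)
    return int(torch.argmax(bank @ q))   # index of best-matching shape
```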
EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
Generating vivid and diverse 3D co-speech gestures is crucial for various
applications in animating virtual avatars. While most existing methods can
generate gestures directly from audio, they usually overlook that emotion is a
key factor in authentic co-speech gesture generation. In this work,
we propose EmotionGesture, a novel framework for synthesizing vivid and diverse
emotional co-speech 3D gestures from audio. Since emotion is often entangled
with the rhythmic beat of speech audio, we first develop an Emotion-Beat
Mining module (EBM) to extract emotion and audio-beat features and to model
their correlation via a transcript-based visual-rhythm alignment. Then, we
propose an initial-pose-based Spatial-Temporal Prompter (STP) to generate
future gestures from given initial poses. STP effectively models the
spatial-temporal correlations between the initial poses and the future
gestures, producing a spatially and temporally coherent pose prompt. Given the
pose prompts, emotion features, and audio-beat features, we generate 3D
co-speech gestures through a transformer architecture. However, the poses in
existing datasets often contain jitter, which leads to unstable generated
gestures. To address this issue, we propose an effective objective function,
dubbed the Motion-Smooth Loss: we model the motion offset to compensate for
jittering ground truth by forcing generated gestures to be smooth. Finally, we
present an emotion-conditioned VAE to sample emotion features, enabling us to
generate diverse emotional results. Extensive experiments demonstrate that our
framework outperforms the state-of-the-art, achieving vivid and diverse
emotional co-speech 3D gestures.
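To illustrate the intent of a smoothness objective like the Motion-Smooth Loss, here is a minimal sketch that adds an acceleration penalty on generated pose sequences so predictions stay smooth despite jittery ground truth. The paper's exact offset modeling differs; treat this purely as a sketch of the idea.

```python
import torch
import torch.nn.functional as F

def smooth_gesture_loss(pred_poses, gt_poses, lambda_smooth=0.5):
    """Reconstruction loss plus an acceleration penalty on predicted poses.

    pred_poses, gt_poses: (B, T, J) pose sequences over T frames. The
    second-difference term keeps predictions smooth instead of chasing
    frame-level jitter in the ground truth; lambda_smooth is an assumed
    weighting.
    """
    recon = F.l1_loss(pred_poses, gt_poses)
    # Second-order temporal difference approximates acceleration.
    accel = pred_poses[:, 2:] - 2 * pred_poses[:, 1:-1] + pred_poses[:, :-2]
    return recon + lambda_smooth * accel.abs().mean()
```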
MIMIC: Masked Image Modeling with Image Correspondences
Many pixelwise dense prediction tasks in computer vision, such as depth
estimation and semantic segmentation, today rely on pretrained image
representations. Therefore, curating effective pretraining datasets is vital.
Unfortunately, effective pretraining datasets are those with multi-view
scenes, and such datasets have so far been curated only using annotated 3D
meshes, point clouds, and camera parameters from simulated environments. We
propose a dataset-curation mechanism that does
not require any annotations. We mine two datasets: MIMIC-1M with 1.3M and
MIMIC-3M with 3.1M multi-view image pairs from open-sourced video datasets and
from synthetic 3D environments. We train multiple self-supervised models with
different masked image modeling objectives to showcase the following findings:
Representations trained on MIMIC-3M outperform those trained on
annotation-derived data across multiple downstream tasks, including depth
estimation, semantic segmentation, surface normals, and pose estimation. The
advantage also holds when representations are frozen and when downstream
training data is limited to few-shot settings. The larger dataset (MIMIC-3M)
significantly improves performance, which is promising since our curation
method can scale arbitrarily to produce even larger datasets.
MIMIC code, dataset, and pretrained models are open-sourced at
https://github.com/RAIVNLab/MIMIC.
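A rough sketch of masked image modeling over a multi-view pair, the kind of objective MIMIC's data is curated for: mask patches of one view and reconstruct them with the second view available as context. The `model` interface (including `num_patches` and `patchify`) is a placeholder assumption, not MIMIC's actual API.

```python
import torch
import torch.nn.functional as F

def cross_view_mim_loss(model, view_a, view_b, mask_ratio=0.75):
    """Mask patches of one view; reconstruct them given the paired view.

    view_a, view_b: (B, C, H, W) images of the same scene from two
    viewpoints. `model` is any encoder-decoder that accepts a boolean patch
    mask plus a reference view and exposes `num_patches` and `patchify`
    (placeholder interface).
    """
    B = view_a.size(0)
    keep = int(model.num_patches * (1 - mask_ratio))
    # Random per-sample mask: rank random scores, mask all but `keep` patches.
    ranks = torch.rand(B, model.num_patches, device=view_a.device) \
                 .argsort(dim=1).argsort(dim=1)
    mask = ranks >= keep                       # True = patch to reconstruct
    pred = model(view_a, view_b, mask)         # (B, num_patches, patch_dim)
    target = model.patchify(view_a)            # ground-truth pixel patches
    return F.mse_loss(pred[mask], target[mask])
```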
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
In contrast to the many foundation models in NLP and 2D computer vision,
learning a robust and highly generalizable 3D foundation model poses
considerably greater challenges, primarily due to inherent data variability
and the diversity of downstream tasks. In this paper, we introduce
a comprehensive 3D pre-training framework designed to facilitate the
acquisition of efficient 3D representations, thereby establishing a pathway to
3D foundational models. Motivated by the fact that informative 3D features
should be able to encode rich geometry and appearance cues that can be utilized
to render realistic images, we propose a novel universal paradigm to learn
point cloud representations by differentiable neural rendering, serving as a
bridge between the 3D and 2D worlds. We train a point cloud encoder inside a
purpose-built volumetric neural renderer by comparing rendered images with
real images. Notably, our approach demonstrates the seamless integration of the
learned 3D encoder into diverse downstream tasks. These tasks encompass not
only high-level challenges such as 3D detection and segmentation but also
low-level objectives like 3D reconstruction and image synthesis, spanning both
indoor and outdoor scenarios. We also illustrate the capability of
pre-training a 2D backbone with the proposed universal methodology, surpassing
conventional pre-training methods by a large margin. For the first time,
PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor
benchmarks; the consistent improvements across these settings demonstrate the
effectiveness of the proposed method. Code and models will be made available
at https://github.com/OpenGVLab/PonderV2.
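A minimal sketch of one render-and-compare pretraining step in the spirit of this paradigm: encode a point cloud into a feature volume, render it differentiably from a known camera, and backpropagate a photometric loss against the captured image. The `encoder` and `renderer` interfaces are assumptions, not PonderV2's API.

```python
import torch.nn.functional as F

def render_pretrain_step(encoder, renderer, points, images, camera, optimizer):
    """One render-and-compare pretraining step.

    `encoder` lifts a point cloud to a volumetric feature grid; `renderer`
    differentiably renders that grid from `camera` into an RGB image. The
    photometric gap against the captured `images` is the only supervision.
    """
    feats = encoder(points)               # e.g. (B, C, X, Y, Z) volume
    rendered = renderer(feats, camera)    # (B, 3, H, W)
    loss = F.l1_loss(rendered, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```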