Understanding Pure CLIP Guidance for Voxel Grid NeRF Models
We explore the task of text to 3D object generation using CLIP. Specifically,
we use CLIP for guidance without access to any datasets, a setting we refer to
as pure CLIP guidance. While prior work has adopted this setting, there is no
systematic study of mechanisms for preventing adversarial generations within
CLIP. We illustrate how different image-based augmentations prevent the
adversarial generation problem, and how the generated results are impacted. We
test different CLIP model architectures and show that ensembling different
models for guidance can prevent adversarial generations within bigger models
and generate sharper results. Furthermore, we implement an implicit voxel grid
model to show how neural networks provide an additional layer of
regularization, resulting in better geometrical structure and coherency of
generated objects. Compared to prior work, we achieve more coherent results
with higher memory efficiency and faster training speeds.
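The abstract above attributes adversarial-generation robustness to two ingredients: averaging the CLIP score over many image-space augmentations, and ensembling several CLIP architectures. A minimal sketch of that scoring loop is below. The encoder and augmentation functions are stand-ins invented for illustration (a real pipeline would call something like CLIP's `encode_image` on differentiable renders); only the averaging structure reflects the described technique.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for CLIP image encoders of different architectures;
# a real implementation would call model.encode_image(render) instead.
def make_encoder(seed, dim=8):
    proj = np.random.default_rng(seed).normal(size=(dim, dim))
    return lambda image: proj @ image

encoders = [make_encoder(s) for s in (1, 2, 3)]

def augment(image, rng, n_aug=4):
    # Stand-in for image-space augmentations (random crops, perspective
    # jitter, background randomization): small additive perturbations.
    return [image + 0.05 * rng.normal(size=image.shape) for _ in range(n_aug)]

def guidance_score(render, text_embeddings, rng):
    """Average similarity over augmented views and over the encoder ensemble.

    Averaging over many views makes it hard for one adversarial render to
    score highly under every view, and ensembling encoders makes it hard
    to fool all models simultaneously.
    """
    scores = []
    for enc, text_emb in zip(encoders, text_embeddings):
        for view in augment(render, rng):
            scores.append(cosine(enc(view), text_emb))
    return float(np.mean(scores))

render = rng.normal(size=8)
# One text embedding per encoder (each encoder has its own embedding space).
text_embeddings = [enc(rng.normal(size=8)) for enc in encoders]
score = guidance_score(render, text_embeddings, rng)
```

In an actual text-to-3D loop this score would be maximized by backpropagating through the renderer into the voxel grid parameters.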
Im2Pano3D: Extrapolating 360 Structure and Semantics Beyond the Field of View
We present Im2Pano3D, a convolutional neural network that generates a dense
prediction of 3D structure and a probability distribution of semantic labels
for a full 360 panoramic view of an indoor scene when given only a partial
observation (<= 50%) in the form of an RGB-D image. To make this possible,
Im2Pano3D leverages strong contextual priors learned from large-scale synthetic
and real-world indoor scenes. To ease the prediction of 3D structure, we
propose to parameterize 3D surfaces with their plane equations and train the
model to predict these parameters directly. To provide meaningful training
supervision, we use multiple loss functions that consider both pixel level
accuracy and global context consistency. Experiments demonstrate that
Im2Pano3D is able to predict the semantics and 3D structure of the unobserved
scene with more than 56% pixel accuracy and less than 0.52m average distance
error, which is significantly better than alternative approaches.
Comment: Video summary: https://youtu.be/Au3GmktK-S
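The plane-equation parameterization mentioned above has a convenient property: once a plane n·p + d = 0 is predicted for a pixel, the depth along that pixel's camera ray follows in closed form. A small sketch of that recovery step (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def depth_from_plane(normal, offset, ray):
    """Depth along a camera ray for a plane n . p + d = 0, camera at origin.

    Substituting p = t * ray into the plane equation and solving for t
    gives t = -d / (n . ray).
    """
    normal = np.asarray(normal, dtype=float)
    ray = np.asarray(ray, dtype=float)
    denom = normal @ ray
    if abs(denom) < 1e-9:
        raise ValueError("ray is parallel to the plane")
    return -offset / denom

# A wall 2 m in front of the camera: plane z = 2, i.e. n = (0, 0, 1), d = -2.
t = depth_from_plane((0.0, 0.0, 1.0), -2.0, (0.0, 0.0, 1.0))  # t = 2.0
```

Predicting the three plane parameters per pixel rather than raw depth lets the network share one prediction across an entire planar region.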
Multi3DRefer: Grounding Text Description to Multiple 3D Objects
We introduce the task of localizing a flexible number of objects in
real-world 3D scenes using natural language descriptions. Existing 3D visual
grounding tasks focus on localizing a unique object given a text description.
However, such a strict setting is unnatural as localizing potentially multiple
objects is a common need in real-world scenarios and robotic tasks (e.g.,
visual navigation and object rearrangement). To address this setting, we propose
Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains
61926 descriptions of 11609 objects, where zero, single or multiple target
objects are referenced by each description. We also introduce a new evaluation
metric and benchmark methods from prior work to enable further investigation of
multi-modal 3D scene understanding. Furthermore, we develop a better baseline
leveraging 2D features from CLIP by rendering object proposals online with
contrastive learning, which outperforms the state of the art on the ScanRefer
benchmark.
Comment: ICCV 202
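When a description may reference zero, one, or many objects, a single-object localization accuracy no longer applies; a natural family of metrics scores the predicted box set against the ground-truth set. The sketch below is one plausible instance, an F1 score at an IoU threshold with greedy matching and an explicit zero-target case; it is an illustration of the evaluation setting, not the paper's exact metric.

```python
import numpy as np

def iou_aabb(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_f1(pred_boxes, gt_boxes, iou_thresh=0.5):
    """F1 between predicted and ground-truth boxes at an IoU threshold.

    Predicting nothing when the description references nothing counts
    as a perfect score; any prediction against an empty target set
    (or vice versa) scores zero.
    """
    if not pred_boxes and not gt_boxes:
        return 1.0
    if not pred_boxes or not gt_boxes:
        return 0.0
    matched_gt, tp = set(), 0
    for p in pred_boxes:  # greedy one-to-one matching
        best, best_iou = None, iou_thresh
        for j, g in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            iou = iou_aabb(p, g)
            if iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            matched_gt.add(best)
            tp += 1
    precision, recall = tp / len(pred_boxes), tp / len(gt_boxes)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

unit = (0, 0, 0, 1, 1, 1)
shifted = (0.1, 0, 0, 1.1, 1, 1)
score = grounding_f1([unit, shifted], [unit])  # one true positive of two predictions
```

Here the extra prediction halves precision while recall stays perfect, giving F1 = 2/3.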
Evaluating 3D Shape Analysis Methods for Robustness to Rotation Invariance
This paper analyzes the robustness of recent 3D shape descriptors to SO(3)
rotations, something that is fundamental to shape modeling. Specifically, we
formulate the task of rotated 3D object instance detection. To do so, we
consider a database of 3D indoor scenes, where objects occur in different
orientations. We benchmark different methods for feature extraction and
classification in the context of this task. We systematically contrast
different choices in a variety of experimental settings investigating the
impact on the performance of different rotation distributions, different
degrees of partial observations on the object, and the different levels of
difficulty of negative pairs. Our study, on a synthetic dataset of 3D scenes
where object instances occur in different orientations, reveals that deep
learning-based rotation invariant methods are effective for relatively easy
settings with easy-to-distinguish pairs. However, their performance decreases
significantly when the difference in rotation between the input pair is large,
when the input objects are only partially observed, or when the difficulty of
the input pair is increased. Finally, we relate the feature encodings used by
rotation-invariant methods to the underlying 3D geometry that allows them to
achieve rotation invariance.
Comment: 20th Conference on Robots and Vision (CRV) 202