Neural 3D Video Synthesis
We propose a novel approach for 3D video synthesis that is able to represent
multi-view video recordings of a dynamic real-world scene in a compact, yet
expressive representation that enables high-quality view synthesis and motion
interpolation. Our approach takes the high quality and compactness of static
neural radiance fields in a new direction: to a model-free, dynamic setting. At
the core of our approach is a novel time-conditioned neural radiance field
that represents scene dynamics using a set of compact latent codes. To exploit
the fact that changes between adjacent frames of a video are typically small
and locally consistent, we propose two novel strategies for efficient training
of our neural network: 1) An efficient hierarchical training scheme, and 2) an
importance sampling strategy that selects the next rays for training based on
the temporal variation of the input videos. In combination, these two
strategies significantly boost the training speed, lead to fast convergence of
the training process, and enable high quality results. Our learned
representation is highly compact and able to represent a 10 second 30 FPS
multi-view video recording by 18 cameras with a model size of just 28MB. We
demonstrate that our method can render high-fidelity wide-angle novel views at
over 1K resolution, even for highly complex and dynamic scenes. We perform an
extensive qualitative and quantitative evaluation that shows that our approach
outperforms the current state of the art. We include additional video and
information at: https://neural-3d-video.github.io/
Comment: Project website: https://neural-3d-video.github.io
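To make the two training ideas concrete, here is a minimal sketch, in PyTorch, of a radiance field conditioned on per-frame latent codes and of ray-importance weights derived from temporal pixel variation. The names (`TimeConditionedField`, `temporal_importance_weights`), layer sizes, and sampling rule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: a time-conditioned radiance field queried with a per-frame
# latent code, plus ray-importance weights from temporal pixel variation.
# Names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class TimeConditionedField(nn.Module):
    def __init__(self, num_frames: int, latent_dim: int = 64, hidden: int = 128):
        super().__init__()
        # One compact latent code per video frame encodes scene dynamics.
        self.frame_codes = nn.Embedding(num_frames, latent_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, xyz, view_dir, frame_idx):
        z = self.frame_codes(frame_idx)               # (N, latent_dim)
        out = self.mlp(torch.cat([xyz, view_dir, z], dim=-1))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma

def temporal_importance_weights(video, eps=1e-2):
    """Per-pixel sampling weights from temporal variation of the input video.

    video: (T, H, W, 3) tensor in [0, 1]. Pixels that change more over time
    get a higher probability of being selected as training rays.
    """
    residual = (video - video.mean(dim=0, keepdim=True)).abs().mean(dim=(0, -1))
    weights = residual + eps                          # keep static pixels reachable
    return (weights / weights.sum()).flatten()        # (H*W,) categorical weights

# Usage sketch: sample ray indices in proportion to temporal change.
video = torch.rand(30, 64, 64, 3)
w = temporal_importance_weights(video)
ray_ids = torch.multinomial(w, num_samples=1024, replacement=True)
```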
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
For robots to perform a wide variety of tasks, they require a 3D
representation of the world that is semantically rich, yet compact and
efficient for task-driven perception and planning. Recent approaches have
attempted to leverage features from large vision-language models to encode
semantics in 3D representations. However, these approaches tend to produce maps
with per-point feature vectors, which do not scale well in larger environments,
nor do they contain semantic spatial relationships between entities in the
environment, which are useful for downstream planning. In this work, we propose
ConceptGraphs, an open-vocabulary graph-structured representation for 3D
scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing
their output to 3D by multi-view association. The resulting representations
generalize to novel semantic classes, without the need to collect large 3D
datasets or finetune models. We demonstrate the utility of this representation
through a number of downstream planning tasks that are specified through
abstract (language) prompts and require complex reasoning over spatial and
semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer
video: https://youtu.be/mRhNkQwRYnc )
Comment: Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc
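The map-building loop the abstract describes can be sketched roughly as follows: per-view detections are lifted to 3D point sets, associated across views by spatial overlap, and kept as graph nodes with fused semantic features. `ObjectNode`, `bbox_overlap`, the IoU threshold, and the distance-based edge rule are all assumptions for illustration, not the ConceptGraphs code.

```python
# Hedged sketch of open-vocabulary map building by multi-view association.
# All names and thresholds are illustrative assumptions.
import numpy as np
from dataclasses import dataclass

@dataclass
class ObjectNode:
    points: np.ndarray        # (N, 3) fused 3D points
    feature: np.ndarray       # open-vocabulary embedding (e.g. CLIP-style)
    num_views: int = 1

def bbox_overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Rough 3D association score: IoU of axis-aligned bounding boxes."""
    lo = np.maximum(a.min(0), b.min(0))
    hi = np.minimum(a.max(0), b.max(0))
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda p: np.prod(p.max(0) - p.min(0))
    union = vol(a) + vol(b) - inter
    return float(inter / union) if union > 0 else 0.0

def fuse_detection(nodes, points_3d, feature, iou_thresh=0.25):
    """Associate one lifted 2D detection with an existing node or start a new one."""
    for node in nodes:
        if bbox_overlap(node.points, points_3d) > iou_thresh:
            node.points = np.vstack([node.points, points_3d])
            node.feature = (node.feature * node.num_views + feature) / (node.num_views + 1)
            node.num_views += 1
            return
    nodes.append(ObjectNode(points=points_3d, feature=feature))

def spatial_edges(nodes, max_dist=1.0):
    """Edges between nearby objects; relation labels would come from an LLM/VLM."""
    centers = [n.points.mean(0) for n in nodes]
    return [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
            if np.linalg.norm(centers[i] - centers[j]) < max_dist]
```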
VoxDet: Voxel Learning for Novel Instance Detection
Detecting unseen instances based on multi-view templates is a challenging
problem due to its open-world nature. Traditional methodologies, which
primarily rely on 2D representations and matching techniques, are often
inadequate in handling pose variations and occlusions. To solve this, we
introduce VoxDet, a pioneering 3D geometry-aware framework that fully utilizes the
strong 3D voxel representation and reliable voxel matching mechanism. VoxDet
first proposes a template voxel aggregation (TVA) module, effectively
transforming multi-view 2D images into 3D voxel features. By leveraging
associated camera poses, these features are aggregated into a compact 3D
template voxel. In novel instance detection, this voxel representation
demonstrates heightened resilience to occlusion and pose variations. We also
discover that a 3D reconstruction objective helps to pre-train the 2D-3D
mapping in TVA. Second, to quickly align with the template voxel, VoxDet
incorporates a Query Voxel Matching (QVM) module. The 2D queries are first
converted into their voxel representation with the learned 2D-3D mapping. We
find that since the 3D voxel representations encode the geometry, we can first
estimate the relative rotation and then compare the aligned voxels, leading to
improved accuracy and efficiency. Exhaustive experiments are conducted on the
demanding LineMod-Occlusion, YCB-video, and the newly built RoboTools
benchmarks, where VoxDet outperforms various 2D baselines remarkably with 20%
higher recall and faster speed. To the best of our knowledge, VoxDet is the
first to incorporate implicit 3D knowledge for 2D tasks.
Comment: 17 pages, 10 figures
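A rough sketch of the two modules the abstract names follows: a TVA-style aggregation that fills a voxel grid by projecting voxel centers into each posed view and sampling 2D features, and a QVM-style step that searches candidate rotations before comparing voxels. Shapes, names, and the permutation-based rotation trick are illustrative assumptions, not VoxDet's actual design.

```python
# Hedged sketch of template voxel aggregation and query voxel matching.
import torch
import torch.nn.functional as F

def aggregate_template_voxel(feat_maps, K, poses, grid, img_size):
    """feat_maps: (V, C, H, W); K: (3, 3) intrinsics; poses: (V, 4, 4) world->cam;
    grid: (N, 3) voxel centers in world coordinates. Returns (C, N) averaged features."""
    V, C, H, W = feat_maps.shape
    ones = torch.ones(grid.shape[0], 1)
    accum = torch.zeros(C, grid.shape[0])
    for v in range(V):
        cam = (poses[v] @ torch.cat([grid, ones], dim=1).T)[:3]   # (3, N) camera coords
        uv = (K @ cam) / cam[2:3].clamp(min=1e-6)                 # pinhole projection
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        gx = uv[0] / (img_size[1] - 1) * 2 - 1
        gy = uv[1] / (img_size[0] - 1) * 2 - 1
        samp_grid = torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(feat_maps[v:v + 1], samp_grid, align_corners=True)
        accum += sampled.view(C, -1)
    return accum / V                                              # averaged template voxel

def match_query_voxel(query_vox, template_vox, rotations):
    """Pick the rotation whose rotated query best correlates with the template.
    query_vox/template_vox: (C, D, D, D); rotations: precomputed index permutations
    approximating voxel-grid rotations (an assumption made for brevity)."""
    best, best_score = None, -float("inf")
    flat_t = template_vox.flatten()
    for perm in rotations:
        rotated = query_vox.flatten(1)[:, perm].flatten()
        score = F.cosine_similarity(rotated, flat_t, dim=0)
        if score > best_score:
            best, best_score = perm, score
    return best, best_score
```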
UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction
In recent years, many video tasks have achieved breakthroughs by utilizing
the vision transformer and establishing spatial-temporal decoupling for feature
extraction. Although multi-view 3D reconstruction also faces multiple images as
input, it cannot directly inherit their success because the associations between
unstructured views are completely ambiguous: there is no usable prior
relationship analogous to the temporal coherence of a video.
To solve this problem, we propose a novel transformer network for Unstructured
Multiple Images (UMIFormer). It exploits transformer blocks for decoupled
intra-view encoding and dedicated token-rectification blocks that mine the
correlations between similar tokens from different views to achieve decoupled
inter-view encoding. Afterward, all tokens acquired from various branches are
compressed into a fixed-size compact representation while preserving rich
information for reconstruction by leveraging the similarities between tokens.
We empirically demonstrate on ShapeNet that our decoupled learning method
adapts to unstructured multiple images, and the experiments verify that our
model outperforms existing SOTA methods by a large margin. Code will be
available at https://github.com/GaryZhu1996/UMIFormer.
Comment: Accepted by ICCV 202
Computer Vision and Image Understanding
A compact visual representation, called the 3D layered, adaptive-resolution, and
multi-perspective panorama (LAMP), is proposed for representing large-scale 3D
scenes with large variations of depths and obvious occlusions. Two kinds of 3D
LAMP representations are proposed: the relief-like LAMP and the image-based
LAMP. Both types of LAMPs concisely represent almost all the information from a
long image sequence. Methods to construct LAMP representations from video
sequences with dominant translation are provided. The relief-like LAMP is
basically a single extended multi-perspective panoramic view image. Each pixel
has a pair of texture and depth values, but each pixel may also have multiple
pairs of texture-depth values to represent occlusion in layers, in addition to
adaptive resolution changing with depth. The image-based LAMP, on the other
hand, consists of a set of multi-perspective layers, each of which has a pair
of 2D texture and depth maps, but with adaptive time-sampling scales depending
on depths of scene points. Several examples of 3D LAMP construction for real
image sequences are given. The 3D LAMP is a concise and powerful representation
for image-based rendering.
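The relief-like LAMP is essentially a data structure: a panorama whose pixels each hold one or more (texture, depth) pairs so that occluded layers are preserved. The small sketch below illustrates that structure only; the class name, the depth-tolerance merge rule, and the front-layer rendering are assumptions for illustration, not the paper's construction method.

```python
# Hedged sketch of a layered panorama with per-pixel (texture, depth) pairs.
import numpy as np

class ReliefLamp:
    def __init__(self, width: int, height: int, depth_tol: float = 0.05):
        # Each pixel holds a list of (rgb, depth) layers, kept sorted by depth.
        self.layers = [[[] for _ in range(width)] for _ in range(height)]
        self.depth_tol = depth_tol

    def splat(self, x: int, y: int, rgb, depth: float):
        """Insert a sample; merge into an existing layer if the depth is close,
        otherwise add a new occlusion layer for this pixel."""
        pixel = self.layers[y][x]
        for i, (c, d) in enumerate(pixel):
            if abs(d - depth) < self.depth_tol:
                pixel[i] = (tuple((np.array(c) + np.array(rgb)) / 2), (d + depth) / 2)
                return
        pixel.append((tuple(rgb), depth))
        pixel.sort(key=lambda layer: layer[1])

    def front_layer(self):
        """Render the nearest layer per pixel (the ordinary panoramic view)."""
        h, w = len(self.layers), len(self.layers[0])
        img = np.zeros((h, w, 3))
        for y in range(h):
            for x in range(w):
                if self.layers[y][x]:
                    img[y, x] = self.layers[y][x][0][0]
        return img
```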
LIDAR GAIT: Benchmarking 3D Gait Recognition with Point Clouds
Video-based gait recognition has achieved impressive results in constrained
scenarios. However, visual cameras neglect human 3D structure information,
which limits the feasibility of gait recognition in the unconstrained 3D world. In this
work, instead of extracting gait features from images, we explore precise 3D
gait features from point clouds and propose a simple yet efficient 3D gait
recognition framework, termed multi-view projection network (MVPNet). MVPNet
first projects point clouds into multiple depth maps from different
perspectives, and then fuses the depth images together to learn a compact
representation with 3D geometry information. Due to the lack of point cloud
datasets, we build the first large-scale Lidar-based gait recognition dataset,
LIDAR GAIT, collected by a Lidar sensor and an RGB camera mounted on a robot.
The dataset contains 25,279 sequences from 1,050 subjects and covers many
different variations, including visibility, views, occlusions, clothing,
carrying, and scenes. Extensive experiments show that, (1) 3D structure
information serves as a significant feature for gait recognition. (2) MVPNet
not only competes with five representative point-based methods, but it also
outperforms existing camera-based methods by large margins. (3) The Lidar
sensor is superior to the RGB camera for gait recognition in the wild. LIDAR
GAIT dataset and MVPNet code will be publicly available.
Comment: 16 pages, 16 figures, 3 tables
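The projection step the abstract describes can be sketched as rendering the point cloud into depth maps from several virtual viewpoints and stacking them for a downstream network to fuse. The view count, resolution, and orthographic projection below are illustrative assumptions, not MVPNet's exact design.

```python
# Hedged sketch of multi-view depth-map projection of a gait point cloud.
import numpy as np

def rotation_about_z(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def project_depth_map(points: np.ndarray, res: int = 64) -> np.ndarray:
    """Orthographic depth map of (N, 3) points viewed along +y."""
    pts = points - points.mean(axis=0)
    scale = np.abs(pts).max() + 1e-8
    u = ((pts[:, 0] / scale * 0.5 + 0.5) * (res - 1)).astype(int)
    v = ((pts[:, 2] / scale * 0.5 + 0.5) * (res - 1)).astype(int)
    depth = np.full((res, res), np.inf)
    for ui, vi, d in zip(u, v, pts[:, 1]):
        depth[vi, ui] = min(depth[vi, ui], d)          # keep the closest point
    depth[np.isinf(depth)] = 0.0
    return depth

def multi_view_depths(points: np.ndarray, num_views: int = 8, res: int = 64) -> np.ndarray:
    """Stack of depth maps from num_views rotations; shape (num_views, res, res)."""
    return np.stack([
        project_depth_map(points @ rotation_about_z(2 * np.pi * i / num_views).T, res)
        for i in range(num_views)
    ])

# Usage: a stand-in point cloud -> 8 x 64 x 64 depth tensor for a fusion CNN.
cloud = np.random.randn(2048, 3)
depths = multi_view_depths(cloud)
```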
Neural View-Interpolation for Sparse Light Field Video
We suggest representing light field (LF) videos as "one-off" neural networks (NN), i.e., a learned mapping from view-plus-time coordinates to high-resolution color values, trained on sparse views. Initially, this sounds like a bad idea for three main reasons: First, an NN LF will likely have lower quality than a same-sized pixel-basis representation. Second, only little training data, e.g., 9 exemplars per frame, is available for sparse LF videos. Third, there is no generalization across LFs, but across view and time instead. Consequently, a network needs to be trained for each LF video. Surprisingly, these problems can turn into substantial advantages: Unlike the linear pixel basis, an NN has to come up with a compact, non-linear, i.e., more intelligent, explanation of color, conditioned on the sparse view and time coordinates. As observed for many NNs, however, this representation is now interpolatable: if the image output for sparse view coordinates is plausible, it is for all intermediate, continuous coordinates as well. Our specific network architecture involves a differentiable occlusion-aware warping step, which leads to a compact set of trainable parameters and consequently fast learning and fast execution.
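The "one-off" network idea amounts to fitting a small coordinate network per video, then querying it at intermediate coordinates. The sketch below shows that pattern only; it omits the paper's occlusion-aware warping step, and the class name, layer sizes, and training data are assumptions.

```python
# Hedged sketch: an MLP from (view u, view v, time t, pixel x, pixel y) to RGB,
# fitted to one sparse light-field video and queried at in-between coordinates.
import torch
import torch.nn as nn

class OneOffLFVideo(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords):                 # coords: (N, 5) in [0, 1]
        return self.net(coords)

# Training loop sketch on sparse views; interpolation is just querying the
# trained network at intermediate (u, v, t) values.
model = OneOffLFVideo()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
coords = torch.rand(4096, 5)                   # stand-in for sampled sparse-view pixels
colors = torch.rand(4096, 3)                   # stand-in for their ground-truth colors
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(coords), colors)
    loss.backward()
    opt.step()
novel = model(torch.tensor([[0.5, 0.5, 0.25, 0.3, 0.7]]))   # in-between view/time query
```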