5 research outputs found
SST: Real-time End-to-end Monocular 3D Reconstruction via Sparse Spatial-Temporal Guidance
Real-time monocular 3D reconstruction is a challenging problem that remains
unsolved. Although recent end-to-end methods have demonstrated promising
results, they hardly capture tiny structures and geometric boundaries: their
supervision is insufficient and neglects spatial details, and their feature
fusion is oversimplified and ignores temporal cues. To address these
problems, we propose an end-to-end 3D reconstruction network, SST, which
utilizes Sparse points estimated by a visual SLAM system as additional
Spatial guidance and fuses Temporal features via a novel cross-modal
attention mechanism, achieving more detailed reconstruction results. We
propose a Local Spatial-Temporal Fusion module to exploit more informative
spatial-temporal cues from multi-view color information and sparse priors,
as well as a Global Spatial-Temporal Fusion module to refine the local TSDF
volumes with the world-frame model from coarse to fine. Extensive
experiments on ScanNet and 7-Scenes demonstrate that SST outperforms all
state-of-the-art competitors while keeping a high inference speed of 59 FPS,
enabling real-world applications with real-time requirements.
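The cross-modal attention described above can be pictured with a short,
hedged sketch: per-voxel image features act as queries over sparse
SLAM-point features. This is a minimal reading of the abstract, not the
authors' code; every module name and tensor shape below is an assumption.

```python
# Minimal sketch of cross-modal attention between volume features and sparse
# SLAM points, as one plausible reading of SST's spatial guidance.
# All names and shapes are hypothetical, not taken from the paper's code.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats: torch.Tensor, point_feats: torch.Tensor):
        # voxel_feats: (B, N_voxels, dim) image-derived volume features
        # point_feats: (B, N_points, dim) features of sparse SLAM points
        fused, _ = self.attn(voxel_feats, point_feats, point_feats)
        return self.norm(voxel_feats + fused)  # residual fusion

fusion = CrossModalAttention()
out = fusion(torch.randn(2, 1024, 64), torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 1024, 64])
```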
DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field
Reconstructing hand-held objects from a single RGB image is an important and
challenging problem. Existing works utilizing Signed Distance Fields (SDF)
reveal limitations in comprehensively capturing complex hand-object
interactions, since an SDF is only reliable within the proximity of the
target and is hence unable to simultaneously encode local hand and object
cues. To address this issue, we propose DDF-HO, a novel approach leveraging
a Directed Distance Field (DDF) as the shape representation. Unlike an SDF,
a DDF maps a ray in 3D space, consisting of an origin and a direction, to
corresponding DDF values: a binary visibility signal determining whether the
ray intersects the object, and a distance value measuring the distance from
the origin to the target in the given direction. We randomly sample multiple
rays and collect local-to-global geometric features for them by introducing
a novel 2D ray-based feature aggregation scheme and a 3D intersection-aware
hand pose embedding, combining 2D and 3D features to model hand-object
interactions. Extensive experiments on synthetic and real-world datasets
demonstrate that DDF-HO consistently outperforms all baseline methods by a
large margin, especially under Chamfer Distance, where it achieves an
improvement of about 80%. Codes and trained models will be released soon.
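The ray-to-value mapping that defines a DDF is easy to illustrate in
isolation. The sketch below hand-codes the exact DDF of a sphere, purely to
show the (origin, direction) -> (visibility, distance) contract the abstract
describes; DDF-HO itself learns this mapping from images.

```python
# Illustrative only: the analytic DDF of a sphere centered at the origin,
# showing the (origin, direction) -> (visibility, distance) contract.
import numpy as np

def sphere_ddf(origin, direction, radius=1.0):
    d = direction / np.linalg.norm(direction)
    b = np.dot(origin, d)
    c = np.dot(origin, origin) - radius ** 2
    disc = b * b - c
    if disc < 0:
        return False, np.inf            # visibility 0: the ray misses
    t = -b - np.sqrt(disc)              # nearest intersection distance
    if t < 0:
        t = -b + np.sqrt(disc)          # origin inside: take the far root
    return (t >= 0), (t if t >= 0 else np.inf)

origin = np.array([0.0, 0.0, -3.0])
direction = np.array([0.0, 0.0, 1.0])
hit, dist = sphere_ddf(origin, direction)
print(hit, dist, origin + dist * direction)  # True 2.0 [ 0.  0. -1.]
```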
MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision
Previous works concerning single-view hand-held object reconstruction
typically utilize supervision from 3D ground truth models, which are hard to
collect in real world. In contrast, abundant videos depicting hand-object
interactions can be accessed easily with low cost, although they only give
partial object observations with complex occlusion. In this paper, we present
MOHO to reconstruct hand-held object from a single image with multi-view
supervision from hand-object videos, tackling two predominant challenges
including object's self-occlusion and hand-induced occlusion. MOHO inputs
semantic features indicating visible object parts and geometric embeddings
provided by hand articulations as partial-to-full cues to resist object's
self-occlusion, so as to recover full shape of the object. Meanwhile, a novel
2D-3D hand-occlusion-aware training scheme following the synthetic-to-real
paradigm is proposed to release hand-induced occlusion. In the synthetic
pre-training stage, 2D-3D hand-object correlations are constructed by
supervising MOHO with rendered images to complete the hand-concealed regions of
the object in both 2D and 3D space. Subsequently, MOHO is finetuned in real
world by the mask-weighted volume rendering supervision adopting hand-object
correlations obtained during pre-training. Extensive experiments on HO3D and
DexYCB datasets demonstrate that 2D-supervised MOHO gains superior results
against 3D-supervised methods by a large margin. Codes and key assets will be
released soon
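To make the mask-weighted supervision concrete, here is a hedged sketch of
one plausible form of such a loss: a photometric term that is down-weighted
wherever a hand mask covers the object. The function and tensor names are
assumptions, not MOHO's released code.

```python
# Hedged sketch of a mask-weighted rendering loss: pixels occluded by the
# hand are down-weighted so they do not penalize the completed geometry.
# Names and the exact weighting are assumptions, not MOHO's actual loss.
import torch

def mask_weighted_render_loss(rendered, target, hand_mask):
    # rendered, target: (B, 3, H, W); hand_mask: (B, 1, H, W), 1 = hand pixel
    weight = 1.0 - hand_mask                          # trust hand-free pixels
    per_pixel = (rendered - target).pow(2).sum(1, keepdim=True)
    return (weight * per_pixel).sum() / weight.sum().clamp(min=1.0)

loss = mask_weighted_render_loss(torch.rand(2, 3, 64, 64),
                                 torch.rand(2, 3, 64, 64),
                                 (torch.rand(2, 1, 64, 64) > 0.8).float())
print(loss.item())
```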
U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds
In this paper, we propose U-RED, an Unsupervised shape REtrieval and
Deformation pipeline that takes an arbitrary object observation as input,
typically captured as an RGB image or a scan, and jointly retrieves and
deforms geometrically similar CAD models from a pre-established database to
tightly match the target. Since existing methods typically fail to handle
noisy partial observations, U-RED is designed to address this issue from two
aspects. First, since one partial shape may correspond to multiple potential
full shapes, the retrieval method must allow such an ambiguous one-to-many
relationship. U-RED therefore learns to project all possible full shapes of
a partial target onto the surface of a unit sphere; during inference, each
sample on the sphere yields a feasible retrieval. Second, since real-world
partial observations usually contain noticeable noise, a reliable learned
metric that measures the similarity between shapes is necessary for stable
retrieval. In U-RED, we design a novel point-wise residual-guided metric
that allows noise-robust comparison. Extensive experiments on the synthetic
datasets PartNet and ComplementMe and the real-world dataset Scan2CAD
demonstrate that U-RED surpasses existing state-of-the-art approaches by
47.3%, 16.7%, and 31.6%, respectively, under Chamfer Distance.
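The one-to-many retrieval idea admits a compact sketch: condition a
retrieval embedding on both the partial observation and a point sampled on
the unit sphere, so that different samples yield different feasible database
matches. Everything below, the stand-in decoder especially, is hypothetical
illustration, not U-RED's pipeline.

```python
# Conceptual sketch of one-to-many retrieval: each unit-sphere sample plus
# the partial-shape embedding yields one feasible database match.
# The "decoder" is a stand-in for a learned network; all names hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def sample_unit_sphere(n, dim=3):
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def decode_hypothesis(partial_emb, sphere_point, mix):
    # Stand-in for a learned decoder fusing the two inputs.
    return partial_emb + mix @ sphere_point

partial_emb = rng.normal(size=8)        # embedding of the partial scan
mix = rng.normal(size=(8, 3)) * 0.3     # frozen random fusion for the demo
database = rng.normal(size=(100, 8))    # embeddings of full CAD models

for p in sample_unit_sphere(3):         # 3 sphere samples -> 3 retrievals
    h = decode_hypothesis(partial_emb, p, mix)
    print("retrieved model:", np.argmin(np.linalg.norm(database - h, axis=1)))
```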
CCD-3DR: Consistent Conditioning in Diffusion for Single-Image 3D Reconstruction
In this paper, we present a novel shape reconstruction method that leverages
a diffusion model to generate a sparse 3D point cloud for the object
captured in a single RGB image. Recent methods typically leverage a global
embedding or local projection-based features as the condition to guide the
diffusion model. However, such strategies fail to consistently align the
denoised point cloud with the given image, leading to unstable conditioning
and inferior performance. We therefore present CCD-3DR, which exploits a
novel centered diffusion probabilistic model for consistent local feature
conditioning. We constrain the noise and the sampled point cloud from the
diffusion model to a subspace in which the point cloud center remains
unchanged during both the forward and reverse diffusion processes. The
stable point cloud center then serves as an anchor to align each point with
its corresponding local projection-based features. Extensive experiments on
the synthetic benchmark ShapeNet-R2N2 demonstrate that CCD-3DR outperforms
all competitors by a large margin, with over 40% improvement. We also
provide results on the real-world dataset Pix3D to thoroughly demonstrate
the potential of CCD-3DR in real-world applications. Codes will be released
soon.
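The centering constraint itself is small enough to sketch: if both the
initial point cloud and every injected Gaussian noise sample are projected
onto the zero-mean subspace, the cloud's centroid stays at the origin
through the whole forward diffusion. The sketch below, a reading of the
abstract rather than the released code, verifies this numerically.

```python
# Hedged sketch of the centering constraint: zero-mean noise added to a
# zero-centered cloud keeps the centroid fixed at every diffusion step.
# This is one reading of the abstract, not CCD-3DR's released code.
import torch

def center(x):                               # x: (B, N, 3)
    return x - x.mean(dim=1, keepdim=True)

def forward_diffuse(x0, alpha_bar_t):
    x0 = center(x0)                          # start from a centered cloud
    eps = center(torch.randn_like(x0))       # project noise to zero mean
    xt = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    return xt

xt = forward_diffuse(torch.randn(2, 1024, 3), torch.tensor(0.5))
print(xt.mean(dim=1).abs().max())            # ~0: the center never drifts
```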