4,379 research outputs found
Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction
In recent years, substantial progress has been made in learning neural implicit
representations from multi-view images for 3D reconstruction. As an additional
input complementing coordinates, sinusoidal positional encodings play a key role
in revealing high-frequency details with coordinate-based neural networks.
However, high-frequency positional encodings make the optimization unstable,
which results in noisy reconstructions and artifacts in empty space. To resolve
this issue in a general sense, we propose to learn neural implicit
representations with quantized coordinates, which reduces the uncertainty and
ambiguity in the field during optimization. Instead of using continuous
coordinates directly, we map each continuous coordinate to its nearest quantized
coordinate, where the quantized coordinates are obtained by discretizing the
field at an extremely high resolution.
We use discrete coordinates and their positional encodings to learn implicit
functions through volume rendering. This significantly reduces the variation in
the sample space and imposes more multi-view consistency constraints at
intersections of rays from different views, which enables the implicit function
to be inferred more effectively. Our quantized coordinates add no computational
burden and can be seamlessly applied on top of the latest methods. Our
evaluations on widely used benchmarks show our superiority over the
state-of-the-art. Our code is available at
https://github.com/MachinePerceptionLab/CQ-NIR
Comment: To appear at ICCV 202
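As a rough illustration of the coordinate-quantization idea described in this abstract, the sketch below snaps continuous sample coordinates to a high-resolution grid via nearest interpolation before computing sinusoidal positional encodings; the resolution, frequency count, and function names are illustrative assumptions, not the paper's implementation.

```python
import math
import torch

def quantize_coords(x, resolution=1024):
    # Snap continuous coordinates in [-1, 1] to the nearest cell center of a
    # high-resolution grid (nearest interpolation among quantized coordinates).
    step = 2.0 / resolution
    return torch.round((x + 1.0) / step) * step - 1.0

def positional_encoding(x, num_freqs=6):
    # Standard sinusoidal encoding applied to the (quantized) coordinates.
    freqs = (2.0 ** torch.arange(num_freqs, dtype=x.dtype)) * math.pi
    angles = x[..., None] * freqs                          # (..., 3, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                       # (..., 3 * 2 * num_freqs)

# Discrete coordinates and their encodings would then feed the implicit network.
pts = torch.rand(4096, 3) * 2.0 - 1.0                      # ray samples in [-1, 1]^3
q = quantize_coords(pts)
features = torch.cat([q, positional_encoding(q)], dim=-1)
```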
Learning Neural Implicit through Volume Rendering with Attentive Depth Fusion Priors
Learning neural implicit representations has achieved remarkable performance
in 3D reconstruction from multi-view images. Current methods use volume
rendering to render implicit representations into either RGB or depth images
that are supervised by multi-view ground truth. However, rendering one view at
a time suffers from incomplete depth at holes and leaves the depth supervision
unaware of occluded structures, which severely affects the accuracy of geometry
inference via volume rendering. To resolve this issue, we propose to
learn neural implicit representations from multi-view RGBD images through
volume rendering with an attentive depth fusion prior. Our prior allows neural
networks to perceive coarse 3D structures from the Truncated Signed Distance
Function (TSDF) fused from all depth images available for rendering. The TSDF
provides access to the depth missing at holes in a single depth image and to
occluded parts that are invisible from the current view. By introducing a novel
attention mechanism, we allow neural networks to directly use the depth fusion
prior with the inferred occupancy as the learned implicit function. Our
attention mechanism works with either a one-time fused TSDF that represents a
whole scene or an incrementally fused TSDF that represents a partial scene in
the context of Simultaneous Localization and Mapping (SLAM). Our evaluations on
widely used benchmarks including synthetic and real-world scans show our
superiority over the latest neural implicit methods. Project page:
https://machineperceptionlab.github.io/Attentive_DF_Prior/
Comment: NeurIPS 202
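A minimal sketch of how a fused TSDF could serve as a depth-fusion prior at query points, assuming a dense TSDF volume and a learned per-point gate; the function names, tensor shapes, and the sigmoid blend are illustrative assumptions rather than the paper's attention mechanism.

```python
import torch
import torch.nn.functional as F

def query_tsdf(tsdf_grid, pts):
    # Trilinearly sample the fused TSDF volume at query points in [-1, 1]^3.
    # tsdf_grid: (1, 1, D, H, W); pts: (N, 3) -> (N, 1)
    grid = pts.view(1, 1, 1, -1, 3)          # grid_sample expects (N, D, H, W, 3)
    vals = F.grid_sample(tsdf_grid, grid, align_corners=True)
    return vals.view(-1, 1)

def fuse_with_prior(pred_value, prior_value, attn_logit):
    # Gate the network prediction with the depth-fusion prior using a learned
    # per-point weight in [0, 1] (a stand-in for the attentive fusion).
    w = torch.sigmoid(attn_logit)
    return w * prior_value + (1.0 - w) * pred_value

# Example: query the prior for ray samples and blend it with network output.
tsdf = torch.randn(1, 1, 128, 128, 128)      # placeholder fused TSDF
pts = torch.rand(4096, 3) * 2.0 - 1.0
prior = query_tsdf(tsdf, pts)
blended = fuse_with_prior(torch.randn(4096, 1), prior, torch.randn(4096, 1))
```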
Articulation-aware Canonical Surface Mapping
We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that
indicates the mapping from 2D pixels to corresponding points on a canonical
template shape, and 2) inferring the articulation and pose of the template
corresponding to the input image. While previous approaches rely on keypoint
supervision for learning, we present an approach that can learn without such
annotations. Our key insight is that these tasks are geometrically related, and
we can obtain a supervisory signal by enforcing consistency among the
predictions. We present results across a diverse set of animal object
categories, showing that our method can learn articulation and CSM prediction
from image collections using only foreground mask labels for training. We
empirically show that allowing articulation helps learn more accurate CSM
prediction, and that enforcing the consistency with predicted CSM is similarly
critical for learning meaningful articulation.
Comment: To appear at CVPR 2020, project page:
https://nileshkulkarni.github.io/acsm
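One way to picture the geometric consistency signal described in this abstract is a reprojection check: each pixel's predicted canonical surface point, once posed and articulated, should project back onto that pixel. The sketch below assumes a hypothetical `camera_project` callable and is only an illustration, not the paper's loss.

```python
import torch

def reprojection_consistency(pixels, csm_points, camera_project):
    # pixels: (N, 2) image coordinates of foreground pixels.
    # csm_points: (N, 3) template surface points predicted by the CSM for them.
    # camera_project: hypothetical callable that articulates/poses template
    # points and projects them to image coordinates, returning (N, 2).
    reproj = camera_project(csm_points)
    return ((reproj - pixels) ** 2).sum(dim=-1).mean()
```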
Learning 3D Human Pose from Structure and Motion
3D human pose estimation from a single image is a challenging problem,
especially for in-the-wild settings due to the lack of 3D annotated data. We
propose two anatomically inspired loss functions and use them with a
weakly-supervised learning framework to jointly learn from large-scale
in-the-wild 2D and indoor/synthetic 3D data. We also present a simple temporal
network that exploits temporal and structural cues present in predicted pose
sequences to temporally harmonize the pose estimations. We carefully analyze
the proposed contributions through loss surface visualizations and sensitivity
analysis to facilitate deeper understanding of their working mechanism. Our
complete pipeline improves the state-of-the-art by 11.8% and 12% on Human3.6M
and MPI-INF-3DHP, respectively, and runs at 30 FPS on a commodity graphics
card.
Comment: ECCV 2018. Project page: https://www.cse.iitb.ac.in/~rdabral/3DPose
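As an illustration of what an anatomically inspired loss can look like (not the paper's exact formulation), the sketch below penalizes deviations of predicted bone-length ratios from those of a reference skeleton; the joint count, parent indices, and normalization are assumptions.

```python
import torch

# Hypothetical parent index per joint for a 17-joint skeleton (illustrative).
PARENTS = torch.tensor([-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15])

def bone_length_ratio_loss(pred_joints, ref_ratios):
    # pred_joints: (B, 17, 3) predicted 3D joints.
    # ref_ratios: (16,) bone lengths of a reference skeleton, normalized to sum to 1.
    child = torch.arange(1, pred_joints.shape[1])
    bones = pred_joints[:, child] - pred_joints[:, PARENTS[child]]
    lengths = bones.norm(dim=-1)                          # (B, 16)
    ratios = lengths / lengths.sum(dim=-1, keepdim=True)  # scale-invariant ratios
    return ((ratios - ref_ratios) ** 2).mean()
```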
- …