Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction
We present a framework for learning single-view shape and pose prediction
without using direct supervision for either. Our approach allows leveraging
multi-view observations from unknown poses as a supervisory signal during
training. Our proposed training setup enforces geometric consistency between
the independently predicted shape and pose from two views of the same instance.
We consequently learn to predict shape in an emergent canonical (view-agnostic)
frame along with a corresponding pose predictor. We show empirical and
qualitative results on the ShapeNet dataset and observe performance
encouragingly competitive with previous techniques that rely on stronger forms
of supervision. We also demonstrate the applicability of our framework in a
realistic setting beyond the scope of existing techniques: training on a
dataset of online product images where the underlying shape and pose are
unknown.

Comment: Project URL with code: https://shubhtuls.github.io/mvcSnP
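The training setup described in the abstract amounts to a cross-view consistency objective. Below is a minimal sketch of that idea, not the authors' released code; shape_net, pose_net, and render_silhouette are hypothetical stand-ins for a shape predictor, a pose predictor, and a differentiable silhouette renderer.

```python
import torch

def cross_view_consistency(img_a, img_b, mask_b,
                           shape_net, pose_net, render_silhouette):
    # Shape is predicted from view A in a canonical (view-agnostic) frame,
    # while the pose is predicted independently from view B.
    shape = shape_net(img_a)
    pose_b = pose_net(img_b)
    # Rendering the canonical shape under view B's pose should reproduce
    # view B's observed silhouette; the mismatch is the supervisory signal.
    rendered_mask = render_silhouette(shape, pose_b)
    return torch.mean((rendered_mask - mask_b) ** 2)
```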
Articulation-aware Canonical Surface Mapping
We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that
indicates the mapping from 2D pixels to corresponding points on a canonical
template shape, and 2) inferring the articulation and pose of the template
corresponding to the input image. While previous approaches rely on keypoint
supervision for learning, we present an approach that can learn without such
annotations. Our key insight is that these tasks are geometrically related, and
we can obtain a supervisory signal by enforcing consistency among the
predictions. We present results across a diverse set of animal object
categories, showing that our method can learn articulation and CSM prediction
from image collections using only foreground mask labels for training. We
empirically show that allowing articulation helps learn more accurate CSM
prediction, and that enforcing consistency with the predicted CSM is similarly
critical for learning meaningful articulation.

Comment: To appear at CVPR 2020; project page:
https://nileshkulkarni.github.io/acsm
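As a rough illustration of the consistency insight, the sketch below checks that pixels mapped onto the template and re-projected under the predicted articulated pose land back where they started. The csm_net, pose_net, and project functions are assumptions for illustration, not the paper's API.

```python
import torch

def reprojection_consistency(img, fg_mask, pixel_grid,
                             csm_net, pose_net, project):
    # CSM: each pixel is mapped to a 3D point on the canonical template.
    surface_pts = csm_net(img)                                 # (H, W, 3)
    # Articulation-aware pose: camera plus per-part transformations.
    camera, articulation = pose_net(img)
    # Re-projecting the mapped points should recover the pixel coordinates.
    reprojected = project(surface_pts, camera, articulation)   # (H, W, 2)
    err = ((reprojected - pixel_grid) ** 2).sum(dim=-1)
    # Only foreground pixels carry supervision (masks are the sole labels).
    return (err * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
```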
Aperture Supervision for Monocular Depth Estimation
We present a novel method to train machine learning algorithms to estimate
scene depths from a single image by using the information provided by a
camera's aperture as supervision. Prior works use a depth sensor's outputs or
images of the same scene from alternate viewpoints as supervision, while our
method instead uses images from the same viewpoint taken with a varying camera
aperture. To enable learning algorithms to use aperture effects as supervision,
we introduce two differentiable aperture rendering functions that use the input
image and predicted depths to simulate the depth-of-field effects caused by
real camera apertures. We train a monocular depth estimation network end-to-end
to predict the scene depths that best explain these finite aperture images as
defocus-blurred renderings of the input all-in-focus image.

Comment: To appear at CVPR 2018 (updated to camera ready version - …
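To make the forward model concrete, here is a much-simplified differentiable defocus rendering in the spirit of the paper's aperture rendering functions. The plane-compositing scheme and the gaussian_blur helper are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def render_defocus(sharp, depth, focus_dist, aperture, depth_planes,
                   gaussian_blur):
    # sharp: all-in-focus image (B, 3, H, W); depth: predicted (B, 1, H, W).
    # `gaussian_blur(img, radius)` is an assumed differentiable blur.
    out = torch.zeros_like(sharp)
    total_w = torch.zeros_like(depth)
    for d in depth_planes:
        # Thin-lens circle of confusion grows with distance from the
        # focal plane and with aperture size.
        coc = aperture * abs(d - focus_dist) / d
        # Soft assignment of pixels to this depth plane keeps the
        # rendering differentiable with respect to the predicted depth.
        w = torch.exp(-((depth - d) ** 2) / (2 * 0.05 ** 2))
        out = out + w * gaussian_blur(sharp, coc)
        total_w = total_w + w
    return out / total_w.clamp(min=1e-6)

# Training would then penalize, e.g., the L1 difference between this
# rendering and a real photo taken from the same viewpoint with the
# corresponding finite aperture.
```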