Deep Non-Rigid Structure from Motion
Current non-rigid structure from motion (NRSfM) algorithms are mainly limited
with respect to: (i) the number of images, and (ii) the type of shape
variability they can handle. This has hampered the practical utility of NRSfM
for many applications within vision. In this paper we propose a novel deep
neural network to recover camera poses and 3D points solely from an ensemble of
2D image coordinates. The proposed neural network is mathematically
interpretable as a multi-layer block sparse dictionary learning problem, and
can handle problems of unprecedented scale and shape complexity. Extensive
experiments demonstrate the strong performance of our approach, which exhibits
precision and robustness superior to all available state-of-the-art methods,
in some cases by an order of magnitude. We further propose a quality
measure (based on the network weights) which circumvents the need for 3D
ground-truth to ascertain the confidence we have in the reconstruction.
Comment: Oral Paper in ICCV 2019. arXiv admin note: substantial text overlap
with arXiv:1902.10840, arXiv:1907.1312
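As a hedged illustration of the multi-layer block sparse dictionary learning view this abstract mentions, the sketch below shows a single block-sparse coding step (one proximal-gradient iteration with block soft-thresholding). The dictionary D, block size b, and step size eta are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def block_soft_threshold(z, b, lam):
    """Shrink each length-b block of z toward zero (group-lasso proximal op)."""
    blocks = z.reshape(-1, b)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return (blocks * scale).ravel()

def ista_step(D, w, psi, lam=0.1, eta=0.01, b=3):
    """One step of min_psi ||w - D @ psi||^2 + lam * (sum of block norms)."""
    grad = D.T @ (D @ psi - w)                  # gradient of the data term
    return block_soft_threshold(psi - eta * grad, b, eta * lam)

# toy usage: a 20-dim measurement, 12 atoms grouped into blocks of 3
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 12))
w = rng.standard_normal(20)
psi = np.zeros(12)
for _ in range(50):
    psi = ista_step(D, w, psi)
```

Unrolling such iterations into layers is what makes each network layer interpretable as one step of block-sparse coding.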
Procrustean Regression Networks: Learning 3D Structure of Non-Rigid Objects from 2D Annotations
We propose a novel framework for training neural networks which is capable of
learning 3D information of non-rigid objects when only 2D annotations are
available as ground truths. Recently, there have been some approaches that
incorporate the problem setting of non-rigid structure-from-motion (NRSfM) into
deep learning to learn 3D structure reconstruction. The most important
difficulty of NRSfM is to estimate both the rotation and deformation at the
same time, and previous works handle this by regressing both of them. In this
paper, we resolve this difficulty by proposing a loss function wherein the
suitable rotation is automatically determined. Trained with a cost function
consisting of the reprojection error and the low-rank term of aligned shapes,
the network learns the 3D structures of objects such as human skeletons and
faces during training, whereas testing is done on a single-frame basis.
The proposed method can handle inputs with missing entries, and experimental
results validate that the proposed framework shows superior reconstruction
performance to the state-of-the-art method on the Human 3.6M, 300-VW, and
SURREAL datasets, even though the underlying network structure is very simple.
Comment: ECCV 2020
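The two loss terms named in this abstract can be sketched as an orthographic reprojection error plus a nuclear-norm (low-rank) penalty on shapes aligned by orthogonal Procrustes, which is one way the "suitable rotation" can be determined automatically. Shapes S (F x P x 3), 2D targets W (F x P x 2), and the mean-shape reference are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def procrustes_rotation(S, ref):
    """Rotation R minimizing ||S @ R - ref||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(S.T @ ref)
    R = U @ Vt
    if np.linalg.det(R) < 0:            # reflect back to a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    return R

def pr_loss(S, W, lam=0.1):
    """S: F x P x 3 predicted shapes, W: F x P x 2 observed 2D points."""
    F, P, _ = S.shape
    reproj = np.mean((S[..., :2] - W) ** 2)          # orthographic projection
    ref = S.mean(axis=0)                             # reference for alignment
    aligned = np.stack([s @ procrustes_rotation(s, ref) for s in S])
    low_rank = np.linalg.norm(aligned.reshape(F, -1), ord="nuc")
    return reproj + lam * low_rank
```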
Deep NRSfM++: Towards Unsupervised 2D-3D Lifting in the Wild
The recovery of 3D shape and pose from 2D landmarks stemming from a large
ensemble of images can be viewed as a non-rigid structure from motion (NRSfM)
problem. Classical NRSfM approaches, however, are problematic as they rely on
heuristic priors on the 3D structure (e.g. low rank) that do not scale well to
large datasets. Learning-based methods are showing the potential to reconstruct
a much broader set of 3D structures than classical methods -- dramatically
expanding the importance of NRSfM to atemporal unsupervised 2D to 3D lifting.
Hitherto, these learning approaches have not been able to effectively model
perspective cameras or handle missing/occluded points -- limiting their
applicability to in-the-wild datasets. In this paper, we present a generalized
strategy for improving learning-based NRSfM methods to tackle the above issues.
Our approach, Deep NRSfM++, achieves state-of-the-art performance across
numerous large-scale benchmarks, outperforming both classical and
learning-based 2D-3D lifting methods.
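A minimal sketch of the two ingredients highlighted above: a perspective camera model and masking of missing/occluded points in the loss. Known intrinsics K and a per-point visibility mask are assumptions for illustration only.

```python
import numpy as np

def perspective_project(X, K):
    """Project camera-frame 3D points X (P x 3) with intrinsics K (3 x 3)."""
    x = (K @ X.T).T
    return x[:, :2] / x[:, 2:3]                      # perspective division

def masked_reproj_loss(X, w, visible, K):
    """Mean squared reprojection error over visible points only."""
    err = np.sum((perspective_project(X, K) - w) ** 2, axis=1)
    return (err * visible).sum() / max(visible.sum(), 1)
```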
GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
We propose GeoNet, a jointly unsupervised learning framework for monocular
depth, optical flow and ego-motion estimation from videos. The three components
are coupled by the nature of 3D scene geometry, jointly learned by our
framework in an end-to-end manner. Specifically, geometric relationships are
extracted over the predictions of individual modules and then combined as an
image reconstruction loss, reasoning about static and dynamic scene parts
separately. Furthermore, we propose an adaptive geometric consistency loss to
increase robustness towards outliers and non-Lambertian regions, which resolves
occlusions and texture ambiguities effectively. Experimentation on the KITTI
driving dataset reveals that our scheme achieves state-of-the-art results in
all three tasks, performing better than previous unsupervised methods
and comparably with supervised ones.
Comment: Accepted to CVPR 2018; code will be made available at
https://github.com/yzcjtr/GeoNet
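The geometric coupling can be sketched as follows: given a predicted depth map and ego-motion, synthesize the flow a static scene would induce, which is the quantity an image reconstruction loss can be built on. Known intrinsics K and a relative pose (R, t) are assumed for illustration; this is not the authors' code.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow a static scene would induce under camera motion (R, t)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = (np.linalg.inv(K) @ pix.T).T              # backproject pixels
    X = rays * depth.reshape(-1, 1)                  # 3D points in frame 1
    X2 = (R @ X.T).T + t                             # move into frame 2
    proj = (K @ X2.T).T
    proj = proj[:, :2] / proj[:, 2:3]                # project into frame 2
    return (proj - pix[:, :2]).reshape(H, W, 2)
```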
HDM-Net: Monocular Non-Rigid 3D Reconstruction with Learned Deformation Model
Monocular dense 3D reconstruction of deformable objects is a hard ill-posed
problem in computer vision. Current techniques either require dense
correspondences and rely on motion and deformation cues, or assume a highly
accurate reconstruction (referred to as a template) of at least a single frame
given in advance and operate in the manner of non-rigid tracking. Accurate
computation of dense point tracks often requires multiple frames and might be
computationally expensive. Availability of a template is a very strong prior
which restricts system operation to pre-defined environments and scenarios. In
this work, we propose a new hybrid approach for monocular non-rigid
reconstruction which we call Hybrid Deformation Model Network (HDM-Net). In our
approach, the deformation model is learned by a deep neural network with a
combination of domain-specific loss functions. We train the network with
multiple states of a non-rigidly deforming structure with a known shape at
rest. HDM-Net learns different reconstruction cues including texture-dependent
surface deformations, shading, and contours. We show the generalisability of
HDM-Net to states not present in the training dataset, with unseen textures and under
new illumination conditions. Experiments with noisy data and a comparison with
other methods demonstrate robustness and accuracy of the proposed approach and
suggest possible application scenarios of the new technique in interventional
diagnostics and augmented reality.
Comment: 9 pages, 9 figures
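The abstract names a combination of domain-specific loss functions without listing them. As one hedged illustration only, the sketch below combines a 3D regression term with an isometry prior that keeps edge lengths close to the known rest shape; both terms and the weight lam are our assumptions, not necessarily the paper's.

```python
import numpy as np

def hybrid_loss(pred, gt, rest, edges, lam=0.1):
    """pred, gt, rest: V x 3 vertex arrays; edges: E x 2 vertex index pairs."""
    data = np.mean((pred - gt) ** 2)                 # 3D supervision term
    d_pred = np.linalg.norm(pred[edges[:, 0]] - pred[edges[:, 1]], axis=1)
    d_rest = np.linalg.norm(rest[edges[:, 0]] - rest[edges[:, 1]], axis=1)
    iso = np.mean((d_pred - d_rest) ** 2)            # isometry prior (assumed)
    return data + lam * iso
```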
Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Learning to estimate 3D geometry in a single frame and optical flow from
consecutive frames by watching unlabeled videos with deep convolutional networks
has made significant progress recently. Current state-of-the-art (SoTA) methods
treat the two tasks independently. One typical assumption of existing depth
estimation methods is that the scene contains no independently moving objects,
while object motion can in fact be modeled using optical flow. In this paper,
we propose to address the two tasks as a whole, i.e. to jointly understand
per-pixel 3D geometry and motion. This eliminates the need of static scene
assumption and enforces the inherent geometrical consistency during the
learning process, yielding significantly improved results for both tasks. We
call our method "Every Pixel Counts++" or "EPC++". Specifically, during
training, given two consecutive frames from a video, we adopt three parallel
networks to predict the camera motion (MotionNet), dense depth map (DepthNet),
and per-pixel optical flow between two frames (OptFlowNet) respectively. The
three types of information are fed into a holistic 3D motion parser (HMP), and
the per-pixel 3D motion of both the rigid background and moving objects is
disentangled and recovered. Comprehensive experiments were conducted on
datasets with different scenes, including driving scenarios (KITTI 2012 and
KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D), and synthetic
animation (MPI Sintel dataset). Performance on the five tasks of depth
estimation, optical flow estimation, odometry, moving object segmentation and
scene flow estimation shows that our approach outperforms other SoTA methods.
Code will be available at: https://github.com/chenxuluo/EPC.
Comment: Chenxu Luo, Zhenheng Yang, and Peng Wang contributed equally; TPAMI
submission
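A hedged sketch of the parsing idea behind HMP: compare the full predicted flow with the flow induced by camera motion over the predicted depth (computable as in the rigid_flow sketch for GeoNet above); the residual attributes per-pixel motion to independently moving objects. The threshold tau is an assumption.

```python
import numpy as np

def parse_motion(flow, bg_flow, tau=1.0):
    """flow: H x W x 2 predicted flow; bg_flow: flow induced by camera motion."""
    residual = flow - bg_flow            # what camera motion cannot explain
    moving = np.linalg.norm(residual, axis=-1) > tau   # moving-object mask
    return residual, moving
```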
Deep Part Induction from Articulated Object Pairs
Object functionality is often expressed through part articulation -- as when
the two rigid parts of a pair of scissors pivot against each other to perform the
cutting function. Such articulations are often similar across objects within
the same functional category. In this paper, we explore how the observation of
different articulation states provides evidence for part structure and motion
of 3D objects. Our method takes as input a pair of unsegmented shapes
representing two different articulation states of two functionally related
objects, and induces their common parts along with their underlying rigid
motion. This is a challenging setting: we assume no prior shape structure, no
prior shape category information, and no consistent shape orientation; the
articulation states may belong to objects of different geometry; and the
inputs may be noisy partial scans or point clouds lifted from RGB images.
Our method learns a neural network architecture with three modules that
respectively propose correspondences, estimate 3D deformation flows, and
perform segmentation. To achieve optimal performance, our architecture
alternates between correspondence, deformation flow, and segmentation
prediction iteratively in an ICP-like fashion. Our results demonstrate that our
method significantly outperforms state-of-the-art techniques in the task of
discovering articulated parts of objects. In addition, our part induction is
object-class agnostic and successfully generalizes to new and unseen objects.
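The ICP-like alternation can be summarized schematically as below, with the three modules stubbed out as plain functions; in the paper they are neural networks, and the function names and iteration count here are assumptions.

```python
def induce_parts(P, Q, correspond, estimate_flow, segment, iters=3):
    """P, Q: N x 3 point clouds in two articulation states."""
    seg = None
    for _ in range(iters):
        matches = correspond(P, Q, seg)        # propose correspondences
        flow = estimate_flow(P, Q, matches)    # per-point 3D deformation flow
        seg = segment(P, flow)                 # group points into rigid parts
    return seg, flow
```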
Web Stereo Video Supervision for Depth Prediction from Dynamic Scenes
We present a fully data-driven method to compute depth from diverse monocular
video sequences that contain many non-rigid objects, e.g., people.
In order to learn reconstruction cues for non-rigid scenes, we introduce a new
dataset consisting of stereo videos scraped in the wild. This dataset covers a
wide variety of scene types and features many non-rigid objects,
especially people. From this, we compute disparity maps to be used as
supervision to train our approach. We propose a loss function that allows us to
generate a depth prediction even with unknown camera intrinsics and stereo
baselines in the dataset. We validate the use of large amounts of Internet
video by evaluating our method on existing video datasets with depth
supervision, including SINTEL and KITTI, and show that our approach
generalizes better to natural scenes.
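One hedged reading of such a loss: because the stereo baseline (and hence the disparity scale) is unknown, fit a per-image scale to the prediction in closed form before penalizing, so supervision is enforced only up to scale. This is an illustration, not the paper's exact formulation.

```python
import numpy as np

def scale_invariant_loss(pred_disp, gt_disp, valid):
    """L2 loss after closed-form per-image scale alignment (valid: bool mask)."""
    p, g = pred_disp[valid], gt_disp[valid]
    s = (p @ g) / max(p @ p, 1e-12)              # least-squares scale fit
    return np.mean((s * p - g) ** 2)
```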
DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency
We present an unsupervised learning framework for simultaneously training
single-view depth prediction and optical flow estimation models using unlabeled
video sequences. Existing unsupervised methods often exploit brightness
constancy and spatial smoothness priors to train depth or flow models. In this
paper, we propose to leverage geometric consistency as additional supervisory
signals. Our core idea is that for rigid regions we can use the predicted scene
depth and camera motion to synthesize 2D optical flow by backprojecting the
induced 3D scene flow. The discrepancy between the rigid flow (from depth
prediction and camera motion) and the estimated flow (from optical flow model)
allows us to impose a cross-task consistency loss. While all the networks are
jointly optimized during training, they can be applied independently at test
time. Extensive experiments demonstrate that our depth and flow models compare
favorably with state-of-the-art unsupervised methods.
Comment: ECCV 2018. Project website: http://yuliang.vision/DF-Net/; code:
https://github.com/vt-vl-lab/DF-Net
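The cross-task consistency term can be sketched as follows: in regions assumed rigid, penalize the gap between the flow synthesized from depth and camera motion (as in the rigid_flow sketch for GeoNet above) and the flow network's own prediction. The rigid-region mask is an assumption here.

```python
import numpy as np

def cross_task_consistency(flow_pred, flow_rigid, rigid_mask):
    """Penalize flow disagreement where the scene is assumed rigid."""
    diff = np.linalg.norm(flow_pred - flow_rigid, axis=-1)
    return (diff * rigid_mask).sum() / max(rigid_mask.sum(), 1)
```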
Rigid-Motion Scattering for Texture Classification
A rigid-motion scattering computes adaptive invariants along translations and
rotations, with a deep convolutional network. Convolutions are calculated on
the rigid-motion group, with wavelets defined on the translation and rotation
variables. It preserves joint rotation and translation information, while
providing global invariants at any desired scale. Texture classification is
studied through the characterization of stationary processes from a single
realization. State-of-the-art results are obtained on multiple texture
databases with significant rotation and scaling variability.
Comment: 19 pages; submitted to the International Journal of Computer Vision
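As a toy illustration of the invariance mechanism only (far simpler than the full scattering network), the sketch below averages wavelet moduli over space and over rotation angle, yielding a descriptor invariant to translations and rotations; the Gabor construction and its parameters are assumptions.

```python
import numpy as np

def gabor(size, theta, freq=0.2, sigma=3.0):
    """Oriented complex Gabor wavelet on a size x size grid."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.exp(1j * 2 * np.pi * freq * xr)

def scatter_invariant(img, n_angles=8, size=15):
    """Average wavelet moduli over space and rotation angle."""
    F_img = np.fft.fft2(img)
    coeffs = []
    for theta in np.linspace(0, np.pi, n_angles, endpoint=False):
        W = np.fft.fft2(gabor(size, theta), img.shape)   # pad filter, FFT
        modulus = np.abs(np.fft.ifft2(F_img * W))        # |img conv wavelet|
        coeffs.append(modulus.mean())        # spatial mean: translation inv.
    return float(np.mean(coeffs))            # angle mean: rotation inv.
```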