Virtual Occlusions Through Implicit Depth
For augmented reality (AR), it is important that virtual assets appear to 'sit among' real-world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer's camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth regression as an intermediate step. We instead propose an implicit model for depth and use it to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show that our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes.
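The depth-then-mask pipeline that this paper argues against can be sketched in a few lines: the virtual asset is composited wherever its known depth is smaller than the estimated real-scene depth, so any per-pixel depth error flips mask pixels directly. This is an illustrative sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def occlusion_mask_from_depth(real_depth, virtual_depth):
    # The virtual pixel is visible (occludes the real scene) wherever it is
    # closer to the camera than the estimated real-scene depth. A small error
    # in real_depth near a boundary flips the mask at that pixel.
    return virtual_depth < real_depth

# Toy 1x4 scanline: real scene at 2 m, virtual object at 1.5 m then 3 m.
real = np.array([2.0, 2.0, 2.0, 2.0])
virt = np.array([1.5, 1.5, 3.0, 3.0])
mask = occlusion_mask_from_depth(real, virt)
```

The paper's point is that predicting `mask` directly, rather than via an estimated `real_depth`, avoids amplifying these intermediate depth errors.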
Understanding the Dynamic Visual World: From Motion to Semantics
We live in a dynamic world, which is continuously in motion. Perceiving and interpreting the dynamic surroundings is an essential capability for an intelligent agent. Human beings have the remarkable capability to learn from limited data, with partial or little annotation, in sharp contrast to computational perception models that rely on large-scale, manually labeled data. Reliance on strongly supervised models with manually labeled data inherently prohibits us from modeling the dynamic visual world, as manual annotations are tedious, expensive, and not scalable, especially if we would like to solve multiple scene understanding tasks at the same time. Even worse, in some cases manual annotations are completely infeasible, such as the motion vector of each pixel (i.e., optical flow), since humans cannot reliably produce these types of labeling. In fact, as we move around in a dynamic world, motion information, arising from the moving camera, independently moving objects, and scene geometry, carries abundant information that reveals the structure and complexity of our dynamic visual world. As the famous psychologist James J. Gibson suggested, "we must perceive in order to move, but we also must move in order to perceive". In this thesis, we investigate how to use the motion information contained in unlabeled or partially labeled videos to better understand and synthesize the dynamic visual world.
This thesis consists of three parts. In the first part, we focus on the "move to perceive" aspect. When moving through the world, it is natural for an intelligent agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, far-away mountains don't move much, while nearby trees move a lot. This natural relationship between the appearance of objects and their apparent motion is a rich source of information about the relationship between the distance of objects and their appearance in images. We present a pretext task of estimating the relative depth of elements of a scene (i.e., ordering the pixels in an image according to distance from the viewer), recovered from the motion field of unlabeled videos. The goal of this pretext task is to induce useful feature representations in deep Convolutional Neural Networks (CNNs). These induced representations, learned from 1.1 million video frames crawled from YouTube within one hour and without any manual labeling, provide valuable starting features for training neural networks on downstream tasks. They show promise to match or even surpass what ImageNet pre-training, which requires a huge amount of manual labeling, gives us today on tasks such as semantic image segmentation, since almost all of our training data comes for free.
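The pixel-ordering pretext task described above is commonly trained with a pairwise ordinal loss on predicted depth differences. The logistic formulation below is an illustrative stand-in, not necessarily the thesis's exact loss:

```python
import numpy as np

def relative_depth_loss(pred, i, j, sign):
    # Pairwise ordinal loss over pixel pairs (i, j): sign = +1 means pixel i
    # should be farther than pixel j, -1 the reverse. A logistic penalty on
    # the predicted depth difference enforces the ordering without needing
    # metric ground-truth depth.
    diff = pred[np.asarray(i)] - pred[np.asarray(j)]
    return float(np.log1p(np.exp(-np.asarray(sign) * diff)).mean())

pred = np.array([2.0, 1.0])  # pixel 0 predicted farther than pixel 1
consistent = relative_depth_loss(pred, [0], [1], +1.0)  # ordering agrees
violated = relative_depth_loss(pred, [0], [1], -1.0)    # ordering disagrees
```

Because only the ordering matters, such supervision can be harvested from the motion field of unlabeled video, which is the point of the pretext task.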
In the second part, we study the "perceive to move" aspect. As we humans look around, we do not solve a single vision task at a time. Instead, we perceive our surroundings in a holistic manner, exploiting all visual cues jointly. By solving multiple tasks simultaneously, one task can inform another. Specifically, we propose a neural network architecture, called SENSE, which shares common feature representations among four closely related tasks: optical flow estimation, disparity estimation from stereo, occlusion detection, and semantic segmentation. The key insight is that sharing features makes the network more compact and induces better feature representations. For real-world data, however, annotations for all four tasks are rarely available at the same time. To this end, we design loss functions that exploit the interactions of different tasks without requiring manual annotations, to better handle partially labeled data in a semi-supervised manner, leading to superior understanding of the dynamic visual world.
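The semi-supervised idea of falling back to a label-free loss wherever annotations are missing can be illustrated in a few lines. This is a generic sketch; SENSE's actual objectives are task-specific (e.g. photometric and distillation terms), and all names here are ours:

```python
import numpy as np

def semi_supervised_loss(pred, label, has_label, unsupervised_term):
    # Use the supervised error where an annotation exists, and a label-free
    # loss (e.g. a self-supervised photometric term) everywhere else, so
    # partially labeled batches still provide a gradient at every sample.
    supervised = (pred - label) ** 2
    return float(np.where(has_label, supervised, unsupervised_term).mean())

pred = np.array([1.0, 2.0])
label = np.array([1.0, 0.0])         # second entry is a placeholder (no label)
has_label = np.array([True, False])
unsup = np.array([0.5, 0.5])         # stand-in self-supervised loss values
loss = semi_supervised_loss(pred, label, has_label, unsup)
```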
Understanding the motion contained in a video enables us to perceive the dynamic visual world in a novel manner. In the third part, we present an approach, called SuperSloMo, which synthesizes slow-motion videos from a standard frame-rate video. Converting a plain video into a slow-motion version lets us see memorable moments in our lives that are otherwise hard to see clearly with the naked eye: a difficult skateboard trick, a dog catching a ball, etc. Such a technique also has wide applications, such as generating smooth view transitions on head-mounted virtual reality (VR) devices, compressing videos, and synthesizing videos with motion blur.
P3Depth: Monocular Depth Estimation with a Piecewise Planarity Prior
Monocular depth estimation is vital for scene understanding and downstream
tasks. We focus on the supervised setup, in which ground-truth depth is
available only at training time. Based on knowledge about the high regularity
of real 3D scenes, we propose a method that learns to selectively leverage
information from coplanar pixels to improve the predicted depth. In particular,
we introduce a piecewise planarity prior which states that for each pixel,
there is a seed pixel which shares the same planar 3D surface with the former.
Motivated by this prior, we design a network with two heads. The first head
outputs pixel-level plane coefficients, while the second one outputs a dense
offset vector field that identifies the positions of seed pixels. The plane
coefficients of seed pixels are then used to predict depth at each position.
The resulting prediction is adaptively fused with the initial prediction from
the first head via a learned confidence to account for potential deviations
from precise local planarity. The entire architecture is trained end-to-end
thanks to the differentiability of the proposed modules and it learns to
predict regular depth maps, with sharp edges at occlusion boundaries. An
extensive evaluation of our method shows that we set the new state of the art
in supervised monocular depth estimation, surpassing prior methods on NYU
Depth-v2 and on the Garg split of KITTI. Our method delivers depth maps that
yield plausible 3D reconstructions of the input scenes. Code is available at:
https://github.com/SysCV/P3Depth
Comment: Accepted at CVPR 2022
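The two-head design described in the abstract can be summarized numerically: plane coefficients yield an inverse depth at each pixel, and a learned confidence blends the depth predicted from the seed pixel's plane with the initial per-pixel prediction. The parameterization below is a simplified illustration, not the paper's exact formulation:

```python
import numpy as np

def plane_to_depth(p, u, v):
    # A 3-vector of plane coefficients gives the inverse depth of a planar
    # 3D surface at normalized image coordinates (u, v).
    return 1.0 / (p[0] * u + p[1] * v + p[2])

def fuse_depth(d_initial, d_planar, confidence):
    # Learned-confidence blend (confidence in [0, 1]) of the seed-plane
    # depth with the first head's initial prediction, to tolerate
    # deviations from exact local planarity.
    return confidence * d_planar + (1.0 - confidence) * d_initial

# Fronto-parallel plane at 2 m: inverse depth is constant 0.5 everywhere.
d_plane = plane_to_depth((0.0, 0.0, 0.5), u=0.3, v=0.7)
fused = fuse_depth(d_initial=1.8, d_planar=d_plane, confidence=0.75)
```

With high confidence the planar prediction dominates, which is what produces the regular depth maps with sharp occlusion-boundary edges the abstract describes.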
DeMoN: Depth and Motion Network for Learning Monocular Stereo
In this paper we formulate structure from motion as a learning problem. We
train a convolutional network end-to-end to compute depth and camera motion
from successive, unconstrained image pairs. The architecture is composed of
multiple stacked encoder-decoder networks, the core part being an iterative
network that is able to improve its own predictions. The network estimates not
only depth and motion, but additionally surface normals, optical flow between
the images and confidence of the matching. A crucial component of the approach
is a training loss based on spatial relative differences. Compared to
traditional two-frame structure from motion methods, results are more accurate
and more robust. In contrast to the popular depth-from-single-image networks,
DeMoN learns the concept of matching and, thus, better generalizes to
structures not seen during training.
Comment: Camera-ready version for CVPR 2017. Supplementary material included.
Project page:
http://lmb.informatik.uni-freiburg.de/people/ummenhof/depthmotionnet
Deep Depth From Focus
Depth from focus (DFF) is one of the classical ill-posed inverse problems in
computer vision. Most approaches recover the depth at each pixel based on the
focal setting which exhibits maximal sharpness. Yet, it is not obvious how to
reliably estimate the sharpness level, particularly in low-textured areas. In
this paper, we propose `Deep Depth From Focus (DDFF)' as the first end-to-end
learning approach to this problem. One of the main challenges we face is the
hunger for data of deep neural networks. In order to obtain a significant
amount of focal stacks with corresponding groundtruth depth, we propose to
leverage a light-field camera with a co-calibrated RGB-D sensor. This allows us
to digitally create focal stacks of varying sizes. Compared to existing
benchmarks our dataset is 25 times larger, enabling the use of machine learning
for this inverse problem. We compare our results with state-of-the-art DFF
methods and we also analyze the effect of several key deep architectural
components. These experiments show that our proposed method `DDFFNet' achieves
state-of-the-art performance in all scenes, reducing depth error by more than
75% compared to the classical DFF methods.
Comment: Accepted to the Asian Conference on Computer Vision (ACCV) 2018
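The classical per-pixel baseline that DDFFNet is compared against (pick the focal slice of maximal sharpness, then map its index to the depth it was focused at) can be sketched as follows. The squared-Laplacian focus measure is one common choice, not necessarily the one used by the methods in the paper:

```python
import numpy as np

def sharpness(img):
    # Squared discrete Laplacian as a simple per-pixel focus measure;
    # it is large near in-focus edges and small in blurred regions.
    lap = (-4.0 * img
           + np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1))
    return lap ** 2

def classical_dff(focal_stack, focus_depths):
    # Per pixel, choose the focal slice with maximal sharpness and return
    # the depth that slice was focused at. Unreliable in low-texture areas,
    # which is the failure mode the abstract points out.
    scores = np.stack([sharpness(s) for s in focal_stack])
    return focus_depths[np.argmax(scores, axis=0)]

# Three-slice toy stack: only the middle slice has a sharp feature at (2, 2).
stack = [np.zeros((5, 5)) for _ in range(3)]
stack[1][2, 2] = 1.0
depth = classical_dff(stack, np.array([1.0, 2.0, 3.0]))
```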
Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation
Monocular Depth Estimation (MDE) is a fundamental problem in computer vision
with numerous applications. Recently, LIDAR-supervised methods have achieved
remarkable per-pixel depth accuracy in outdoor scenes. However, significant
errors are typically found in the proximity of depth discontinuities, i.e.,
depth edges, which often hinder the performance of depth-dependent applications
that are sensitive to such inaccuracies, e.g., novel view synthesis and
augmented reality. Since direct supervision for the location of depth edges is
typically unavailable in sparse LIDAR-based scenes, encouraging the MDE model
to produce correct depth edges is not straightforward. In this work we propose
to learn to detect the location of depth edges from densely-supervised
synthetic data, and use it to generate supervision for the depth edges in the
MDE training. Despite the 'domain gap' between synthetic and real data, we
show that depth edges that are estimated directly are significantly more
accurate than the ones that emerge indirectly from the MDE training. To
quantitatively evaluate our approach, and due to the lack of depth edges ground
truth in LIDAR-based scenes, we manually annotated subsets of the KITTI and the
DDAD datasets with depth edges ground truth. We demonstrate significant gains
in the accuracy of the depth edges with comparable per-pixel depth accuracy on
several challenging datasets.
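On dense (e.g. synthetic) depth maps, extracting depth-edge supervision can be as simple as thresholding the depth gradient magnitude. The paper learns an edge detector rather than using a fixed rule, but this sketch shows the kind of edge map involved (the threshold value is illustrative):

```python
import numpy as np

def depth_edge_map(depth, threshold):
    # Mark pixels whose depth gradient magnitude exceeds a threshold; on
    # dense depth this localizes the discontinuities ("depth edges") where
    # LIDAR-supervised MDE models typically err.
    gy, gx = np.gradient(depth)
    return np.hypot(gx, gy) > threshold

# Step edge between a near plane (1 m) and a far plane (5 m).
depth = np.concatenate([np.full((4, 4), 1.0), np.full((4, 4), 5.0)], axis=1)
edges = depth_edge_map(depth, threshold=1.0)
```

Sparse LIDAR returns make this kind of direct thresholding unreliable, which is why the abstract turns to densely supervised synthetic data for the edge labels.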