
    Revisiting Depth Layers from Occlusions

    In this work, we consider images of a scene with a moving object captured by a static camera. As the object (human or otherwise) moves about the scene, it reveals pairwise depth-ordering or occlusion cues. The goal of this work is to use these sparse occlusion cues along with monocular depth occlusion cues to densely segment the scene into depth layers. We cast the problem of depth-layer segmentation as a discrete labeling problem on a spatio-temporal Markov Random Field (MRF) that uses the motion occlusion cues along with monocular cues and a smooth motion prior for the moving object. We quantitatively show that the depth ordering produced by the proposed combination of depth cues from object motion and monocular occlusion cues is superior to using either feature independently, or to a naïve combination of the features.
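
    The labeling objective in such a formulation can be sketched as an energy over the layer labels; the potentials and weights below are generic illustrations of the described cue combination, not the paper's exact model:

        % Illustrative spatio-temporal MRF energy for depth-layer labels L.
        % \phi, \psi are generic potentials; \lambda_1, \lambda_2 are assumed weights.
        E(L) = \sum_{p} \phi_{\mathrm{mono}}(L_p)
             + \lambda_1 \sum_{(p,q) \in \mathcal{O}} \psi_{\mathrm{occ}}(L_p, L_q)
             + \lambda_2 \sum_{p,\,t} \psi_{\mathrm{motion}}(L_p^{t}, L_p^{t+1})

    Here \mathcal{O} is the set of pixel pairs carrying an observed occlusion cue, and minimizing E(L) with a discrete solver such as graph cuts yields the layer segmentation.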

    HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching

    This paper presents HITNet, a novel neural network architecture for real-time stereo matching. Contrary to many recent neural network approaches that operate on a full cost volume and rely on 3D convolutions, our approach does not explicitly build a volume and instead relies on a fast multi-resolution initialization step, differentiable 2D geometric propagation, and warping mechanisms to infer disparity hypotheses. To achieve a high level of accuracy, our network not only reasons geometrically about disparities but also infers slanted-plane hypotheses, allowing it to perform geometric warping and upsampling operations more accurately. Our architecture is inherently multi-resolution, allowing the propagation of information across different levels. Multiple experiments demonstrate the effectiveness of the proposed approach at a fraction of the computation required by state-of-the-art methods. At the time of writing, HITNet ranks 1st-3rd on all the metrics published on the ETH3D website for two-view stereo, ranks 1st on most of the metrics among all the end-to-end learning approaches on Middlebury-v3, and ranks 1st on the popular KITTI 2012 and 2015 benchmarks among published methods faster than 100 ms.
    Comment: The pretrained models used for submission to benchmarks and sample evaluation scripts can be found at https://github.com/google-research/google-research/tree/master/hitnet
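
    The slanted-plane idea admits a small illustration: if each tile stores a disparity plane (d, dx, dy), full-resolution disparity is a plane evaluation per pixel rather than a nearest-neighbour copy. A minimal NumPy sketch under an assumed tile size; HITNet's learned propagation and warping layers are considerably more involved:

        import numpy as np

        def upsample_tile_disparities(d, dx, dy, tile=4):
            """Evaluate per-pixel disparity from per-tile slanted-plane
            hypotheses d, dx, dy, each of shape (Ht, Wt). Sketch only."""
            Ht, Wt = d.shape
            # Pixel offsets inside a tile, measured from the tile centre.
            ys, xs = np.meshgrid(np.arange(tile), np.arange(tile), indexing="ij")
            oy = ys - (tile - 1) / 2.0
            ox = xs - (tile - 1) / 2.0
            # d(x, y) = d0 + dx * ox + dy * oy for every pixel of every tile.
            full = (d[:, :, None, None]
                    + dx[:, :, None, None] * ox
                    + dy[:, :, None, None] * oy)
            # (Ht, Wt, tile, tile) -> (Ht * tile, Wt * tile).
            return full.transpose(0, 2, 1, 3).reshape(Ht * tile, Wt * tile)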

    Multimodal active speaker detection and virtual cinematography for video conferencing

    Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer, based on subjective ratings on a 1-5 scale. The system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques on a dataset of N=100 meetings, each 2-5 minutes in length.
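
    As a rough illustration of the classifier stage only, per-frame features fused from the microphone array, colour, and depth streams can be fed to a boosted ensemble of decision stumps, which is cheap enough to evaluate in real time. Feature names, dimensions, and data here are invented for the sketch:

        import numpy as np
        from sklearn.ensemble import AdaBoostClassifier

        # Hypothetical fused per-candidate features, e.g. sound-source
        # direction agreement, mouth motion, depth-based head position.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(5000, 12))    # 12 assumed audio/video/depth features
        y = rng.integers(0, 2, size=5000)  # 1 = candidate is the active speaker

        asd = AdaBoostClassifier(n_estimators=100)  # boosted decision stumps
        asd.fit(X, y)
        print(asd.predict(X[:3]))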

    Scribble based interactive 3D reconstruction via scene co-segmentation

    In this paper, we present a novel interactive 3D reconstruction algorithm which renders a planar reconstruction of the scene. We consider a scenario where the user has taken a few images of a scene from multiple poses. The goal is to obtain a dense and visually pleasing reconstruction of the scene, including non-planar objects. Using simple user interactions in the form of scribbles indicating the surfaces in the scene, we develop the idea of 3D scribbles to propagate scene geometry across multiple views and perform co-segmentation of all the images into the different surfaces and non-planar objects in the scene. We show that this allows us to render a complete and pleasing reconstruction of the scene along with a volumetric rendering of the non-planar objects. We demonstrate the effectiveness of our algorithm on both outdoor and indoor scenes, including the ability to handle featureless surfaces.
    Index Terms — image-based modeling, interactive 3D reconstruction
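
    The core of the 3D-scribble idea is that a scribble, once lifted onto its scene surface in one view, can be reprojected into every other calibrated view to seed co-segmentation there. A minimal projection sketch with assumed camera matrices; the paper's co-segmentation then grows such seeds into full surface masks:

        import numpy as np

        def propagate_scribble(pts3d, K, R, t):
            """Project lifted 3D scribble points (N, 3) into another view
            with intrinsics K and pose (R, t). Illustrative sketch only."""
            cam = R @ pts3d.T + t.reshape(3, 1)  # world -> camera coordinates
            uvw = K @ cam                        # camera -> homogeneous pixels
            return (uvw[:2] / uvw[2]).T          # (N, 2) pixel positions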

    Learned Monocular Depth Priors in Visual-Inertial Initialization

    Visual-inertial odometry (VIO) is the pose estimation backbone for most AR/VR and autonomous robotic systems today, in both academia and industry. However, these systems are highly sensitive to the initialization of key parameters such as sensor biases, gravity direction, and metric scale. In practical scenarios where high-parallax or variable acceleration assumptions are rarely met (e.g., a hovering aerial robot, or a smartphone AR user not gesticulating with the phone), classical visual-inertial initialization formulations often become ill-conditioned and/or fail to meaningfully converge. In this paper we target visual-inertial initialization specifically for these low-excitation scenarios critical to in-the-wild usage. We propose to circumvent the limitations of classical visual-inertial structure-from-motion (SfM) initialization by incorporating a new learning-based measurement as a higher-level input. We leverage learned monocular depth images (mono-depth) to constrain the relative depth of features, and upgrade the mono-depth to metric scale by jointly optimizing for its scale and shift. Our experiments show a significant improvement in problem conditioning compared to a classical formulation for visual-inertial initialization, and demonstrate significant accuracy and robustness improvements relative to the state-of-the-art on public benchmarks, particularly under motion-restricted scenarios. We further implement this improvement within an existing odometry system to illustrate the impact of our improved initialization method on the resulting tracking trajectories.
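
    The scale-and-shift upgrade has a simple closed-form illustration: if sparse triangulated feature depths correspond to mono-depth predictions, a metric alignment z ≈ a * m + b can be fit by least squares. This toy stand-in ignores that the paper solves for scale and shift jointly inside the initialization optimizer:

        import numpy as np

        def fit_scale_shift(mono, metric):
            """Least-squares a, b with metric ~= a * mono + b, aligning
            affine-invariant mono-depth to sparse metric feature depths."""
            A = np.stack([mono, np.ones_like(mono)], axis=1)
            (a, b), *_ = np.linalg.lstsq(A, metric, rcond=None)
            return a, b

        # Synthetic check: recover a known scale and shift.
        mono = np.linspace(0.1, 1.0, 50)
        print(fit_scale_shift(mono, 3.0 * mono + 0.5))  # ~ (3.0, 0.5)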

    Fusion4D: Real-time Performance Capture of Challenging Scenes

    We contribute a new pipeline for live multi-view performance capture, generating temporally coherent high-quality reconstructions in real-time. Our algorithm supports incremental reconstruction, improving the surface estimation over time, as well as parameterization of the nonrigid scene motion. Our approach is highly robust to both large frame-to-frame motion and topology changes, allowing us to reconstruct extremely challenging scenes. We demonstrate advantages over related real-time techniques that either deform an online generated template or continually fuse depth data nonrigidly into a single reference model. Finally, we show geometric reconstruction results on par with offline methods which require orders of magnitude more processing time and many more RGBD cameras.
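
    For flavour, the volumetric data fusion that such pipelines build on is the classic weighted TSDF update sketched below; Fusion4D additionally warps the reference volume non-rigidly toward each new frame before fusing, which this sketch omits:

        import numpy as np

        def fuse_tsdf(tsdf, weight, new_sdf, new_weight=1.0, max_weight=64.0):
            """Running weighted average of truncated signed distances.
            Sketch of the standard update only, not Fusion4D's pipeline."""
            w_sum = weight + new_weight
            fused = (tsdf * weight + new_sdf * new_weight) / w_sum
            # Cap weights so the model can still adapt to topology changes.
            return fused, np.minimum(w_sum, max_weight)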

    Recovering Depth of a Dynamic Scene Using Real World Motion Prior

    Given a video of a dynamic scene captured using a dynamic camera, we present a method to recover a dense depth map of the scene with a focus on estimating the depth of the dynamic objects. We assume that the static portions of the scene help estimate the pose of the cameras. We recover a dense depth map of the scene via a plane-sweep stereo approach. The relative motion of the dynamic object in the scene, however, results in an inaccurate depth estimate. Estimating the accurate depth of the dynamic object is an ambiguous problem, since both the depth and the real-world speed of the object are unknown. In this work, we show that by using occlusions and putting constraints on the speed of the object, we can bound the depth of the object. We can then incorporate this real-world motion into the plane-sweep stereo framework to obtain a more accurate depth for the dynamic object. We focus on videos with people walking in the scene and show the effectiveness of our approach through quantitative and qualitative results.
    Index Terms — Computer vision, Image sequences, Image sequence analysis, Depth from video
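
    To make the plane-sweep step concrete, each depth hypothesis induces a homography that warps a neighbouring view onto the reference view, and per-pixel photometric error is stacked into a cost volume. A minimal OpenCV sketch with an assumed fronto-parallel plane model; the paper's contribution of bounding dynamic-object depth via occlusions and a speed prior sits on top of this:

        import numpy as np
        import cv2

        def plane_sweep_cost(ref, src, K, R, t, depths):
            """Stack photometric error per depth hypothesis. Sketch only."""
            n = np.array([[0.0, 0.0, 1.0]])  # fronto-parallel plane normal
            Kinv = np.linalg.inv(K)
            h, w = ref.shape[:2]
            costs = []
            for d in depths:
                # Plane-induced homography mapping source pixels onto the
                # reference view for depth d (sign conventions for R, t,
                # and the plane vary with the chosen parameterization).
                H = K @ (R + (t.reshape(3, 1) @ n) / d) @ Kinv
                warped = cv2.warpPerspective(src, H, (w, h))
                costs.append(np.abs(ref.astype(np.float32) -
                                    warped.astype(np.float32)))
            return np.stack(costs)  # (num_depths, H, W[, C]); argmin axis 0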