HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching
This paper presents HITNet, a novel neural network architecture for real-time
stereo matching. Contrary to many recent neural network approaches that operate
on a full cost volume and rely on 3D convolutions, our approach does not
explicitly build a volume and instead relies on a fast multi-resolution
initialization step, differentiable 2D geometric propagation and warping
mechanisms to infer disparity hypotheses. To achieve a high level of accuracy,
our network not only geometrically reasons about disparities but also infers
slanted plane hypotheses, allowing it to more accurately perform geometric
warping and upsampling operations. Our architecture is inherently
multi-resolution, allowing the propagation of information across different
levels. Multiple
experiments prove the effectiveness of the proposed approach at a fraction of
the computation required by state-of-the-art methods. At the time of writing,
HITNet ranks 1st-3rd on all the metrics published on the ETH3D website for two
view stereo, ranks 1st on most of the metrics among all the end-to-end learning
approaches on Middlebury-v3, and ranks 1st on the popular KITTI 2012 and 2015
benchmarks among the published methods faster than 100 ms.
Comment: The pretrained models used for submission to benchmarks and sample
evaluation scripts can be found at
https://github.com/google-research/google-research/tree/master/hitne
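As an illustrative aside, the slanted-plane representation described in the abstract can be sketched as follows: each tile stores a disparity at its center plus horizontal and vertical disparity gradients, and a dense map follows by evaluating the plane at every pixel of the tile. This is a minimal sketch of the idea, not the paper's implementation; the function name, tile size, and array layout are assumptions.

```python
import numpy as np

def upsample_tile_disparities(d, dx, dy, tile=4):
    """Upsample per-tile disparities using slanted-plane hypotheses.

    d, dx, dy: (H, W) arrays holding, for each tile, the disparity at the
    tile center and its horizontal/vertical gradients (pixels per pixel).
    Returns a dense (H*tile, W*tile) disparity map.
    """
    H, W = d.shape
    out = np.zeros((H * tile, W * tile), dtype=np.float32)
    # Offsets of each output pixel from its tile center, in pixels.
    offs = np.arange(tile) - (tile - 1) / 2.0
    oy, ox = np.meshgrid(offs, offs, indexing="ij")
    for i in range(H):
        for j in range(W):
            # Evaluate the tile's plane d0 + dx*x + dy*y at each offset.
            out[i*tile:(i+1)*tile, j*tile:(j+1)*tile] = (
                d[i, j] + dx[i, j] * ox + dy[i, j] * oy
            )
    return out
```

Because each tile's plane is evaluated independently, this step is cheap and trivially parallel, which is consistent with the real-time focus of the paper.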
Multimodal active speaker detection and virtual cinematography for video conferencing
Active speaker detection (ASD) and virtual cinematography (VC) can
significantly improve the remote user experience of a video conference by
automatically panning, tilting, and zooming a video conferencing camera:
users subjectively rate an expert video cinematographer's video significantly
higher than unedited video. We describe a new automated ASD and VC system that
performs within 0.3 MOS of an expert cinematographer, based on subjective
ratings on a 1-5 scale. This system uses a 4K wide-FOV camera, a depth
camera, and a microphone array; it extracts features from each modality and
trains an ASD using an AdaBoost machine learning system that is very efficient
and runs in real-time. A VC is similarly trained using machine learning to
optimize the subjective quality of the overall experience. To avoid distracting
the room participants and reduce switching latency, the system has no moving
parts -- the VC works by cropping and zooming the 4K wide-FOV video stream. The
system was tuned using extensive crowdsourcing techniques and evaluated on a
dataset of N=100 meetings, each 2-5 minutes in length.
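The "no moving parts" design above amounts to a digital pan/tilt/zoom: the VC selects a crop rectangle inside the 4K wide-FOV stream. A minimal sketch of that geometry, assuming a speaker position and zoom factor as inputs (the function and its defaults are hypothetical, not from the paper):

```python
def speaker_crop(cx, cy, zoom, frame_w=3840, frame_h=2160, aspect=16/9):
    """Digital pan/tilt/zoom: crop rectangle centered on speaker (cx, cy).

    zoom > 1 narrows the field of view; the rectangle keeps the output
    aspect ratio and is clamped to lie fully inside the frame.
    Returns (x0, y0, width, height) in pixels.
    """
    w = frame_w / zoom
    h = w / aspect
    x0 = min(max(cx - w / 2, 0), frame_w - w)
    y0 = min(max(cy - h / 2, 0), frame_h - h)
    return int(x0), int(y0), int(w), int(h)
```

Cropping in software avoids the switching latency and participant distraction of a mechanical PTZ camera, at the cost of resolution in the zoomed view.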
Revisiting Depth Layers from Occlusions
In this work, we consider images of a scene with a moving object captured by a static camera. As the object (human or otherwise) moves about the scene, it reveals pairwise depth-ordering or occlusion cues. The goal of this work is to use these sparse occlusion cues along with monocular depth occlusion cues to densely segment the scene into depth layers. We cast the problem of depth-layer segmentation as a discrete labeling problem on a spatio-temporal Markov Random Field (MRF) that uses the motion occlusion cues along with monocular cues and a smooth motion prior for the moving object. We quantitatively show that the depth ordering produced by the proposed combination of depth cues from object motion and monocular occlusion cues is superior to using either feature independently, or to a naïve combination of the features.
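The pairwise occlusion cue described above can be illustrated with a toy sketch: pixels the moving object covers at one frame and reveals at the next belong to whatever surface lies behind it, yielding a (front, behind) ordering constraint. This is only an illustration of the cue, not the paper's MRF formulation; the function name and the convention that label 0 marks the moving object are assumptions.

```python
import numpy as np

def occlusion_cue(obj_mask_t, obj_mask_t1, seg_labels):
    """Collect pairwise depth-ordering cues from a moving object.

    Pixels inside the object mask at time t but not at time t+1 were
    occluded and then revealed, so the object lies in front of whichever
    segment owns them. Returns a set of (front_label, behind_label)
    pairs, assuming label 0 marks the moving object in seg_labels.
    """
    revealed = obj_mask_t & ~obj_mask_t1
    behind = np.unique(seg_labels[revealed])
    return {(0, int(b)) for b in behind if b != 0}
```

Cues like these would enter the MRF as sparse pairwise constraints, with monocular cues and the smoothness prior filling in the rest of the labeling.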
Learned Monocular Depth Priors in Visual-Inertial Initialization
Visual-inertial odometry (VIO) is the pose estimation backbone for most AR/VR
and autonomous robotic systems today, in both academia and industry. However,
these systems are highly sensitive to the initialization of key parameters such
as sensor biases, gravity direction, and metric scale. In practical scenarios
where high-parallax or variable acceleration assumptions are rarely met (e.g.
hovering aerial robot, smartphone AR user not gesticulating with phone),
classical visual-inertial initialization formulations often become
ill-conditioned and/or fail to meaningfully converge. In this paper we target
visual-inertial initialization specifically for these low-excitation scenarios
critical to in-the-wild usage. We propose to circumvent the limitations of
classical visual-inertial structure-from-motion (SfM) initialization by
incorporating a new learning-based measurement as a higher-level input. We
leverage learned monocular depth images (mono-depth) to constrain the relative
depth of features, and upgrade the mono-depth to metric scale by jointly
optimizing for its scale and shift. Our experiments show a significant
improvement in problem conditioning compared to a classical formulation for
visual-inertial initialization, and demonstrate significant accuracy and
robustness improvements relative to the state-of-the-art on public benchmarks,
particularly under motion-restricted scenarios. We further integrate this
improved initialization into an existing odometry system to illustrate its
impact on the resulting tracking trajectories.
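The metric upgrade step described above, fitting a scale and shift that align mono-depth to metric depth, has a standard closed-form least-squares solution. The sketch below shows that standalone fit; in the paper the scale and shift are optimized jointly within the initialization problem, and the function name here is hypothetical.

```python
import numpy as np

def fit_scale_shift(mono_depth, sparse_metric):
    """Least-squares scale s and shift t aligning mono-depth to metric
    depth: minimize sum_i (s * d_i + t - z_i)^2 over sparse samples.

    mono_depth: (N,) mono-depth values at tracked feature locations.
    sparse_metric: (N,) corresponding metric depths (e.g. from VIO SfM).
    """
    d = np.asarray(mono_depth, dtype=np.float64)
    z = np.asarray(sparse_metric, dtype=np.float64)
    # Linear model z ≈ s*d + t, solved via least squares on [d, 1].
    A = np.stack([d, np.ones_like(d)], axis=1)  # (N, 2) design matrix
    s, t = np.linalg.lstsq(A, z, rcond=None)[0]
    return s, t
```

Two parameters against many feature depths keeps the problem well-conditioned even when motion excitation is low, which is the regime the paper targets.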
Low Compute and Fully Parallel Computer Vision with HashMatch
Numerous computer vision problems such as stereo depth estimation, object-class segmentation and foreground/background segmentation can be formulated as per-pixel image labeling tasks. Given one or many images as input, the desired output of these methods is usually a spatially smooth assignment of labels. The large number of such computer vision problems has led to significant research efforts, with the state of the art moving from CRF-based approaches to deep CNNs and, more recently, hybrids of the two. Although these approaches have significantly advanced the state of the art, the vast majority has solely focused on improving quantitative results and is not designed for low-compute scenarios. In this paper, we present a new general framework for a variety of computer vision labeling tasks, called HashMatch. Our approach is designed to be both fully parallel, i.e. each pixel is independently processed, and low-compute, with a model complexity an order of magnitude less than existing CNN and CRF-based approaches. We evaluate HashMatch extensively on several problems such as disparity estimation, image retrieval, feature approximation and background subtraction, for which HashMatch achieves high computational efficiency while producing high quality results.
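The fully parallel, hash-based flavor of matching described above can be illustrated with a simple random-hyperplane (LSH-style) sketch: each pixel's descriptor is hashed to a short binary code independently of its neighbors, and candidates are compared by Hamming distance. This is a generic illustration, not HashMatch's learned hash functions; the function names below are assumptions.

```python
import numpy as np

def hash_codes(descriptors, hyperplanes):
    """Binary codes via random-hyperplane hashing (LSH-style sketch).

    descriptors: (N, D) per-pixel descriptors; hyperplanes: (B, D).
    Each descriptor is hashed independently of all others, so this step
    is trivially parallel across pixels.
    """
    return (descriptors @ hyperplanes.T) > 0

def hamming(a, b):
    """Hamming distance between two boolean codes."""
    return int(np.count_nonzero(a != b))
```

Matching on short binary codes replaces dense descriptor comparisons with a handful of bit operations per candidate, which is where the order-of-magnitude compute saving comes from.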