Optical Flow in Mostly Rigid Scenes
The optical flow of natural scenes is a combination of the motion of the
observer and the independent motion of objects. Existing algorithms typically
focus on either recovering motion and structure under the assumption of a
purely static world or optical flow for general unconstrained scenes. We
combine these approaches in an optical flow algorithm that estimates an
explicit segmentation of moving objects from appearance and physical
constraints. In static regions we take advantage of strong constraints to
jointly estimate the camera motion and the 3D structure of the scene over
multiple frames. This allows us to also regularize the structure instead of the
motion. Our formulation uses a Plane+Parallax framework, which works even under
small baselines, and reduces the motion estimation to a one-dimensional search
problem, resulting in more accurate estimation. In moving regions the flow is
treated as unconstrained, and computed with an existing optical flow method.
The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art
results on both the MPI-Sintel and KITTI-2015 benchmarks.
Comment: 15 pages, 10 figures; accepted for publication at CVPR 201
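The one-dimensional search enabled by the Plane+Parallax framework can be pictured as follows: after warping by the dominant-plane homography, the residual flow of any static point lies on the line through the pixel and the epipole, so only a scalar magnitude remains to be estimated. The sketch below is purely illustrative — the function names, the toy epipole, and the quadratic stand-in cost are hypothetical, not MR-Flow's actual implementation:

```python
import numpy as np

def parallax_direction(p, epipole):
    """Unit direction of the residual (plane+parallax) flow at pixel p.

    After warping by the dominant-plane homography, the remaining motion
    of a static point is constrained to the line through the pixel and
    the epipole, so only its magnitude is unknown.
    """
    d = p - epipole
    return d / np.linalg.norm(d)

def search_parallax_magnitude(cost, magnitudes):
    """Brute-force 1D search: pick the magnitude with minimal cost.

    `cost` is any callable mapping a scalar magnitude to a matching
    error (e.g. a photometric difference along the epipolar line).
    """
    errors = [cost(m) for m in magnitudes]
    return magnitudes[int(np.argmin(errors))]

# Toy example: the true parallax magnitude is 2.5 pixels.
p = np.array([120.0, 80.0])
e = np.array([64.0, 48.0])            # hypothetical epipole location
direction = parallax_direction(p, e)
cost = lambda m: (m - 2.5) ** 2       # stand-in for a photometric error
best = search_parallax_magnitude(cost, np.linspace(0.0, 5.0, 101))
```

Because the search is over a single scalar per pixel rather than a 2D flow vector, it is both cheaper and better constrained, which is the source of the accuracy gain claimed above.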
Deep Planar Parallax for Monocular Depth Estimation
Recent research has highlighted the utility of Planar Parallax Geometry in
monocular depth estimation. However, its potential has yet to be fully realized
because networks rely heavily on appearance for depth prediction. Our in-depth
analysis reveals that optical-flow pretraining helps the network better exploit
consecutive-frame information, leading to substantial performance gains.
Additionally, we propose Planar Position Embedding (PPE) to handle dynamic
objects that defy static scene assumptions and to tackle slope variations that
are challenging to differentiate. Comprehensive experiments on autonomous
driving datasets, namely KITTI and the Waymo Open Dataset (WOD), prove that our
Planar Parallax Network (PPNet) significantly surpasses existing learning-based
methods in performance.
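For readers unfamiliar with planar parallax geometry, the core relation links the residual flow left after ground-plane warping to the structure parameter gamma = h/Z (height above the plane over depth), from which depth follows directly. The sketch below uses the classic Plane+Parallax relation mu = gamma * (tz / d_plane) * |p - e|; the variable names and this particular parameterization are illustrative, not taken from PPNet:

```python
def structure_from_parallax(parallax_mag, dist_to_epipole, tz, d_plane):
    """Recover the planar-parallax structure parameter gamma = h / Z.

    Inverts mu = gamma * (tz / d_plane) * |p - e|, where mu is the
    residual parallax magnitude after warping by the ground-plane
    homography, tz the camera translation component, and d_plane the
    distance to the reference plane.
    """
    return parallax_mag * d_plane / (tz * dist_to_epipole)

def depth_from_structure(gamma, height_above_plane):
    """Depth follows from gamma = h / Z, so Z = h / gamma."""
    return height_above_plane / gamma
```

The appeal in driving scenes is that the ground plane is almost always visible, so the reference homography is cheap to obtain and the residual parallax carries the depth signal.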
3D Motion Analysis via Energy Minimization
This work deals with 3D motion analysis from stereo image sequences for driver assistance systems. It consists of two parts: the estimation of motion from the image data and the segmentation of moving objects in the input images. The content can be summarized by the technical term machine visual kinesthesia: the sensation, perception, and cognition of motion.

In the first three chapters, the importance of motion information is discussed for driver assistance systems, for machine vision in general, and for the estimation of ego motion. The next two chapters address motion perception, analyzing the apparent movement of pixels in image sequences for both monocular and binocular camera setups. The obtained motion information is then used to segment moving objects in the input video, so one can clearly follow the thread from analyzing the input images to describing them in terms of stationary and moving objects. Finally, I present possibilities for future applications based on the contents of this thesis. Previous work in each case is presented in the respective chapters.

Although the overarching issue of motion estimation from image sequences is related to practice, there is nothing as practical as a good theory (Kurt Lewin). Several problems in computer vision are formulated as intricate energy minimization problems. In this thesis, motion analysis in image sequences is thoroughly investigated, showing that splitting an originally complex problem into simplified sub-problems yields improved accuracy, increased robustness, and a clear and accessible approach to state-of-the-art motion estimation techniques.

In Chapter 4, optical flow is considered. Optical flow is commonly estimated by minimizing a combined energy consisting of a data term and a smoothness term. These two parts are decoupled, yielding a novel, iterative approach to optical flow. The derived Refinement Optical Flow framework is a clear and straightforward approach to computing the apparent image motion vector field. Furthermore, this currently results in the most accurate motion estimation techniques in the literature. Much as this is an engineering approach of fine-tuning precision to the last detail, it helps to gain better insight into the problem of motion estimation. This profoundly contributes to state-of-the-art research in motion analysis, in particular by facilitating the use of motion estimation in a wide range of applications.

In Chapter 5, scene flow is rethought. Scene flow stands for the three-dimensional motion vector field for every image pixel, computed from a stereo image sequence. Again, decoupling the commonly coupled estimation of three-dimensional position and three-dimensional motion yields an approach to scene flow estimation with more accurate results and a considerably lower computational load. It results in a dense scene flow field and enables additional applications based on the dense three-dimensional motion vector field, which are to be investigated in the future.

One such application is the segmentation of moving objects in an image sequence. Detecting moving objects within the scene is one of the most important capabilities to extract from image sequences of a dynamic environment; this is presented in Chapter 6. Scene flow and the segmentation of independently moving objects are only first steps towards machine visual kinesthesia. Throughout this work, I present possible future work to improve the estimation of optical flow and scene flow. Chapter 7 additionally presents an outlook on future research for driver assistance applications. But there is much more to the full understanding of the three-dimensional dynamic scene. This work is meant to inspire the reader to think outside the box and contribute to the vision of building perceiving machines.
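The decoupling idea of Chapter 4 can be illustrated in miniature: solve the data term pointwise, then impose smoothness as a separate denoising pass over the flow field, and iterate. The sketch below uses a median filter as the smoothing step and stubs out the data step; it illustrates the decoupling principle only and is not the thesis's actual Refinement Optical Flow solver:

```python
import numpy as np

def smoothness_step(flow, k=3):
    """Smoothness step: median-filter each flow channel.

    Decoupling lets the data term be solved pointwise and the
    smoothness term be applied as a separate denoising pass; a
    median filter is one simple, edge-preserving choice.
    """
    pad = k // 2
    padded = np.pad(flow, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(flow)
    h, w, _ = flow.shape
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + k, x:x + k]
            out[y, x] = np.median(patch.reshape(-1, 2), axis=0)
    return out

def refine_flow(data_flow, iterations=3):
    """Alternate data and smoothness steps.

    The data step is stubbed here: a full solver would re-estimate
    pointwise matches around the smoothed field on each iteration.
    """
    flow = data_flow
    for _ in range(iterations):
        flow = smoothness_step(flow)
    return flow
```

Even this toy version shows the practical benefit: a single outlier in an otherwise constant flow field is removed by the smoothness pass without disturbing its neighbors.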
SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks
Estimating a dense depth map from a single view is geometrically ill-posed,
and state-of-the-art methods rely on learning depth's relation with visual
appearance using deep neural networks. On the other hand, Structure from Motion
(SfM) leverages multi-view constraints to produce very accurate but sparse
maps, as accurate matching across images is limited by locally discriminative
texture. In this work, we combine the strengths of both approaches by proposing
a novel test-time refinement (TTR) method, denoted as SfM-TTR, that boosts the
performance of single-view depth networks at test time using SfM multi-view
cues. Specifically, and differently from the state of the art, we use sparse
SfM point clouds as test-time self-supervisory signal, fine-tuning the network
encoder to learn a better representation of the test scene. Our results show
how adding SfM-TTR to several state-of-the-art self-supervised and supervised
networks significantly improves their performance, outperforming previous TTR
baselines mainly based on photometric multi-view consistency.
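The test-time self-supervision described above can be sketched as a loss evaluated only at the sparse SfM points, after a median scale alignment to resolve the monocular scale ambiguity. This is an illustrative sketch; the exact loss and alignment used by SfM-TTR may differ:

```python
import numpy as np

def sfm_ttr_loss(pred_depth, sfm_depth, valid):
    """Self-supervisory loss from a sparse SfM point cloud.

    SfM depths are accurate but sparse and scale-ambiguous, so we
    first align scales with a median ratio, then penalize the
    absolute log-depth error only where SfM points exist. During
    test-time refinement, this loss would drive gradient updates of
    the network encoder.
    """
    scale = np.median(sfm_depth[valid] / pred_depth[valid])
    err = np.abs(np.log(scale * pred_depth[valid]) - np.log(sfm_depth[valid]))
    return err.mean()
```

Because only the encoder is fine-tuned against this signal, the network adapts its representation of the test scene while the sparse points anchor the geometry.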
Learning Depth With Very Sparse Supervision
Motivated by the astonishing capabilities of natural intelligent agents and
inspired by theories from psychology, this paper explores the idea that
perception gets coupled to 3D properties of the world via interaction with the
environment. Existing works for depth estimation require either massive amounts
of annotated training data or some form of hard-coded geometrical constraint.
This paper explores a new approach to learning depth perception requiring
neither of those. Specifically, we train a specialized global-local network
architecture with what would be available to a robot interacting with the
environment: from extremely sparse depth measurements down to even a single
pixel per image. From a pair of consecutive images, our proposed network
outputs a latent representation of the observer's motion between the images and
a dense depth map. Experiments on several datasets show that, when ground truth
is available even for just one of the image pixels, the proposed network can
learn monocular dense depth estimation up to 22.5% more accurately than
state-of-the-art approaches. We believe that this work, beyond its scientific
interest, lays the foundations for learning depth from extremely sparse
supervision, which can be valuable to all robotic systems acting under severe
bandwidth or sensing constraints.
Comment: Accepted for publication at the IEEE Robotics and Automation Letters
(RA-L) 2020, and International Conference on Intelligent Robots and Systems
(IROS) 202
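The extreme sparsity of the supervision is easy to picture: the network still predicts a dense depth map, but the loss touches only the annotated locations, down to a single pixel per image, and the dense structure must emerge from the architecture and the image pair. A minimal, hypothetical sketch (not the paper's exact loss):

```python
import numpy as np

def single_pixel_depth_loss(pred_depth, gt_value, gt_yx):
    """Supervision from a single annotated pixel per image.

    The prediction is a full dense map, but gradients flow only
    through the one location where ground truth exists; everything
    else is constrained indirectly by the shared network weights.
    """
    y, x = gt_yx
    return abs(pred_depth[y, x] - gt_value)
```

In practice such a loss would be averaged over a training batch, one annotated pixel per image.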
Event-aided Direct Sparse Odometry
We introduce EDS, a direct monocular visual odometry method using events and frames.
Our algorithm leverages the event generation model to track the camera motion
in the blind time between frames. The method formulates a direct probabilistic
model of observed brightness increments. Per-pixel brightness increments are
predicted using a sparse number of selected 3D points and are compared to the
events via the brightness increment error to estimate camera motion. The method
recovers a semi-dense 3D map using photometric bundle adjustment. EDS is the
first method to perform 6-DOF VO using events and frames with a direct
approach. By design, it overcomes the problem of changing appearance in
indirect methods. We also show that, for a target error performance, EDS can
work at lower frame rates than state-of-the-art frame-based VO solutions. This
opens the door to low-power motion-tracking applications where frames are
sparingly triggered "on demand" and our method tracks the motion in between. We
release code and datasets to the public.
Comment: 16 pages, 14 figures, Page: https://rpg.ifi.uzh.ch/ed
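The brightness-increment comparison at the heart of this approach can be illustrated with the standard linearised event generation model, dL ≈ -∇L · v: increments predicted from a candidate motion are compared against increments accumulated from the event stream. The sketch below is a toy illustration with hypothetical names, not EDS's probabilistic formulation:

```python
import numpy as np

def predicted_increment(grad_L, flow):
    """Linearised event generation model: a point moving with flow v
    over a log-brightness gradient grad_L produces a brightness
    increment dL ~= -grad_L . v (per unit time)."""
    return -(grad_L * flow).sum(axis=-1)

def accumulate_events(events, shape, contrast=0.2):
    """Sum signed event polarities into a brightness-increment image;
    each event contributes +/- the contrast threshold C."""
    dL = np.zeros(shape)
    for x, y, pol in events:
        dL[y, x] += contrast * pol
    return dL

def increment_error(grad_L, flow, event_image):
    """Brightness-increment error used to score a candidate camera
    motion (the motion enters through the flow it induces)."""
    return np.abs(predicted_increment(grad_L, flow) - event_image).mean()
```

Minimizing this error over candidate motions is what lets the events bridge the blind time between frames: the correct motion predicts increments that match what the event camera actually reported.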