270 research outputs found

    Fast Multi-frame Stereo Scene Flow with Motion Segmentation

    We propose a new multi-frame method for efficiently computing scene flow (dense depth and optical flow) and camera ego-motion for a dynamic scene observed from a moving stereo camera rig. Our technique also segments out moving objects from the rigid scene. In our method, we first estimate the disparity map and the 6-DOF camera motion using stereo matching and visual odometry. We then identify regions inconsistent with the estimated camera motion and compute per-pixel optical flow only at these regions. This flow proposal is fused with the camera motion-based flow proposal using fusion moves to obtain the final optical flow and motion segmentation. This unified framework benefits all four tasks (stereo, optical flow, visual odometry and motion segmentation), leading to overall higher accuracy and efficiency. Our method is currently ranked third on the KITTI 2015 scene flow benchmark. Furthermore, our CPU implementation runs in 2-3 seconds per frame, which is 1-3 orders of magnitude faster than the top six methods. We also report a thorough evaluation on challenging Sintel sequences with fast camera and object motion, where our method consistently outperforms OSF [Menze and Geiger, 2015], which is currently ranked second on the KITTI benchmark.
    Comment: 15 pages. To appear at IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). Our results were submitted to the KITTI 2015 Stereo Scene Flow Benchmark in November 201
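    The consistency test at the heart of this pipeline admits a compact sketch: under a static-scene hypothesis, optical flow is fully determined by disparity and ego-motion, so residuals between the rigid flow and the measured flow flag candidate moving regions. The following is a minimal numpy illustration, not the paper's code; the intrinsics K, stereo baseline, pose (R, t) and the threshold are assumed inputs.

```python
# Minimal sketch of the ego-motion consistency check: predict the flow a
# static scene would induce, then flag pixels whose measured flow disagrees.
import numpy as np

def rigid_flow(disparity, K, baseline, R, t):
    """Optical flow implied by camera motion alone (static-scene hypothesis)."""
    H, W = disparity.shape
    fx = K[0, 0]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    depth = fx * baseline / np.maximum(disparity, 1e-6)    # Z = f * B / d
    # Back-project pixels to 3D camera coordinates.
    pts = np.stack([u, v, np.ones_like(u)], axis=-1) @ np.linalg.inv(K).T
    pts = pts * depth[..., None]
    # Apply the estimated 6-DOF motion and re-project into the next frame.
    pts2 = pts @ R.T + t
    proj = pts2 @ K.T
    u2 = proj[..., 0] / proj[..., 2]
    v2 = proj[..., 1] / proj[..., 2]
    return np.stack([u2 - u, v2 - v], axis=-1)

def motion_mask(flow_measured, flow_rigid, thresh=3.0):
    """Pixels inconsistent with ego-motion: candidates for per-pixel flow."""
    residual = np.linalg.norm(flow_measured - flow_rigid, axis=-1)
    return residual > thresh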

    Combining Features and Semantics for Low-level Computer Vision

    Visual perception of depth and motion plays a significant role in understanding and navigating the environment. Reconstructing outdoor scenes in 3D and estimating the motion from video cameras are of utmost importance for applications like autonomous driving. The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects still remain challenging today. Striking examples are reflective and textureless surfaces or large motions which cannot be easily recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions, leveraging semantics and contextual information. Firstly, for binocular stereo estimation, we propose to regularize over larger areas of the image using object-category-specific disparity proposals, which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as a non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions. Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape will occur in the scene. This is particularly true for outdoor scenes, where buildings and vehicles often suffer from missing texture or reflections but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows us to reduce noise while completing missing surfaces, as objects of similar shape benefit from all observations for the respective category. Evaluations with respect to LIDAR ground-truth on a novel challenging suburban dataset show the advantages of modeling structural dependencies between objects. Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network's output forms the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.
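    The exhaustive matching step described above reduces to a single matrix product, which is what makes it fast on a GPU. Below is a minimal sketch under assumed shapes and names (not the thesis code): each reference pixel's feature vector is compared against every target pixel's feature vector at once; the resulting cost volume would then serve as the MRF data term.

```python
# Sketch of all-pairs feature matching via one matrix multiplication.
import numpy as np

def matching_cost_volume(feat_ref, feat_tgt):
    """feat_*: (H, W, C) L2-normalized feature maps -> (H*W, H*W) cost volume."""
    H, W, C = feat_ref.shape
    f_ref = feat_ref.reshape(-1, C)    # (H*W, C)
    f_tgt = feat_tgt.reshape(-1, C)    # (H*W, C)
    similarity = f_ref @ f_tgt.T       # every pixel vs. every pixel at once
    return 1.0 - similarity            # cost = 1 - cosine similarity
```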

    Stereo matching with temporal consistency using an upright pinhole model

    Stereo vision, as a subfield of computer vision, has been researched for over 20 years. However, most research efforts have been devoted to single-frame estimation. With the rising interest in autonomous vehicles, more attention should be paid to temporal consistency in stereo matching, since depth estimation in this setting operates on video. In this thesis, temporal consistency in stereo vision is studied in an effort to reduce runtime or increase accuracy by utilizing a simple upright camera model. The camera model is used for disparity prediction, which also serves as an initialization for different stereo matching frameworks such as local methods and belief propagation. In particular, this thesis proposes a new algorithm based on this model and sped-up PatchMatch belief propagation (SPM-BP). The results demonstrate that the proposed method can reduce computation and convergence time.
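    As a rough illustration of how a simple camera model can predict disparity over time, the sketch below assumes pure forward translation t_z, a simplification of the upright model described above, and updates disparity values through the depth relation Z = f * B / d. A complete predictor would also reproject each pixel's image position; all names and parameters here are illustrative.

```python
# Sketch: temporally predict disparity under assumed forward ego-motion t_z.
# Depth Z = f*B/d; moving forward by t_z gives Z' = Z - t_z for static points,
# hence a predicted disparity d' = f*B / (f*B/d - t_z).
import numpy as np

def predict_disparity(disp_prev, fx, baseline, t_z, min_depth=0.5):
    """Warp the previous frame's disparity values forward as initialization."""
    depth = fx * baseline / np.maximum(disp_prev, 1e-6)
    depth_new = np.maximum(depth - t_z, min_depth)    # scene moves closer
    return fx * baseline / depth_new
```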

    The compositional character of visual correspondence

    Given two images of a scene, the problem of finding a map relating the points in the two images is known as the correspondence problem. Stereo correspondence is a special case in which corresponding points lie on the same row in the two images; optical flow is the general case. In this thesis, we argue that correspondence is inextricably linked to other problems such as depth segmentation, occlusion detection and shape estimation, and cannot be solved in isolation; each of these problems must be solved concurrently within a compositional framework. We first demonstrate the relationship between correspondence and segmentation in a world devoid of shape, and propose an algorithm based on connected components which solves these two problems simultaneously by matching image pixels. Occlusions are found by using the uniqueness constraint, which forces one pixel in the first image to match exactly one pixel in the second image. Shape is then introduced into the picture, and it is revealed that a horizontally slanted surface is sampled differently by the two cameras of a stereo pair, creating images of different widths. In this scenario, we show that pixel matching must be replaced by interval matching, allowing intervals of different widths in the two images to correspond. A new interval uniqueness constraint is proposed to detect occlusions. Vertical slant is shown to have a qualitatively different character from horizontal slant, requiring vertical consistency constraints based on non-horizontal edges. Complexities which arise in optical flow estimation in the presence of slant are also examined. For greater robustness and flexibility, the algorithm based on connected components is generalized into a diffusion-like process, which allows the use of new local matching metrics we have developed in order to create contrast-invariant and noise-resistant correspondence algorithms. Ultimately, it is shown that temporal information can be used to assign correspondences to occluded areas, which also yields ordinal depth information about the scene, even in the presence of independently moving objects. This information can be used for motion segmentation to detect new types of independently moving objects, which are missed by state-of-the-art methods.
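    One common instantiation of the uniqueness constraint described above is a left-right consistency check: a pixel whose match in the other image does not map back to it violates one-to-one matching and is marked occluded. The sketch below is a simplified stand-in for the thesis' connected-components formulation, with hypothetical inputs disp_left and disp_right.

```python
# Sketch: occlusion detection from the uniqueness (one-to-one) constraint.
import numpy as np

def occlusions_from_uniqueness(disp_left, disp_right, tol=1.0):
    """Mark left-image pixels whose left/right matches disagree as occluded."""
    H, W = disp_left.shape
    u = np.arange(W)[None, :].repeat(H, axis=0)
    v = np.arange(H)[:, None].repeat(W, axis=1)
    # Follow each left pixel to its match in the right image...
    u_right = np.clip(np.round(u - disp_left).astype(int), 0, W - 1)
    # ...and read back the disparity stored at that match.
    d_back = disp_right[v, u_right]
    # Inconsistent round trips violate uniqueness -> occluded.
    return np.abs(disp_left - d_back) > tol
```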

    Real-Time Virtual Viewpoint Generation on the GPU for Scene Navigation


    A holistic approach to structure from motion

    This dissertation investigates the general structure from motion problem: how to compute 3D scene structure, camera motion and moving objects from video sequences in an unconstrained environment. We present a framework which uses concatenated feed-back loops to overcome the main difficulty in the structure from motion problem: the chicken-and-egg dilemma between scene segmentation and structure recovery. The idea is that we compute structure and motion in stages by gradually computing 3D scene information of increasing complexity and using processes which operate on increasingly large spatial image areas. Within this framework, we developed three modules. First, we introduce a new constraint for the estimation of shape using image features from multiple views. We analyze this constraint and show that noise leads to unavoidable mis-estimation of the shape, which also predicts the erroneous shape perception in humans. This insight provides a clear argument for the need for feed-back loops. Second, a novel constraint on shape is developed which allows us to connect multiple frames in the estimation of camera motion by matching only small image patches. Third, we present a texture descriptor for matching areas of extended size. The advantage of this texture descriptor, which is based on fractal geometry, lies in its invariance to any smooth mapping (bi-Lipschitz transform), including changes of viewpoint, illumination and surface distortion. Finally, we apply our framework to the problem of super-resolution imaging. We use the 3D motion estimation together with a novel wavelet-based reconstruction scheme to reconstruct a high-resolution image from a sequence of low-resolution images.
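    The abstract does not specify the fractal descriptor itself; as a generic illustration of fractal-geometry-based texture measurement, the sketch below estimates the box-counting dimension of a binary edge map, one standard quantity such descriptors build on. All names and parameters are illustrative, not the dissertation's method.

```python
# Sketch: box-counting estimate of fractal dimension for a binary 2D mask.
# Assumes a non-empty mask larger than the biggest box size.
import numpy as np

def box_counting_dimension(mask, sizes=(2, 4, 8, 16, 32)):
    """Slope of log(box count) vs. log(1/box size) approximates the dimension."""
    counts = []
    H, W = mask.shape
    for s in sizes:
        # Count boxes of side s containing at least one foreground pixel.
        grid = mask[: H - H % s, : W - W % s].reshape(H // s, s, W // s, s)
        counts.append(np.count_nonzero(grid.any(axis=(1, 3))))
    coeffs = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return coeffs[0]
```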
